Imagine NASA launching a rocket and successfully landing a rover on Mars. The rover then proceeds to execute its first task on the new world: taking pictures of the surrounding terrain. The atmosphere in the control room here on Earth is tense as everyone eagerly awaits the first images of vast stretches of rusty-red land. Instead, what comes up on the screen is a scary-looking pop-up with the message: “Error capturing image: NullReferenceException”. Now imagine the dread on the face of the programmer who wrote the image-capture code.
I find this easy to relate to because I once deployed a feature with a bug that prevented its adoption telemetry from flowing into my system. Luckily, my service ran on Earth, so I immediately deployed another payload with a hotfix to get the required data, but I’m sure you get the point. I learnt an important lesson that day (thankfully, early in my career): treat feature telemetry and service logs with as much importance as the feature itself.
In this context, it is easy to understand why we care so much about data. Just as nobody would want to land a rover on another planet that cannot send any data back to Earth, nobody would want to ship a feature without a feedback mechanism for measuring its success. And just as we cannot send an astronaut along to track a rocket’s trajectory, we cannot debug an issue in production without the appropriate logs.
Humanity’s recent effort to take flight on another world only reinforces the importance of collecting the right data in complex systems. Behind the scenes of Ingenuity’s first flight on Mars was a technical glitch that was later fixed thanks to the telemetry it successfully downlinked to Earth.
There is already a ton of information out there about telemetry and service logs and how to implement them. So I will instead focus on a few techniques that have helped me get better at this and avoid a “data insufficient” situation:
- List what you want to know. Think about the questions you want answered about your services and features, rather than the data you want collected. Listing the insights that interest you will help you derive, more accurately, what data is important to collect. What you actually collect can then be a superset of the set you derived. This exercise reduces the chances of ending up with insufficient data. For instance, if you want to know whether the default configuration you shipped makes sense to your customers, you then know you want to collect data on how many of them hit the override button to change the default, and what they change it to. That data can then help you redefine your default settings.
- Resist the urge to debug. The most natural thing to do when you encounter a bug is … yeah well, put a breakpoint. This approach to root-cause analysis is not only more time-consuming, but also masks any lack of sufficient logging in your system. So resist the urge to step through the code, and instead use your existing logs to root-cause the issue. If you cannot, that is an indication that your system lacks the necessary logs, and here is your chance to add them. Your system will get progressively better and will one day help you debug the “hard to repro” bugs in production, the kind that can make you want to jump out the window.
- Anticipate what can go wrong. Instead of waiting for something to go wrong to identify missing logs, think ahead about what can possibly fail and what data would help you in those scenarios. Nobody wants things to go wrong, but if they do, you most definitely want to find yourself well equipped with the right data to handle the failure. The best way to get there is to anticipate failures ahead of time and figure out what you will need.
- Test your telemetry and logs. This sounds like a no-brainer but is easy to miss. Test them just like you would test your feature: manual tests, automation, unit tests, whichever fit, and if possible all of them. But do test your data-collection workflows with the same rigor as the feature itself.
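To make the first point concrete, here is a minimal sketch of tracking how often users override a default setting. Everything here is hypothetical stand-in code, not a real telemetry SDK: the `Telemetry` class, the event name `default_overridden`, and the `timeout` setting are all made up for illustration.

```python
from collections import Counter

class Telemetry:
    """Hypothetical in-memory telemetry sink; a real system would ship events to a pipeline."""
    def __init__(self):
        self.events = []

    def track(self, name, **props):
        self.events.append({"name": name, **props})

telemetry = Telemetry()
DEFAULT_TIMEOUT = 30  # the default we want to validate against real usage

def set_timeout(value):
    # Record every override of the default, and what the user chose instead.
    if value != DEFAULT_TIMEOUT:
        telemetry.track("default_overridden", setting="timeout", new_value=value)
    return value

# Simulated usage: three users configure the feature; two override the default.
for chosen in (30, 60, 120):
    set_timeout(chosen)

overrides = [e for e in telemetry.events if e["name"] == "default_overridden"]
print(len(overrides))                              # 2 overrides out of 3 users
print(Counter(e["new_value"] for e in overrides))  # which values users prefer
```

If most users override the default to the same value, that value is probably your new default.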
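And for the last two points, here is a sketch of unit-testing a feature’s failure path together with its telemetry, using Python’s standard `unittest` and `unittest.mock`. The `capture_image` function and the event names are hypothetical; the idea is that the test fails if the telemetry you will need later is missing or malformed.

```python
import unittest
from unittest.mock import MagicMock

def capture_image(camera, telemetry):
    """Feature code: capture an image, reporting success or failure telemetry."""
    try:
        image = camera.capture()
        telemetry.track("image_captured", size=len(image))
        return image
    except Exception as exc:
        # The failure path is exactly where you will need data the most.
        telemetry.track("image_capture_failed", error=type(exc).__name__)
        raise

class TelemetryTests(unittest.TestCase):
    def test_failure_path_emits_telemetry(self):
        camera = MagicMock()
        camera.capture.side_effect = RuntimeError("sensor offline")
        telemetry = MagicMock()
        with self.assertRaises(RuntimeError):
            capture_image(camera, telemetry)
        # Verify the telemetry call itself, not just the feature behavior.
        telemetry.track.assert_called_once_with(
            "image_capture_failed", error="RuntimeError")

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TelemetryTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True: the failure path is covered
```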
Hope this helps. Here’s to easier decisions and not jumping out the window. Happy coding! 🎉
Disclaimer: I’m a software developer who will code for free ice-cream. If you use any of my advice and lose millions of dollars, I’m not responsible in any way.