What's wrong with software debugging? [Q&A]
We've seen a tidal wave of developer-enabling technologies over the last ten years. From DevOps, to CI/CD, to containers and microservices -- all of these best practices and technology patterns aim to speed up the process of shipping code from the developer into production.
But while software has become increasingly easy to package and deploy, the process of diagnosing and fixing bugs in production has become much more difficult. When services crash in the middle of the night, developers still find themselves in the world of logs, hotfixes and desperation -- but now with much greater surface area to investigate as applications span distributed systems.
We spoke with Ilan Peleg, co-founder and CEO of Tel Aviv-based Lightrun, to learn more about the growing complexity of software debugging, and what his company is doing to try to give developers better tooling.
BN: Why is software debugging today so much harder?
IP: Debuggers were built for single-instance applications. That is, the piece of software running in production was a single, complete entity, with all the pieces required for its proper operation neatly situated next to each other, in the same binary format, on the user’s machine.
But today applications are abstracted from the underlying hardware. This is great for scalability, but it turns the world we work in into one of vastly distributed systems, which makes debugging significantly more difficult. It's now hard to understand not just what is going on, but also where it is going on. It's hard to locate that single piece of binary code that’s misbehaving.
A microservices-based application is spread over many virtualized hosts, each possessing a replica (or multiple replicas) of the services. While containers make it easy to package and distribute software, it becomes very hard to understand not only which service is the problem when there is a bug, but also which replica of the service is the problem.
BN: What's fundamentally different about how Lightrun is approaching the software debugging problem?
IP: All observability solutions today rely on an old ops world paradigm. The idea is basically to log everything we can, then analyze this insane amount of logs later.
Lightrun's view is that instead of sorting through all those logs, you should ask only for the relevant information, when you need it, in real time and on demand. We're the first vendor in the world that allows you to connect to an application in real time and define all sorts of temporal data -- including a wide array of custom metrics -- at the code level of the running applications and with high granularity. Rather than adding more logs into your production application and re-deploying it, we let developers add real-time, read-only logs to the application when an issue occurs. No hotfixes, no rollbacks and no infinite log buckets to sort through.
BN: How is this different from the majority of application performance monitoring and management tools?
IP: For starters, what the APM tools do is display and visualize information relating to the state of the hosts running your application, and a set of information (like logs) that was defined during development. These help answer questions that are in the realm of 'known unknowns' -- it's hard to understand what’s going on in a specific part of the application, and the APM is there to shed light on the situation.
But what about questions that belong to the realm of 'unknown unknowns'? Things that are obviously wrong with the system, but that you couldn't account for during development? Lightrun helps to -- almost surgically -- break apart the black boxes that sit inside your production system with real-time, contextual information that is defined in the present, while the application is running, as opposed to information that was defined in the past, when the application was developed.
The other aspect of debugging we felt needed rethinking is why I -- as a developer who lives inside the IDE and the CLI -- need to go into a separate ops tool to investigate bugs in my running production applications. Developers don't often open application performance monitoring or logging tools (where the logs reside) during their workday -- actually, they often don't even have access to them on a daily basis. And so access to this production information is not natural to them. They do, however, open their IDE every single day.
Our approach at Lightrun is to bring the knowledge closer to them, inside the IDE, instead of pushing it further away. We believe that issues should be investigated within the developer tooling itself -- a concept that can be referred to as 'Shift Left Observability,' meaning that within the software development lifecycle, the tooling on the left of the SDLC (IDEs, CLIs) that developers use to create the software is the ideal location of the debugging solution.
BN: When a developer gets that notification that something's broken in production, what advantage does Lightrun give over other debugging tools that they might have reached for?
IP: Instead of diving head-first into the logs to figure out the code path your application took and the path that resulted in the current issue, developers can gradually add Lightrun logs, metrics and traces to get real-time, code-level information from the running application. This is more like the experience of debugging a local application -- something developers literally do every single day -- and less like assembling a puzzle with missing pieces, trying to make sense of what the full picture might look like.
Photo credit: McIek/Shutterstock