Observability is critical to the success of any application. Defining observability is difficult, however. Some people confuse it with monitoring or logging, and others think it’s essentially analysis, which is just one part of observability.
Observability, when done correctly, gives you deep insight into the internals of your system and enables you to ask complex questions that drive improvement, such as:
- Where is your system vulnerable?
- What are you doing well? What are you doing badly?
- What should be the next step in your product roadmap?
- Does code need to be reworked/rewritten?
- Where are your common points of failure?
These are all important questions to ask, and they can be answered with data-driven information gained by implementing good observability practices.
In this article you will learn what observability is, why it is important and what problems observability helps to solve. You’ll also learn some observability best practices and how to implement them so you can start improving your application today.
What is observability?
Observability is how well you know what is happening in your software system without writing new code.
If you were asked which of your microservices are experiencing the most errors, what the worst performing part of your system is, or what the most common frontend failure your customers experience is, could you answer those questions? If your team has to go out and write code to answer them, it’s fair to say your system is not observable. This means that your team is constantly playing a game of whack-a-mole whenever new questions are asked.
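As a minimal sketch of what answering such a question from existing data could look like, the snippet below counts errors per service. The `error_events` list and its field names are hypothetical stand-ins for telemetry your system would already be collecting; no new instrumentation is written to answer the question.

```python
from collections import Counter

# Hypothetical error events already collected from running services.
error_events = [
    {"service": "payments", "error": "Timeout"},
    {"service": "search", "error": "IndexMissing"},
    {"service": "payments", "error": "CardDeclined"},
    {"service": "payments", "error": "Timeout"},
    {"service": "auth", "error": "TokenExpired"},
]

# Count errors per service and pick the service with the most.
errors_by_service = Counter(event["service"] for event in error_events)
worst_service, worst_count = errors_by_service.most_common(1)[0]
print(f"{worst_service} has the most errors ({worst_count})")
```

In a real system the same question would typically be a query against your log or metrics store rather than an in-memory list.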
Why is observability important?
Good observability enables you to achieve data-driven, positive business outcomes. Knowing what to work on, what to improve, and what to ignore can propel your business from success to success and save you time on things that don’t interest your customers or aren’t even real problems, such as offering a language on your website that your customers probably don’t use.
Observability is also critical for new software practices. In recent decades, software systems have become increasingly complex; however, monitoring best practices have not evolved at the same rate. Traditionally, web development was done using something like the LAMP (Linux, Apache, MySQL, PHP/Perl/Python) stack, which is one big database with some middleware, a web layer and a caching layer. The LAMP stack is very simple and fairly trivial to debug. All you need to do is scale all of the above and any issues can be quickly identified, resolved and released due to the monolithic nature of the application.
Now, however, new software offerings, frameworks, paradigms and libraries have greatly increased the complexity of systems through things like cloud infrastructure, distributed microservices, multiple geographic locations, multiple languages, multiple software offerings, and container orchestration technology.
Observability can help you ask and answer important questions about your software system and all the different states it can go through by observing it.
According to Stripe’s The Developer Coefficient report, good observability can save about 42% of a company’s development time, including debugging and refactoring.
What problems can observability help solve?
There are numerous benefits when you follow good observability practices and bake them directly into your software system, including the following:
Releases are faster
When you know more about your system, you can iterate faster. You’ll save your developers days of debugging vague, random problems.
For example, I have experience working for a multi-billion dollar company with millions of concurrent users. One of the tasks of the entire software team was to review the tickets in the support queue and try to resolve them. However, this was an incredibly difficult task. All the team ever got in a ticket was a stack trace and an error log count. This essentially left the developers looking through the code for hours, trying to pinpoint the most likely reason for the error.
There were many cases where the (probable) cause was fixed, passed QA and was released, but the developer’s guess was wrong and the process had to start all over again.
Good observability takes the guesswork out of this process and can provide much more context, data and help to troubleshoot issues in your system.
Incidents become easier to resolve
When you have clear insights and data for key areas of your code and business, you give your developers the context and information they need to get things done.
A company can never fix something it doesn’t measure. This also applies to incidents.
With important information such as the following, you can reduce the mean time to recovery (MTTR) of an incident:
- How do you replicate the incident?
- When does it happen?
- Is there a solution?
- Does a service error occur when you replicate the incident?
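Since a company can only improve what it measures, it helps to track the average recovery time itself. The following is a small sketch of that arithmetic; the incident timestamps are made up for illustration, and `mean_time_to_recovery` is a hypothetical helper rather than part of any library.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected_at, recovered_at).
incidents = [
    (datetime(2022, 8, 1, 9, 0), datetime(2022, 8, 1, 9, 45)),
    (datetime(2022, 8, 3, 14, 0), datetime(2022, 8, 3, 16, 0)),
    (datetime(2022, 8, 5, 22, 30), datetime(2022, 8, 5, 22, 45)),
]


def mean_time_to_recovery(records):
    """Return the average recovery duration across all incidents."""
    if not records:
        raise ValueError("at least one incident is required")
    total = sum((end - start for start, end in records), timedelta())
    return total / len(records)


mttr = mean_time_to_recovery(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 60 minutes
```

Here the three incidents took 45, 120 and 15 minutes to resolve, so the average is 60 minutes. Tracking this number over time shows whether better context is actually shortening recovery.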
It helps you decide what to work on
As mentioned earlier, with the additional information you gain through good observability practices, you can decide what to work on.
For example, if a particular bug affects only 0.001 percent of the customer base, occurs in a rarely used language, and can be easily fixed by an update, it makes sense to focus on more serious system bugs. This gives you the best bang for your buck in terms of the time developers spend on your system, and allows you to focus on solving customer problems and, ultimately, improving the user experience.
When your system is observable, you know what your customers’ biggest frustrations are, and this information can help you improve your product roadmap or bug backlog.
Best Practices for Observability
There are a few best practices to follow when implementing observability, including the following:
Three pillars of observability
Remember the three pillars of observability: logs, metrics, and traces. These are all different types of time series data and can help improve the observability of your system. Using a time series database, such as InfluxDB, makes it easier to work with and use this type of data effectively.
Each of these serves as a useful and important part of your system’s observability. Logs are timestamped records of events that have occurred in your system. Metrics are numerical representations of data measured over time (e.g., 100 customers used your site over a one-hour period). Traces are a representation of related events as they flow through your system (e.g., a customer who lands on your landing page, adds a T-shirt to their cart, and then purchases that shirt).
Each of these provides unique and powerful insights into your system and can help you improve it.
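To make the distinction between logs, metrics and traces concrete, here is a minimal sketch using only the Python standard library. The `Span` class, the `record_step` helper and the in-memory lists are all hypothetical; a real system would send this data to a telemetry backend or time series database instead.

```python
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step of a trace; spans in the same flow share a trace_id."""
    trace_id: str
    name: str
    started_at: float = field(default_factory=time.time)


logs = []            # logs: timestamped records of discrete events
metrics = Counter()  # metrics: numeric values aggregated over time
spans = []           # traces: related steps of one customer journey


def record_step(trace_id, name):
    """Emit a log record, bump a metric and add a span for one step."""
    now = time.time()
    logs.append({"ts": now, "event": name, "trace_id": trace_id})
    metrics[name] += 1
    spans.append(Span(trace_id=trace_id, name=name, started_at=now))


# One customer journey: landing page, cart, purchase.
trace_id = uuid.uuid4().hex
for step in ("view_landing_page", "add_to_cart", "purchase"):
    record_step(trace_id, step)

# The trace reconstructs the journey; the metric counts purchases.
journey = [span.name for span in spans if span.trace_id == trace_id]
print(journey)
print(metrics["purchase"])
```

The same three steps show up in all three forms: as individual log records, as counters, and as an ordered journey tied together by one trace ID.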
Run A/B Testing
A/B testing is an important tool for driving improvements in your product and your code.
By observing your system, you can make changes to or refactor your system and directly measure the customer impact.
An example would be to move your site’s navigation from the footer to the header, where most sites normally place it. From here you can measure the time it takes people to navigate to where they need to go, session duration or purchase time as a direct result of moving your navigation breadcrumb to the header.
You can remove the underperforming version of your test and use your A/B test to boost your positive Key Performance Indicator (KPI) metrics.
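The navigation example can be sketched as a simple comparison of the two variants. The timings below are invented for illustration, and `summarize` is a hypothetical helper. This sketch only compares averages; a real analysis would also need enough samples and a statistical significance check before declaring a winner.

```python
from statistics import mean

# Hypothetical seconds taken to reach the target page, per variant.
navigation_seconds = {
    "footer_nav": [14.0, 11.5, 16.0, 12.5],
    "header_nav": [8.0, 9.5, 7.5, 11.0],
}


def summarize(samples_by_variant):
    """Return the mean navigation time for each variant."""
    return {name: mean(values) for name, values in samples_by_variant.items()}


summary = summarize(navigation_seconds)
winner = min(summary, key=summary.get)  # lower navigation time is better
print(summary)
print(f"Better variant: {winner}")
```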
Don’t throw away the context
To make your system truly observable, you need to maintain as much context as possible. Everything happens within the context of time, and time series data maintains that context. Context is also the metadata around the events you observe. It helps you better understand the whole picture of a problem you are facing and leads to faster solutions.
For example, if your system starts to experience an error at some point, context can be the key to observing and deciphering the cause. If your system only starts throwing the error on Friday, you may realize that it is caused by an automated database backup script that runs at the same time. However, if you haven’t captured the context and information around that particular log, the standalone log is useless. A solution like InfluxDB can help to store, manage and use this kind of data.
Context includes things like the following:
- The time of your event.
- The count of your event.
- The user associated with your event.
- The day of the event.
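One way to keep this context attached to an event is to log it as structured data rather than a bare message. The sketch below assumes a hypothetical `build_event` helper and made-up field names; it uses only the standard `json` and `logging` modules.

```python
import json
import logging
from datetime import datetime, timezone


def build_event(name, user_id, count, when=None):
    """Bundle an event with its time, day, count and user."""
    when = when or datetime.now(timezone.utc)
    return {
        "event": name,
        "time": when.isoformat(),
        "day": when.strftime("%A"),
        "count": count,
        "user_id": user_id,
    }


logging.basicConfig(level=logging.INFO, format="%(message)s")

# Hypothetical error that occurred late on a Friday evening (UTC).
event = build_event(
    "db_timeout",
    user_id="user-123",
    count=3,
    when=datetime(2022, 8, 5, 23, 0, tzinfo=timezone.utc),
)
logging.getLogger("app").info(json.dumps(event))
```

Because the day and time travel with the event, a pattern such as "this only happens on Fridays" can be found by filtering the stored events instead of guessing from a standalone log line.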
Preserve unique identifiers throughout the system
In systems where multiple parts need to communicate, a single event is often known by a different identifier (an alias) in each part.
For example, if your frontend page sends a customer to a payment page, you may have a unique identifier for the customer that is difficult to correlate with the payment they just made. This is considered an anti-pattern.
You need to make sure that all the different parts of your system speak one unified language. If you don’t, you will only achieve observability in one part of your system. As soon as it becomes difficult to correlate one error between two different systems, you have an unobservable system again.
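A common way to give every part of the system one shared language is to create a single identifier at the start of a flow and pass it to each downstream service. The sketch below is a simplified, in-process illustration: the two functions stand in for separate services, and the `events` list, function names and field names are all hypothetical.

```python
import uuid

# Hypothetical in-memory event store shared by both services.
events = []


def payment_charge(customer_id, correlation_id):
    """Payment service reuses the incoming ID instead of minting its own."""
    events.append({
        "service": "payments",
        "action": "charge_created",
        "customer_id": customer_id,
        "correlation_id": correlation_id,
    })


def frontend_checkout(customer_id):
    """Frontend creates one correlation ID and passes it downstream."""
    correlation_id = uuid.uuid4().hex
    events.append({
        "service": "frontend",
        "action": "checkout_started",
        "customer_id": customer_id,
        "correlation_id": correlation_id,
    })
    payment_charge(customer_id, correlation_id)
    return correlation_id


cid = frontend_checkout("customer-42")

# Every event for this checkout can be found with a single ID.
related = [e["service"] for e in events if e["correlation_id"] == cid]
print(related)  # ['frontend', 'payments']
```

Between real services, the identifier would usually travel in a request header or message field, but the principle is the same: no service invents its own ID for an event that already has one.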
Observability versus monitoring
Monitoring and observability are often confused; however, it is important to understand their differences so that you can implement both accurately.
Monitoring deals with known unknowns. For example, if you know you don’t have much information in your API related to your payment backend, you can add logs to it to check that system. Monitoring is generally more reactive and used to monitor a particular part of your system.
Monitoring is important, but differs from observability.
Observability generally deals with unknown unknowns. For example, you may not even know that you don’t have a lot of information in your payment backend system, and this is where observability comes in. You begin to understand your system better, and when you get a deep, intricate picture of your system, you can identify your gaps and where you need to improve.
This is less reactive and is commonly referred to as discovery work.
In this article, you learned about the importance of observability and the common questions that come up around it, such as why it is important and what problems it solves. You also learned how observability and monitoring differ from each other.
Kealan Parr is a senior software engineer at Amber Labs.