Ideally, you should be using distributed tracing to trace requests through your system, but Kafka decouples producers and consumers, which means there are no direct transactions to trace between them. Kafka also uses asynchronous processes, which have implicit, not explicit, dependencies. That makes it challenging to understand how your microservices are working together.
However, it is possible to monitor your Kafka clusters with distributed tracing and OpenTelemetry. You can then analyze and visualize your traces in an open-source distributed tracing tool like Jaeger or a full observability platform like New Relic. In this post, I will leverage a simple application to show how you can achieve this.
OpenTelemetry typically comes in two flavors. When I talk about these flavors, I like to use a cake analogy: you can either buy a ready-made cake and enjoy it, or buy all the ingredients and bake the cake yourself. With OpenTelemetry, the approach is very similar, and the flavors are:

- Zero-code (automatic) instrumentation, where an agent instruments your application for you without any changes to your source code.
- Manual (code-based) instrumentation, where you use the OpenTelemetry SDK and API in your own code.
The sample application (available in this public GitHub repository) that I am using in this blog is based on this high-level architecture:
It contains these components:

- A Kafka producer service (kafka-java-producer) that publishes messages to a Kafka topic.
- A Kafka consumer service that picks up those messages.
- A downstream service that the consumer calls.
- The Kafka broker that sits between producer and consumer.
Let's start with zero-code instrumentation, aka automatic instrumentation.
Each of the different services contains a `run.sh` script to get the service up and running.
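Stripped down to its essentials, the script looks something like this (a minimal sketch; the agent path, OTLP endpoint, and license key are placeholders rather than the repository's exact values):

```bash
#!/bin/bash
# Attach the OpenTelemetry Java agent to the JVM (agent path is a placeholder)
export JAVA_TOOL_OPTIONS="-javaagent:./opentelemetry-javaagent.jar"

# Export each telemetry signal via OTLP
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_METRICS_EXPORTER="otlp"
export OTEL_LOGS_EXPORTER="otlp"

# Service identity, OTLP endpoint, and New Relic license key header (placeholder values)
export OTEL_SERVICE_NAME="kafka-java-producer"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp.nr-data.net:4317"
export OTEL_EXPORTER_OTLP_HEADERS="api-key=<YOUR_NEW_RELIC_LICENSE_KEY>"

# Start the service (the repository may start it differently)
./mvnw spring-boot:run
```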
The key line is the JAVA_TOOL_OPTIONS export, where the `-javaagent` flag points to the location of the OpenTelemetry Java agent.
The three OTEL_*_EXPORTER lines configure how we want to handle the different telemetry signals. In our case, I define traces, metrics, and logs to be exported via the OpenTelemetry Protocol (OTLP).
There are three additional environment variables that are quite important to configure: OTEL_SERVICE_NAME, which sets the name the service reports under; OTEL_EXPORTER_OTLP_ENDPOINT, which defines where the OTLP data is sent (here, the New Relic OTLP endpoint); and OTEL_EXPORTER_OTLP_HEADERS, which carries the api-key header with the New Relic license key.
This is basically all we need to configure. Everything else is handled by the OpenTelemetry Java agent; there is no need to change anything in our source code.
Let's see what level of visibility into the services we can achieve from zero-code instrumentation.
When navigating to my New Relic account, I can see all of the services reporting in as separate entities.
Let's start by exploring the kafka-java-producer service.
The Summary view offers a great overview of all the most important telemetry and metrics I should be focusing on.
As part of this blog, I am mostly interested in the Distributed Tracing section, so let's dive deeper into this area.
Looking at a single trace lets me see in detail how long this specific trace took to execute and where the time was spent.
We also automatically draw an Entity map of all the different services involved in a given trace.
The area I want to draw your attention to is the trace and span breakdown. You can see how the trace is initiated on the producer, how the consumer then picks up the message, and how the consumer makes two separate calls to the downstream service.
What is interesting here is the span that says "Uninstrumented time". This is code in the consumer where the agent was not able to capture more detailed information about what is going on in its internal methods.
This already shows the limits of zero-code instrumentation. By default, the agent does not instrument every method in your source code; it stops, by design, at a certain level, so you don't get deeper visibility into your own code.
In the previous section, you saw how zero-code instrumentation has some limits when it comes to visibility into your application. This is exactly where manual instrumentation comes into play.
I have set up the same application, but this time no agent at all is attached when starting the application.
I simply use the Maven wrapper to run the application.
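Assuming the standard Spring Boot Maven plugin setup, that amounts to nothing more than:

```bash
# No OpenTelemetry Java agent and no JAVA_TOOL_OPTIONS; just start the app
./mvnw spring-boot:run
```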
The other configuration details then live in my application.properties file.
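A minimal sketch, assuming these property names and placeholder values (the exact entries in the repository may differ):

```properties
# Property names and values are illustrative placeholders
otel.service.name=kafka-java-consumer
otel.exporter.otlp.endpoint=https://otlp.nr-data.net:4317
otel.exporter.otlp.api-key=<YOUR_NEW_RELIC_LICENSE_KEY>
```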
These properties are then used in my Spring Boot application code to configure OpenTelemetry for traces, metrics, and logs.
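For illustration, here is a minimal sketch of how such a configuration bean could look for the traces signal (metrics and logs follow the same pattern with their own OTLP exporters). The class name, bean layout, and property names are assumptions, not necessarily the exact code from the repository:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OpenTelemetryConfig {

    @Bean
    public OpenTelemetry openTelemetry(
            @Value("${otel.exporter.otlp.endpoint}") String endpoint,
            @Value("${otel.exporter.otlp.api-key}") String apiKey,
            @Value("${otel.service.name}") String serviceName) {

        // Identify the service so it shows up as its own entity
        Resource resource = Resource.getDefault().merge(
                Resource.create(Attributes.of(AttributeKey.stringKey("service.name"), serviceName)));

        // Export spans via OTLP over gRPC, authenticating with the license key header
        OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint(endpoint)
                .addHeader("api-key", apiKey)
                .build();

        SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                .setResource(resource)
                .addSpanProcessor(BatchSpanProcessor.builder(spanExporter).build())
                .build();

        // W3C trace context propagation keeps producer and consumer spans in one trace
        return OpenTelemetrySdk.builder()
                .setTracerProvider(tracerProvider)
                .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
                .build();
    }
}
```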
Before I jump into the details of how I implemented some manual instrumentation, let's have a look at the result first.
Do you notice how the span that was previously called out as "Uninstrumented time" now shows much more detailed information? I can now see additional spans for the consumer's internal methods.
The one that says "WhyTheHeckDoWeSleepHere" seems to be taking the most time. No wonder, as the name suggests.
Let's have a look at the source code to reveal the manual instrumentation I put in place.
In the method named ExecuteLongRunningTask, I have created a new span on the current tracer by using the spanBuilder() method.
In addition to that, you may also notice that, just for the fun of it, I created another span called "WhyTheHeckDoWeSleepHere" that contains an artificial unit of work, or rather a sleep instruction on the current thread.
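A minimal sketch of how this can look in code (the class name, the business-logic placeholder, and the sleep duration are illustrative assumptions; only ExecuteLongRunningTask and "WhyTheHeckDoWeSleepHere" come from the actual application):

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class LongRunningTaskService {

    private final Tracer tracer;

    public LongRunningTaskService(OpenTelemetry openTelemetry) {
        // Acquire a tracer from the manually configured OpenTelemetry instance
        this.tracer = openTelemetry.getTracer("kafka-java-consumer");
    }

    public void ExecuteLongRunningTask() throws InterruptedException {
        // Create a span for the whole task using spanBuilder()
        Span taskSpan = tracer.spanBuilder("ExecuteLongRunningTask").startSpan();
        try (Scope scope = taskSpan.makeCurrent()) {
            doSomeWork();

            // A second span wrapping an artificial unit of work: a sleep on the current thread
            Span sleepSpan = tracer.spanBuilder("WhyTheHeckDoWeSleepHere").startSpan();
            try (Scope sleepScope = sleepSpan.makeCurrent()) {
                Thread.sleep(2000); // assumed duration, purely for demonstration
            } finally {
                sleepSpan.end();
            }
        } finally {
            taskSpan.end();
        }
    }

    private void doSomeWork() {
        // Placeholder for the actual business logic
    }
}
```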
Leveraging the OpenTelemetry SDK in this way allows me to get much more specific insights into my application and source code. But, as you can imagine, it also comes with the caveat that I need to add dependencies and custom code to my source code.
I hope I was able to show you how easy it can be to leverage OpenTelemetry to get insights into your application and services. We looked at zero-code instrumentation, which lets you get started without any code changes, but the level of detail may be limited. We then looked at manual instrumentation, which allowed us to be more specific and customize the instrumentation, but the effort to get started is a little higher.
I encourage you to have a look into OpenTelemetry and its fascinating capabilities. Let me know your thoughts and please get in touch if you have any questions or need further information.
Happy coding!