Datadog Inc.

11/15/2024 | News release | Distributed by Public on 11/15/2024 11:53

Detect and troubleshoot Windows Blue Screen errors with Datadog

Windows Blue Screen errors, also known as bug checks, STOP codes, kernel errors, or the Blue Screen of Death (BSOD), are triggered when the operating system detects a critical issue that compromises system stability. To prevent further damage or data corruption, the OS determines that the safest course of action is to shut down immediately. The system displays the well-known BSOD and then restarts.

Often, Blue Screen errors are caused by problematic or incompatible device drivers, including updates that introduce bugs or conflicts within the system's kernel. For most organizations, large portions of their infrastructure are installed with a standard set of agents and third-party software. This means that when automated driver updates occur, any issues that create kernel errors can disrupt your entire fleet of hosts and create significant business outages.

In this post, we'll discuss how Datadog can help you detect Blue Screen errors when they occur, as well as troubleshoot their root cause, so you can swiftly recover system operations.

Track system failures with the Datadog Windows Crash integration

In order to quickly resolve Blue Screen errors, you'll first need tools in place to help you detect them. A Blue Screen error is unmistakable when you encounter it on your primary Windows device. However, when you manage a fleet of remote hosts, it's unrealistic to monitor each individually for Blue Screen errors and other system crashes.

Datadog's Windows Crash Detection integration generates an event via the Agent each time it detects a Windows system crash on startup. Using Datadog's Windows Crash Detection monitor template, you can configure out-of-the-box alerts that notify you whenever a crash occurs.

The generated crash event tells you the time at which the crash took place, the offending module responsible for the crash, and a bugcheck code. These three pieces of information can help you surface key context as to why your system crashed in the first place.
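As a rough illustration, the three pieces of information can be modeled as a small record. The class and field names below are hypothetical and are not the schema of the Datadog event; the lookup table holds only a handful of well-known codes from Microsoft's Bug Check Code Reference.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# A few well-known bug check codes. This table is deliberately tiny
# and is not a complete list.
KNOWN_BUGCHECKS = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000003B: "SYSTEM_SERVICE_EXCEPTION",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED",
    0x000000D1: "DRIVER_IRQL_NOT_LESS_OR_EQUAL",
    0x00000116: "VIDEO_TDR_FAILURE",
    0x00000124: "WHEA_UNCORRECTABLE_ERROR",
}


@dataclass(frozen=True)
class CrashEvent:
    """Hypothetical container for the fields a crash event reports."""
    host: str
    crashed_at: datetime
    offending_module: str
    bugcheck_code: int

    def bugcheck_name(self) -> str:
        # Fall back to a hint when the code is not in the small table above.
        return KNOWN_BUGCHECKS.get(
            self.bugcheck_code, "UNKNOWN (see the Bug Check Code Reference)"
        )

    def summary(self) -> str:
        return (
            f"{self.host} crashed at {self.crashed_at.isoformat()} "
            f"in {self.offending_module}: "
            f"0x{self.bugcheck_code:08X} {self.bugcheck_name()}"
        )


# Example values are made up for illustration.
event = CrashEvent(
    host="win-host-01",
    crashed_at=datetime(2024, 11, 15, 3, 10, tzinfo=timezone.utc),
    offending_module="nvlddmkm.sys",
    bugcheck_code=0x116,
)
print(event.summary())
```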

The first piece of information is the time of the crash. While this may initially be easy to overlook, the time of the crash can often help you align it with other ongoing or historical events. For instance, your organization may automate certain driver updates at a specific time of day. If the crash occurred at the time of these automated updates, you can quickly make the connection and roll back the deployment to the previous version. This also applies to incidents: you can check your active and stable incidents in Datadog Incident Management to see whether any overlapped with your crash and could have affected your specific host.
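The time-alignment check described above can be sketched in a few lines. This is a minimal example under stated assumptions: the change names, timestamps, and the 30-minute window are invented for illustration, and in practice the list of changes would come from your own deployment or patching records.

```python
from datetime import datetime, timedelta, timezone


def changes_near_crash(crash_time, changes, window=timedelta(minutes=30)):
    """Return the names of changes that started within `window` before the crash.

    `changes` is an iterable of (name, started_at) pairs with
    timezone-aware datetimes.
    """
    return [
        name
        for name, started_at in changes
        if timedelta(0) <= crash_time - started_at <= window
    ]


# Hypothetical data: one crash and three scheduled changes.
crash = datetime(2024, 11, 15, 3, 10, tzinfo=timezone.utc)
changes = [
    ("gpu-driver-update", datetime(2024, 11, 15, 3, 0, tzinfo=timezone.utc)),
    ("agent-upgrade", datetime(2024, 11, 14, 22, 0, tzinfo=timezone.utc)),
    ("nic-driver-update", datetime(2024, 11, 15, 3, 30, tzinfo=timezone.utc)),
]
suspects = changes_near_crash(crash, changes)
print(suspects)
```

Only changes that began shortly before the crash are returned; a change that started after the crash, or many hours earlier, is left out of the list.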

In many cases, your organization may automate several driver updates at the same time, making it difficult to identify the specific driver responsible for the crash. To narrow your search, the offending module identified in the crash event lists the specific driver file responsible.

First, check for available driver updates. It's possible that other system updates changed your system to be incompatible with the driver at fault. If the issue is tied to recent changes in the newest version of the driver, it's likely that you're not the only one experiencing crashes, and you should look for published reports of the issue. From here, you can either roll back to the previous working version or contact the driver developers for assistance.
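When crashes arrive from many hosts, tallying them by offending module can show whether a single driver is behind a fleet-wide problem. The sketch below assumes you have already collected (host, module) pairs from your crash events; the host names and the example.sys driver are made up.

```python
def modules_by_affected_hosts(crashes):
    """Count distinct hosts per offending module, most widespread first.

    `crashes` is an iterable of (host, offending_module) pairs. Module
    names are compared case-insensitively, since Windows file names are
    not case-sensitive.
    """
    hosts_per_module = {}
    for host, module in crashes:
        hosts_per_module.setdefault(module.lower(), set()).add(host)
    return sorted(
        ((module, len(hosts)) for module, hosts in hosts_per_module.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical crash records gathered from several hosts.
crashes = [
    ("win-host-01", "nvlddmkm.sys"),
    ("win-host-02", "NVLDDMKM.SYS"),
    ("win-host-01", "nvlddmkm.sys"),
    ("win-host-03", "example.sys"),
]
ranking = modules_by_affected_hosts(crashes)
print(ranking)
```

Counting distinct hosts rather than raw crash events keeps one machine that is stuck in a crash loop from dominating the ranking.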

Finally, the bugcheck code can provide more granular reasoning into why the system failed. The Windows Bug Check Code Reference can help you decipher different parameters of the code, as well as note common causes, paths to resolution, and additional remarks.

While the Datadog crash event provides a summarized version of your system failure with the standard information needed for troubleshooting, you may want a complete record of the physical memory used by Windows at the time of the crash. By default, Windows will save a memory dump to your local disk with the file path C:\Windows\MEMORY.DMP, which you can then view with a debugger of your choice, such as WinDbg.

If you still need additional context to narrow your investigation, you can filter Windows Event logs to System logs with "Critical" and "Error" event levels leading up to your system failure to see whether they shed more light on the issue you're encountering.
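One way to pull those System log entries is to query the event log with the built-in wevtutil command, filtering on level 1 (Critical) and level 2 (Error) within a time window before the crash. The following is a sketch rather than a tested procedure for any particular environment: the function names and the one-hour lookback are assumptions, and the query itself only runs on Windows.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def build_system_log_query(crash_time, lookback=timedelta(hours=1)):
    """Build an XPath filter for Critical (1) and Error (2) events
    logged between `crash_time - lookback` and `crash_time`."""

    def fmt(ts):
        # Event log timestamps are compared in UTC.
        return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

    start, end = fmt(crash_time - lookback), fmt(crash_time)
    return (
        "*[System[(Level=1 or Level=2) and "
        f"TimeCreated[@SystemTime>='{start}' and @SystemTime<='{end}']]]"
    )


def query_system_log(crash_time, lookback=timedelta(hours=1)):
    """Run wevtutil against the System log (Windows only) and return text."""
    command = [
        "wevtutil", "qe", "System",
        f"/q:{build_system_log_query(crash_time, lookback)}",
        "/f:text",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


# Building the filter works on any platform; the crash time is made up.
crash = datetime(2024, 11, 15, 3, 10, tzinfo=timezone.utc)
query = build_system_log_query(crash)
print(query)
```

On a Windows host, calling query_system_log(crash) would return the matching events as plain text for review alongside the crash event.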

It's also important to note that while the crash detection event may point to a driver, the underlying cause may have originated from hardware issues. For example, suppose the offending module for a crash is nvlddmkm.sys, the GPU driver on systems with an NVIDIA GPU installed. In addition to checking the driver updates previously mentioned, you'll need to check your host's power draw, CPU usage, remaining memory, and other metrics that can indicate hardware issues at the time of the incident. Since Datadog's universal tagging identifies the host for each crash, you can investigate these hosts in our Host Map, where you can view additional health, network, and process information.

Organizations that self-manage their infrastructure can direct the issue to their IT team to check for faulty hardware and determine whether the problem is isolated to a specific machine. If you rely on cloud-managed hosts, your cloud provider will typically flag faulty hardware and notify you of degradation.

Start monitoring Windows with Datadog

Blue Screen errors can create unexpected disruptions across your network, but with Datadog's integration and infrastructure visibility, you can quickly detect and investigate the drivers or hardware responsible. You can learn more about the Windows Crash Detection integration in our documentation. You can also read about our Windows Kernel Memory integration, which gives you additional visibility into your Windows kernel memory usage.

If you don't already have a Datadog account, sign up for a free 14-day trial today.