A few weeks ago, we faced a problem that any platform engineer dreads: one of our nodes rebooted unexpectedly. The cause? The iDRAC watchdog forcefully terminated it.
But what led iDRAC to decide it was time to shut down the node? A preliminary investigation concluded that there wasn’t any kind of hardware issue. That meant only one thing: the root cause was related to one of our processes running on the node.
Finding the culprit, though, is no small feat: thousands of processes run on a node of that type, and investigating each one would take an eternity. We needed a tool that could catch the offender in the act, and that tool was SystemTap.
For those of you who don’t know, SystemTap (stap) serves as a stethoscope for your Linux kernel. Instead of hearing heartbeats, it allows you to monitor syscalls, kernel functions, and the hidden operations your system performs.
More technically, it allows users to write custom scripts using a specialized scripting language to insert dynamic instrumentation into a running system, providing deep insights into performance bottlenecks, resource usage, and system behavior – all without needing to reboot or recompile the kernel.
Once we identified the tool that best fit our needs, we wrote a small SystemTap script that logs every process closing the /dev/watchdog device, by probing the kernel function watchdog_release:
# Mirror a message into the kernel log so it also shows up in dmesg and the journal.
# printk() is restricted to guru mode, hence the -g flag when running the script.
function output(msg) {
  printk(1, msg);
}

probe begin {
  printf("Starting watchdog-interaction detector\n");
  output("Starting watchdog-interaction detector\n");
}

probe end {
  printf("Stopping watchdog-interaction detector\n");
  output("Stopping watchdog-interaction detector\n");
}

# Fires whenever a process releases (closes) the watchdog device.
probe kernel.function("watchdog_release") {
  msg = sprintf("time %d name %s(%d) %s -> command: %s gid: %d uid: %d\n",
                gettimeofday_ms(), task_execname(task_current()), pid(),
                probemod(), cmdline_str(), gid(), uid());
  printf("%s", msg);
  output(msg);
  print_backtrace();
  output(sprint_backtrace());
}
But we didn’t want to install the kernel’s debug dependencies on the nodes just to run it. Instead, we spun up a privileged container and installed the required development packages there. Since a container shares the kernel of its host, a script started inside the privileged container instruments the node’s kernel directly.
# Install SystemTap
dnf install systemtap
# Ensure the needed dependencies are present
stap-prep
# Run the script in guru mode (-g)
stap -g watchdog_release.stp
Everything was ready; we just needed to wait for our mysterious process to take the bait and touch /dev/watchdog again.
Aaaaand, the bait worked perfectly. A day later the node rebooted once again, but this time, we were ready for it.
A quick look at the kernel messages in the journal (dmesg) revealed the PID of the process!
Aug 14 10:55:59 node06.rdopenshift.si.wp.lan.62.62.10.in-addr.arpa kernel: watchdog-interaction-detector - time 1755168959644 name systemd(453161) kernel -> command: /usr/sbin/init gid: 0 uid: 0
Aug 14 10:55:59 node06 kernel: watchdog_release+0x0 [kernel]
__fput+0x91 [kernel]
task_work_run+0x59 [kernel]
exit_to_user_mode_loop+0x122 [kernel]
exit_to_user_mode_prepare+0xb6 [kernel]
syscall_exit_to_user_mode+0x12 [kernel]
do_syscall_64+0x69 [kernel]
entry_SYSCALL_64_after_hwframe+0x77 [kernel]
Aug 14 10:55:59 node06 kernel: watchdog-interaction-detector - time 1755168959646 name systemd(453161) kernel -> command: /usr/sbin/init gid: 0 uid: 0
Aug 14 10:55:59 node06 kernel: watchdog_release+0x0 [kernel]
__fput+0x91 [kernel]
task_work_run+0x59 [kernel]
exit_to_user_mode_loop+0x122 [kernel]
exit_to_user_mode_prepare+0xb6 [kernel]
syscall_exit_to_user_mode+0x12 [kernel]
do_syscall_64+0x69 [kernel]
entry_SYSCALL_64_after_hwframe+0x77 [kernel]
Aug 14 10:55:59 node06 kernel: watchdog: watchdog0: watchdog did not stop!
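On a busy node the journal can be noisy, so it may help to pull the detector lines out programmatically. The following is a minimal Python sketch, assuming the lines keep exactly the format produced by the probe’s printf() call shown earlier; the names WatchdogEvent, parse_event and culprits are made up for this example and are not part of SystemTap or systemd.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional, Set, Tuple

# Mirrors the probe's format string:
#   "time %d name %s(%d) %s -> command: %s gid: %d uid: %d"
EVENT_RE = re.compile(
    r"time (?P<time_ms>\d+) name (?P<name>.+?)\((?P<pid>\d+)\) "
    r"(?P<module>\S+) -> command: (?P<command>.*) "
    r"gid: (?P<gid>\d+) uid: (?P<uid>\d+)\s*$"
)


@dataclass(frozen=True)
class WatchdogEvent:
    """One watchdog_release hit, as logged by the detector script."""
    when: datetime
    name: str
    pid: int
    module: str
    command: str
    gid: int
    uid: int


def parse_event(line: str) -> Optional[WatchdogEvent]:
    """Return the event encoded in a journal line, or None for any other line."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    return WatchdogEvent(
        # gettimeofday_ms() reports milliseconds since the Unix epoch.
        when=datetime.fromtimestamp(int(m["time_ms"]) / 1000, tz=timezone.utc),
        name=m["name"],
        pid=int(m["pid"]),
        module=m["module"],
        command=m["command"],
        gid=int(m["gid"]),
        uid=int(m["uid"]),
    )


def culprits(lines: Iterable[str]) -> Set[Tuple[str, int]]:
    """Collect the distinct (process name, PID) pairs seen in the given lines."""
    events = (parse_event(line) for line in lines)
    return {(e.name, e.pid) for e in events if e is not None}
```

Backtrace lines and unrelated kernel messages simply do not match and are skipped, so the output of journalctl -k can be fed to culprits() line by line, for example by passing it sys.stdin.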
That’s all we needed to isolate and fix the problem! 😎
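A side note on the last line of that log, "watchdog: watchdog0: watchdog did not stop!": the kernel’s watchdog core prints it when the device is closed while the timer stays armed. Under the standard Linux watchdog API, a daemon that wants to let go of /dev/watchdog cleanly writes the magic character V right before closing it (this has no effect when the driver runs with nowayout). The Python sketch below only illustrates that convention. The helper names are invented for this example, and it is an illustration, not a reconstruction of what happened on the node.

```python
from typing import BinaryIO

# Magic character defined by the Linux watchdog API ("magic close").
MAGIC_CLOSE = b"V"


def keepalive(dev: BinaryIO) -> None:
    """Pet the watchdog: any non-empty write resets its countdown."""
    dev.write(b"\0")
    dev.flush()


def disarm_and_close(dev: BinaryIO) -> None:
    """Write the magic character as the very last write, then close.

    The kernel only honours the request if no other data is written
    between the magic character and close().
    """
    try:
        dev.write(MAGIC_CLOSE)
        dev.flush()
    finally:
        dev.close()
```

A caller would pass in something like open("/dev/watchdog", "wb", buffering=0). Keep in mind that merely opening the device arms the timer, so this is only worth trying on a machine you can afford to see rebooted.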
I shared this brief debugging story to inspire you to give system tracing a try. When you hit a wall with traditional methods like logging or breakpoints, dig deeper. The right tool at the right time can save you hours of frustration and lead you to the root cause faster than you’d expect.
Interested in further reading? You might find this article useful!