19. 09. 2025 Alessandro Taufer Development, DevOps

How to Debug Your Kernel Calls

Unexpected reboots, who doesn’t love them?

A few weeks ago, we faced a problem that any platform engineer dreads: one of our nodes rebooted unexpectedly. The cause? The iDRAC watchdog forcefully terminated it.

But what led iDRAC to decide it was time to shut down the node? A preliminary investigation concluded that there wasn’t any kind of hardware issue. That meant only one thing: the root cause was related to one of our processes running on the node.

Finding the culprit, though, is no small feat: thousands of processes run on a node of that type, and investigating each one would take an eternity. We needed a way to catch the offender in the act, and that's when we turned to SystemTap.

What is SystemTap?

For those of you who don’t know, SystemTap (stap) serves as a stethoscope for your Linux kernel. Instead of hearing heartbeats, it allows you to monitor syscalls, kernel functions, and the hidden operations your system performs.

More technically, it allows users to write custom scripts using a specialized scripting language to insert dynamic instrumentation into a running system, providing deep insights into performance bottlenecks, resource usage, and system behavior – all without needing to reboot or recompile the kernel.

Collecting the evidence

Once we identified the tool that best fit our needs, we wrote a small SystemTap script that logs every process releasing the /dev/watchdog device, by probing the kernel function watchdog_release:

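# Mirror a message into the kernel log so it ends up in the journal.
# printk() is a guru-mode function, hence the -g flag when running stap.
function output(msg) {
        printk(1, msg);
}

probe begin { printf("Starting watchdog-interaction detector\n");
              output("Starting watchdog-interaction detector\n");}

probe end { printf("Stopping watchdog-interaction detector\n");
            output("Stopping watchdog-interaction detector\n");}

# Fires each time a process releases (closes) the watchdog device.
probe kernel.function("watchdog_release")
{
    # Who did it: timestamp, executable name, PID, module, command line, gid, uid
    printf("time %d name %s(%d) %s -> command: %s gid: %d uid: %d\n",gettimeofday_ms(),task_execname(task_current()),pid(),probemod(),cmdline_str(), gid(), uid());
    output(sprintf("time %d name %s(%d) %s -> command: %s gid: %d uid: %d\n",gettimeofday_ms(),task_execname(task_current()),pid(),probemod(),cmdline_str(), gid(), uid()));
    # How it got there: kernel backtrace, to stdout and to the kernel log
    print_backtrace();
    output(sprint_backtrace());
}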

We didn’t want to install the kernel’s debug dependencies on the nodes themselves just to run it, though. Instead, we spun up a privileged container and installed the required development packages there. Because a container shares the kernel of its host, a probe loaded from inside the container instruments the node itself.

# Install systemtap
dnf install systemtap

# Ensure the needed dependencies for the running kernel are present
stap-prep

# Run the script in guru mode (-g), which the printk call requires
stap -g watchdog_release.stp

Everything was ready; we just needed to wait for our mysterious process to take the bait and interact with /dev/watchdog again.

Finding the culprit

Aaaaand, the bait worked perfectly. A day later the node rebooted once again, but this time, we were ready for it.

A quick look at the kernel messages in the journal revealed the name and PID of the process!

Aug 14 10:55:59 node06.rdopenshift.si.wp.lan.62.62.10.in-addr.arpa kernel: watchdog-interaction-detector - time 1755168959644 name systemd(453161) kernel -> command: /usr/sbin/init gid: 0 uid: 0
Aug 14 10:55:59 node06 kernel: watchdog_release+0x0 [kernel]
                                                                           __fput+0x91 [kernel]
                                                                           task_work_run+0x59 [kernel]
                                                                           exit_to_user_mode_loop+0x122 [kernel]
                                                                           exit_to_user_mode_prepare+0xb6 [kernel]
                                                                           syscall_exit_to_user_mode+0x12 [kernel]
                                                                           do_syscall_64+0x69 [kernel]
                                                                           entry_SYSCALL_64_after_hwframe+0x77 [kernel]
Aug 14 10:55:59 node06 kernel: watchdog-interaction-detector - time 1755168959646 name systemd(453161) kernel -> command: /usr/sbin/init gid: 0 uid: 0
Aug 14 10:55:59 node06 kernel: watchdog_release+0x0 [kernel]
                                                                           __fput+0x91 [kernel]
                                                                           task_work_run+0x59 [kernel]
                                                                           exit_to_user_mode_loop+0x122 [kernel]
                                                                           exit_to_user_mode_prepare+0xb6 [kernel]
                                                                           syscall_exit_to_user_mode+0x12 [kernel]
                                                                           do_syscall_64+0x69 [kernel]
                                                                           entry_SYSCALL_64_after_hwframe+0x77 [kernel]
Aug 14 10:55:59 node06 kernel: watchdog: watchdog0: watchdog did not stop!

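The log pointed at a systemd process (PID 453161) releasing the device, and the final kernel message shows that the watchdog kept running after that release. That was all we needed to isolate and fix the problem! 😎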
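If the detector fires many times, reading the journal by eye gets tedious. The following is a minimal sketch of a filter that pulls the process name and PID out of the detector lines, assuming the exact line format printed by the script above. The function name extract_culprits is hypothetical, and reading the kernel messages with journalctl -k is an assumption about how the node stores them.

```shell
#!/bin/sh
# Read log lines on stdin and print one "name pid" pair per distinct process.
# Expects the format written by the detector:
#   ... time <ms> name <execname>(<pid>) <module> -> command: ...
# Backtrace lines and unrelated kernel messages do not match and are skipped.
extract_culprits() {
    sed -n 's/.* time [0-9][0-9]* name \(.*\)(\([0-9][0-9]*\)) .* -> command: .*/\1 \2/p' \
        | sort -u
}

# On a live node (assumption: kernel messages are kept in the journal):
#   journalctl -k --no-pager | extract_culprits

# Self-contained demo on two lines taken from the log above.
extract_culprits <<'EOF'
Aug 14 10:55:59 node06 kernel: watchdog-interaction-detector - time 1755168959646 name systemd(453161) kernel -> command: /usr/sbin/init gid: 0 uid: 0
Aug 14 10:55:59 node06 kernel: watchdog_release+0x0 [kernel]
EOF
```

On the sample lines this prints systemd 453161. Repeated events from the same process collapse into a single line, which keeps the output short when the same process releases the device several times in a row.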

Conclusions

I shared this brief debug story to inspire you to use system tracing. When you hit a wall with traditional methods like logging or breakpoints, try digging deeper. The right tool at the right time can save you hours of frustration and lead you to the root cause faster than you’d expect.
