One of your servers is reporting moderate CPU usage, enough available memory, and storage that’s busy but not saturated. The dashboard looks healthy, but the application does not.
Requests to it are slowing down, background jobs are taking longer, and latency is rising even though no resource appears fully exhausted.
The problem here is that most infrastructure metrics describe resource consumption. They tell us how much CPU is used, how much memory is allocated and how much I/O is being performed. They don’t directly tell us whether workloads are being forced to wait.
Pressure Stall Information, or PSI, measures that waiting.
PSI is a Linux kernel mechanism that records how much time tasks lose because CPU, memory or I/O resources are unavailable. Instead of asking how busy a resource is, PSI asks whether useful work can continue.
That distinction matters because utilization and contention are not the same thing.
A CPU can be busy without causing a problem. A batch workload running near 100% utilization may still deliver exactly the expected throughput. At the same time, a system with much lower aggregate utilization can suffer latency because a container is constrained, runnable tasks are queued, memory reclaim is active or synchronous I/O is blocking progress.
Traditional metrics describe the state of a resource. PSI describes the effect of resource scarcity on the workload.
Capacity, utilization and pressure describe three different aspects of a system:
Consider CPU utilization. A high value tells us that the processors are busy, but it doesn’t tell us whether runnable tasks are waiting to be scheduled. High utilization may be completely healthy when the system is processing work efficiently.
Conversely, moderate aggregate CPU utilization can hide contention inside a container with a restrictive CPU limit. The host still has capacity, but the processes inside that container cannot use it.
Memory metrics have a similar problem. Linux deliberately uses available memory for caching, so high memory usage is not automatically dangerous. On the other hand, a system can report some available memory while applications are already being delayed by page reclaim and cache churn.
Storage throughput also describes completed activity rather than its effect on applications. A disk may appear to operate below its maximum throughput even while latency-sensitive tasks spend substantial time blocked on synchronous I/O.
PSI does not replace these measurements. It fills the gap between resource activity and application impact.
On a Linux system with PSI enabled, pressure information is exposed through three files:
/proc/pressure/cpu
/proc/pressure/memory
/proc/pressure/io
Reading one of these files produces output similar to the following:
some avg10=2.18 avg60=0.92 avg300=0.31 total=8473291
full avg10=0.47 avg60=0.11 avg300=0.04 total=913442
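Output in this format is easy to turn into structured data. The following is a minimal sketch, not an official API: the function names `parse_psi` and `read_psi` are made up for illustration, and the sample text is the example output shown above.

```python
from typing import Dict

# Example output copied from the article; real values will differ.
SAMPLE = """\
some avg10=2.18 avg60=0.92 avg300=0.31 total=8473291
full avg10=0.47 avg60=0.11 avg300=0.04 total=913442
"""


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a pressure file into {category: {field: value}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        category, pairs = fields[0], fields[1:]
        values: Dict[str, float] = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            # total is an integer counter; the avg* fields are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[category] = values
    return result


def read_psi(resource: str) -> Dict[str, Dict[str, float]]:
    """Read /proc/pressure/<resource> (cpu, memory or io).

    Assumes a Linux kernel with PSI enabled; raises OSError otherwise.
    """
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

On a host with PSI enabled, `read_psi("memory")["some"]["avg10"]` would return the most recent 10-second memory pressure value.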
There are two important pressure categories: some and full.
some pressure
some means that at least one task was stalled while waiting for the resource.
Other tasks may still be running and performing useful work. A non-zero some value therefore does not necessarily indicate a serious incident. It means that contention exists and that part of the workload is being delayed.
For CPU pressure, this usually means that one or more runnable tasks are waiting to be scheduled.
For memory, it means that tasks are being delayed by memory-related work such as reclaim.
For I/O, it means that tasks are waiting for I/O operations to complete.
full pressure
full means that all non-idle tasks in the measured scope were stalled simultaneously.
This represents a more severe condition because the workload as a whole was unable to make progress during that time. Sustained memory or I/O full pressure can indicate thrashing or a serious resource bottleneck.
At the whole-system level, CPU full pressure is a special case. The Linux kernel considers it undefined and reports it as zero for compatibility. For system-wide CPU pressure, some is the relevant signal.
PSI reports averages over three windows:
avg10: approximately the last 10 seconds
avg60: approximately the last 60 seconds
avg300: approximately the last five minutes
These values represent the percentage of wall-clock time during which tasks were stalled.
For example:
some avg10=4.00
means that during the recent 10-second window, at least one task was stalled on that resource for approximately 4% of the elapsed time.
It does not mean that the resource was 4% utilized or that 4% of the processes were affected.
Comparing the time windows helps distinguish short spikes from persistent problems. A high avg10 combined with low avg60 and avg300 usually points to a recent burst. If all three values remain elevated, the pressure is more likely to be sustained.
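The window comparison can be sketched as a small helper. This is only an illustration: the function name `describe_trend` and the `elevated` cut-off of 1.0 are assumptions chosen for the example, not recommended values, and a real rule would need a threshold tuned to the workload.

```python
def describe_trend(avg10: float, avg60: float, avg300: float,
                   elevated: float = 1.0) -> str:
    """Classify PSI averages by comparing the three windows.

    `elevated` is a hypothetical cut-off (in percent) used only for this sketch.
    """
    high10 = avg10 >= elevated
    high60 = avg60 >= elevated
    high300 = avg300 >= elevated

    if high10 and high60 and high300:
        return "sustained"      # all windows elevated
    if high10 and not high60 and not high300:
        return "recent burst"   # only the shortest window is elevated
    if not (high10 or high60 or high300):
        return "quiet"          # nothing above the cut-off
    return "mixed"              # e.g. pressure that is currently easing
```

For instance, `describe_trend(4.0, 0.5, 0.1)` returns `"recent burst"`, while `describe_trend(4.0, 3.5, 3.2)` returns `"sustained"`.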
The total field is a cumulative stall-time counter expressed in microseconds. Monitoring rules can calculate its rate over custom intervals, making it useful for detecting events that may be hidden by the predefined averages.
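As a rough sketch of that calculation: take two readings of total, subtract them, and divide by the elapsed time in microseconds. The helper names below (`stall_percent`, `read_total_us`, `sample_stall_percent`) are hypothetical, and the one-second default interval is just an example.

```python
import time


def stall_percent(total_before_us: int, total_after_us: int,
                  interval_s: float) -> float:
    """Percentage of the interval spent stalled, from two total readings."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (interval_s * 1_000_000)


def read_total_us(path: str, category: str = "some") -> int:
    """Return the total counter (microseconds) for one category of a PSI file."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields and fields[0] == category:
                for pair in fields[1:]:
                    key, _, value = pair.partition("=")
                    if key == "total":
                        return int(value)
    raise ValueError(f"no total for category {category!r} in {path}")


def sample_stall_percent(path: str, interval_s: float = 1.0,
                         category: str = "some") -> float:
    """Measure stall percentage over a custom interval by sampling twice."""
    start = time.monotonic()
    before = read_total_us(path, category)
    time.sleep(interval_s)
    after = read_total_us(path, category)
    elapsed = time.monotonic() - start  # use the measured elapsed time
    return stall_percent(before, after, elapsed)
```

If total grows from 8473291 to 8523291 over one second, tasks were stalled for 50,000 microseconds, which is 5% of that second. A spike this short could be largely smoothed out of avg10.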
Although PSI uses the same general format for all three resources, each pressure type has a different meaning.
CPU pressure appears when tasks are ready to run, but cannot obtain processor time.
Possible causes include:
CPU pressure should be considered together with CPU utilization, throttling and application latency.
High utilization with low pressure usually means the CPUs are busy but still handling the workload effectively. High pressure indicates that tasks are spending meaningful time waiting to execute.
Memory pressure appears when memory scarcity causes tasks to stall.
This may happen during page reclaim, cache churn or other memory-management activity. Common causes include undersized memory limits, excessive working sets, memory leaks or too many workloads competing for the same capacity.
Memory PSI is particularly useful because memory utilization is often ambiguous. High memory usage with almost no pressure may simply mean that Linux is using memory efficiently for caches. Lower apparent usage with sustained pressure may indicate that reclaim is already affecting the application.
Instead of answering the question “How much memory is occupied?”, PSI metrics answer “Is the lack of readily available memory preventing the workload from making progress?”
I/O pressure indicates that tasks are stalled waiting for I/O operations.
Possible causes include storage latency, competing workloads, filesystem writeback, slow network storage or swapping.
I/O PSI should be correlated with device latency, throughput, queue depth and application response time. It tells us that I/O is preventing progress, but it doesn’t identify the exact disk, file or operation responsible.
Host-level utilization can hide problems occurring within individual resource-control boundaries.
A Kubernetes node may have available CPU while one of its containers experiences heavy CPU pressure because its cgroup is limited to a fraction of a core. Similarly, one pod may experience memory pressure even though the node as a whole still appears healthy.
Linux can track PSI for individual cgroups, which makes it possible to compare workload pressure with host pressure.
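With cgroup v2, each cgroup directory exposes `cpu.pressure`, `memory.pressure` and `io.pressure` files in the same format as the files under /proc/pressure. The sketch below compares a workload with its host. It assumes the unified hierarchy is mounted at /sys/fs/cgroup and that per-cgroup PSI is enabled; the function names and the cgroup path used in the usage note are placeholders for illustration.

```python
from pathlib import Path
from typing import Dict

# Assumption: cgroup v2 unified hierarchy mounted at the usual location.
CGROUP_ROOT = Path("/sys/fs/cgroup")
RESOURCES = ("cpu", "memory", "io")


def cgroup_pressure_path(cgroup: str, resource: str,
                         root: Path = CGROUP_ROOT) -> Path:
    """Build the path to <resource>.pressure for a cgroup (relative to root)."""
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource!r}")
    return root / cgroup.strip("/") / f"{resource}.pressure"


def some_avg10(text: str) -> float:
    """Extract the avg10 value of the 'some' line from PSI file contents."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for pair in fields[1:]:
                key, _, value = pair.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' value found")


def compare_with_host(cgroup: str, resource: str) -> Dict[str, float]:
    """Return recent 'some' pressure for a workload cgroup and for the host."""
    workload_text = cgroup_pressure_path(cgroup, resource).read_text()
    host_text = (Path("/proc/pressure") / resource).read_text()
    return {
        "workload": some_avg10(workload_text),
        "host": some_avg10(host_text),
    }
```

A call such as `compare_with_host("kubepods.slice/example", "cpu")`, where the cgroup path is hypothetical, would return both values side by side. High workload pressure next to low host pressure points at the cgroup's own limits rather than at the node.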
This comparison provides useful diagnostic patterns:
This is where PSI becomes especially useful in containerized environments. It allows operators to move beyond the health of the machine and examine whether a particular workload is actually able to progress. Adoption has also become simpler: as of Kubernetes 1.36, PSI metrics are generally available and are enabled and exported by default.
PSI should be displayed alongside existing infrastructure and application metrics, not on a separate dashboard without context.
For CPU investigations, correlate:
For memory, correlate:
For I/O, correlate:
There is no universal PSI threshold that defines an unhealthy system.
A latency-sensitive API, a database and a batch-processing worker may tolerate very different amounts of pressure. Short spikes may also be harmless, while lower but sustained pressure can cause a significant service-level impact.
Good alerts should therefore prefer sustained pressure over isolated samples and, where possible, combine pressure with an application symptom.
For example, CPU pressure together with rising request latency is far more actionable than CPU pressure alone.
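One way to express such a rule is sketched below. Every name and number in it is an assumption made for the example: the `Sample` fields, the 10% pressure threshold, the 500 ms latency threshold and the three-sample window are placeholders, not recommendations.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One monitoring sample; both fields are hypothetical metric names."""
    cpu_some_avg60: float   # CPU 'some' pressure, percent
    p99_latency_ms: float   # application request latency


def should_alert(samples: Sequence[Sample],
                 pressure_threshold: float = 10.0,
                 latency_threshold_ms: float = 500.0,
                 min_consecutive: int = 3) -> bool:
    """Alert only when pressure AND latency stay high for several samples.

    All thresholds are placeholders and must be tuned per workload.
    """
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(
        s.cpu_some_avg60 >= pressure_threshold
        and s.p99_latency_ms >= latency_threshold_ms
        for s in recent
    )
```

With these defaults, three consecutive samples of high pressure and normal latency do not fire, and neither does a single sample in which both are high. The rule fires only when both conditions hold across the whole window.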
PSI detects contention, but it’s not a root-cause detector.
High CPU pressure doesn’t explain whether the problem is inefficient code, excessive concurrency, an undersized CPU limit or poor workload placement. High memory pressure doesn’t distinguish a memory leak from an overly small allocation or cache churn. High I/O pressure doesn’t identify the slow device, query or file operation responsible.
Additional telemetry is still required:
PSI narrows the investigation by showing which resource is preventing progress, and where the impact is occurring.
Infrastructure monitoring has traditionally focused on consumption: how much CPU is used, how much memory is occupied and how much data storage devices transfer.
These measurements remain essential, but they are incomplete.
Applications do not experience utilization percentages. They experience waiting. They wait to be scheduled on a CPU, they wait while the kernel reclaims memory, and they wait for I/O operations to complete. Those delays are where resource contention becomes latency and lost throughput.
Pressure Stall Information makes that lost time measurable.
Its main contribution is not simply another dashboard panel. It introduces a different way to think about resource problems: a resource shortage is not defined only by how much capacity has been consumed. It is defined by whether useful work can continue.
Utilization tells us that a resource is busy. PSI tells us when workloads are paying the price.