29. 06. 2026 Luigi Miazzo APM, Kubernetes

Beyond Utilization: Understanding Pressure Stall Information

One of your servers is reporting moderate CPU usage, enough available memory, and storage that’s busy but not saturated. The dashboard looks healthy; the application does not.

Requests to it are slowing down, background jobs are taking longer, and latency is rising even though no resource appears fully exhausted.

The problem here is that most infrastructure metrics describe resource consumption. They tell us how much CPU is used, how much memory is allocated and how much I/O is being performed. They don’t directly tell us whether workloads are being forced to wait.

Pressure Stall Information, or PSI, measures that waiting.

PSI is a Linux kernel mechanism that records how much time tasks lose because CPU, memory or I/O resources are unavailable. Instead of asking how busy a resource is, PSI asks whether useful work can continue.

That distinction matters because utilization and contention are not the same thing.

A CPU can be busy without causing a problem. A batch workload running near 100% utilization may still deliver exactly the expected throughput. At the same time, a system with much lower aggregate utilization can suffer latency because a container is constrained, runnable tasks are queued, memory reclaim is active or synchronous I/O is blocking progress.

Traditional metrics describe the state of a resource; PSI describes the effect of resource scarcity on the workload.

Utilization Is Not Pressure

Capacity, utilization and pressure describe three different aspects of a system:

  • Capacity tells us how much of a resource exists: the number of CPU cores, the amount of memory or the capabilities of a storage device
  • Utilization tells us how much of that capacity is currently being consumed
  • Pressure tells us whether tasks are losing time because they cannot access the resource they need

Consider CPU utilization. A high value tells us that the processors are busy, but it doesn’t tell us whether runnable tasks are waiting to be scheduled. High utilization may be completely healthy when the system is processing work efficiently.

Conversely, moderate aggregate CPU utilization can hide contention inside a container with a restrictive CPU limit. The host still has capacity, but the processes inside that container cannot use it.

Memory metrics have a similar problem. Linux deliberately uses otherwise idle memory for caching, so high memory usage is not automatically dangerous. On the other hand, a system can report some available memory while applications are already being delayed by page reclaim and cache churn.

Storage throughput also describes completed activity rather than its effect on applications. A disk may appear to operate below its maximum throughput even while latency-sensitive tasks spend substantial time blocked on synchronous I/O.

PSI does not replace these measurements. It provides the missing link between resource activity and application impact.

What PSI Measures

On a Linux system with PSI enabled, pressure information is exposed through three files:

/proc/pressure/cpu
/proc/pressure/memory
/proc/pressure/io

Reading one of these files produces output similar to the following:

some avg10=2.18 avg60=0.92 avg300=0.31 total=8473291
full avg10=0.47 avg60=0.11 avg300=0.04 total=913442

There are two important pressure categories: some and full.

some pressure

some means that at least one task was stalled while waiting for the resource.

Other tasks may still be running and performing useful work. A non-zero some value therefore does not necessarily indicate a serious incident. It means that contention exists and that part of the workload is being delayed.

For CPU pressure, this usually means that one or more runnable tasks are waiting to be scheduled.

For memory, it means that tasks are being delayed by memory-related work such as reclaim.

For I/O, it means that tasks are waiting for I/O operations to complete.

full pressure

full means that all non-idle tasks in the measured scope were stalled simultaneously.

This represents a more severe condition because the workload as a whole was unable to make progress during that time. Sustained memory or I/O full pressure can indicate thrashing or a serious resource bottleneck.

At the whole-system level, CPU full pressure is a special case. The Linux kernel considers it undefined and reports it as zero for compatibility. For system-wide CPU pressure, some is the relevant signal.
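The file format shown above is simple enough to parse by hand. The following is a minimal sketch, not an official parser: the function name parse_psi is an assumption made for illustration, and the sample text is the output quoted earlier in this section.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {category: {field: value}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        category, pairs = parts[0], parts[1:]
        fields = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # avg* fields are percentages; total is a cumulative counter.
            fields[key] = int(value) if key == "total" else float(value)
        result[category] = fields
    return result


# Sample text copied from the example output in this article.
sample = (
    "some avg10=2.18 avg60=0.92 avg300=0.31 total=8473291\n"
    "full avg10=0.47 avg60=0.11 avg300=0.04 total=913442\n"
)
psi = parse_psi(sample)
print(psi["some"]["avg10"])  # 2.18
print(psi["full"]["total"])  # 913442
```

On a host with PSI enabled, the same function could be fed the contents of /proc/pressure/cpu, /proc/pressure/memory or /proc/pressure/io instead of the sample string.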

Understanding the Numbers

PSI reports averages over three windows:

  • avg10: approximately the last 10 seconds
  • avg60: approximately the last 60 seconds
  • avg300: approximately the last five minutes

These values represent the percentage of wall-clock time during which tasks were stalled.

For example:

some avg10=4.00

means that during the recent 10-second window, at least one task was stalled on that resource for approximately 4% of the elapsed time, or roughly 0.4 seconds.

It does not mean that the resource was 4% utilized or that 4% of the processes were affected.

Comparing the time windows helps distinguish short spikes from persistent problems. A high avg10 combined with low avg60 and avg300 usually points to a recent burst. If all three values remain elevated, the pressure is more likely to be sustained.
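The window comparison can be sketched as a small helper. This is only an illustration: the function name classify_pressure and the 5.0 cutoff are assumptions chosen for the example, not recommended values, and a real threshold depends on the workload.

```python
def classify_pressure(avg10, avg60, avg300, threshold=5.0):
    """Label a PSI reading by comparing its three averaging windows.

    The threshold is a hypothetical cutoff in percent, used only to
    decide what counts as "elevated" in this sketch.
    """
    elevated = [value >= threshold for value in (avg10, avg60, avg300)]
    if all(elevated):
        return "sustained"
    if elevated[0] and not elevated[1] and not elevated[2]:
        return "recent burst"
    if not any(elevated):
        return "quiet"
    return "mixed"


print(classify_pressure(12.0, 1.5, 0.4))  # recent burst
print(classify_pressure(9.0, 8.2, 7.5))   # sustained
print(classify_pressure(0.3, 0.2, 0.1))   # quiet
```

The "mixed" label covers the remaining combinations, for example pressure that was high over the last few minutes but has recently subsided.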

The total field is a cumulative stall-time counter expressed in microseconds. Monitoring rules can calculate its rate over custom intervals, making it useful for detecting events that may be hidden by the predefined averages.
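As a worked example of that calculation, the sketch below derives a stall percentage from two readings of total. The function name stall_percent and the sample numbers are assumptions made for illustration; the only fact taken from the text is that total counts stalled microseconds.

```python
def stall_percent(total_before_us, total_after_us, elapsed_seconds):
    """Percentage of elapsed wall-clock time spent stalled between two samples.

    Both totals are cumulative stall times in microseconds, read from the
    same PSI line (for example the "some" line) at two points in time.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return 100.0 * stalled_us / (elapsed_seconds * 1_000_000)


# Hypothetical samples taken 2 seconds apart: 50,000 microseconds of new stall time.
print(stall_percent(8_473_291, 8_523_291, 2.0))  # 2.5
```

Because the interval is chosen by the caller, a one- or two-second sampling period can surface stalls that the 10-second average would smooth out.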

CPU, Memory and I/O Pressure

Although PSI uses the same general format for all three resources, each pressure type has a different meaning.

CPU pressure

CPU pressure appears when tasks are ready to run, but cannot obtain processor time.

Possible causes include:

  • Too many runnable tasks for the available CPU capacity
  • Contention between workloads on the same host
  • Restrictive container CPU limits
  • Workload placement on an overloaded node
  • Sudden concurrency or traffic increases

CPU pressure should be considered together with CPU utilization, throttling and application latency.

High utilization with low pressure usually means the CPUs are busy but still handling the workload effectively. High pressure indicates that tasks are spending meaningful time waiting to execute.

Memory pressure

Memory pressure appears when memory scarcity causes tasks to stall.

This may happen during page reclaim, cache churn or other memory-management activity. Common causes include undersized memory limits, excessive working sets, memory leaks or too many workloads competing for the same capacity.

Memory PSI is particularly useful because memory utilization is often ambiguous. High memory usage with almost no pressure may simply mean that Linux is using memory efficiently for caches. Lower apparent usage with sustained pressure may indicate that reclaim is already affecting the application.

Instead of answering the question “How much memory is occupied?”, PSI metrics answer “Is the lack of readily available memory preventing the workload from making progress?”

I/O pressure

I/O pressure indicates that tasks are stalled waiting for I/O operations.

Possible causes include storage latency, competing workloads, filesystem writeback, slow network storage or swapping.

I/O PSI should be correlated with device latency, throughput, queue depth and application response time. It tells us that I/O is preventing progress, but it doesn’t identify the exact disk, file or operation responsible.

Why PSI Matters for Containers

Host-level utilization can hide problems occurring within individual resource-control boundaries.

A Kubernetes node may have available CPU while one of its containers experiences heavy CPU pressure because its cgroup is limited to a fraction of a core. Similarly, one pod may experience memory pressure even though the node as a whole still appears healthy.

Linux can track PSI for individual cgroups, which makes it possible to compare workload pressure with host pressure.
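With cgroup v2 mounted, each cgroup directory contains cpu.pressure, memory.pressure and io.pressure files in the same format as the files under /proc/pressure. The sketch below reads the avg10 values from one of them. The function name read_pressure is an assumption, and the demo writes a sample file into a temporary directory so that it runs on any machine; on a real host the directory would be a cgroup path under the cgroup v2 mount point.

```python
import tempfile
from pathlib import Path


def read_pressure(cgroup_dir, resource):
    """Return {category: avg10} from <cgroup_dir>/<resource>.pressure."""
    path = Path(cgroup_dir) / f"{resource}.pressure"
    values = {}
    for line in path.read_text().splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        fields = dict(pair.split("=", 1) for pair in parts[1:])
        values[parts[0]] = float(fields["avg10"])
    return values


# Demo with made-up numbers; a temporary directory stands in for a cgroup.
with tempfile.TemporaryDirectory() as fake_cgroup:
    Path(fake_cgroup, "cpu.pressure").write_text(
        "some avg10=31.50 avg60=28.10 avg300=25.70 total=123456789\n"
        "full avg10=12.25 avg60=10.00 avg300=9.40 total=45678901\n"
    )
    demo = read_pressure(fake_cgroup, "cpu")

print(demo)  # {'some': 31.5, 'full': 12.25}
```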

This comparison provides useful diagnostic patterns:

  • High container pressure and low node pressure: The workload may be constrained by its own limit or configuration.
  • High pressure in many containers and at node level: The node may be overloaded or overcommitted.
  • High utilization and low pressure: The resource may be busy but productive.
  • Low aggregate utilization and high workload pressure: Host-level averages may be hiding a local bottleneck.
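The patterns above can be written down as a rough rule set. This is a sketch under stated assumptions: the function name diagnostic_hints and both cutoffs (10.0 for pressure, 80.0 for utilization) are hypothetical, and it looks at a single container rather than many, so the second pattern is only approximated.

```python
def diagnostic_hints(container_psi, node_psi, node_utilization,
                     psi_high=10.0, util_high=80.0):
    """Map pressure and utilization readings to the diagnostic patterns.

    All inputs are percentages. psi_high and util_high are illustrative
    cutoffs, not recommended values.
    """
    container_high = container_psi >= psi_high
    node_high = node_psi >= psi_high
    hints = []
    if container_high and not node_high:
        hints.append("workload may be constrained by its own limit or configuration")
    if container_high and node_high:
        hints.append("node may be overloaded or overcommitted")
    if node_utilization >= util_high and not container_high and not node_high:
        hints.append("resource may be busy but productive")
    if node_utilization < util_high and container_high:
        hints.append("host-level averages may be hiding a local bottleneck")
    return hints


# A throttled container on a mostly idle node (made-up numbers).
for hint in diagnostic_hints(container_psi=35.0, node_psi=1.2, node_utilization=40.0):
    print(hint)
```

The patterns are not mutually exclusive, which is why the function returns a list: a constrained container on a lightly used node matches both the first and the last pattern.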

This is where PSI becomes especially useful in containerized environments. It allows operators to move beyond the health of the machine and examine whether a particular workload is actually able to progress. And now it’s even simpler: as of Kubernetes 1.36, PSI metrics are generally available and are enabled and exported by default.

Monitoring and Alerting

PSI should be displayed alongside existing infrastructure and application metrics, not on a separate dashboard without context.

For CPU investigations, correlate:

  • CPU utilization
  • CPU pressure
  • CPU throttling
  • Application latency and throughput

For memory, correlate:

  • Memory usage and working set
  • Memory pressure
  • Page faults or reclaim metrics
  • OOM kills and restarts
  • Application latency

For I/O, correlate:

  • I/O pressure
  • Device latency
  • Throughput and queue depth
  • Application latency

There is no universal PSI threshold that defines an unhealthy system.

A latency-sensitive API, a database and a batch-processing worker may tolerate very different amounts of pressure. Short spikes may also be harmless, while lower but sustained pressure can cause a significant service-level impact.

Good alerts should therefore prefer sustained pressure over isolated samples and, where possible, combine pressure with an application symptom.

For example, CPU pressure together with rising request latency is far more actionable than CPU pressure alone.
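An alert condition along these lines might look like the sketch below. Everything in it is an assumption made for illustration: the function name should_alert, the thresholds and the number of consecutive samples would all need tuning per workload, and a real setup would express the rule in the monitoring system's own rule language rather than in Python.

```python
def should_alert(pressure_samples, latency_ms,
                 pressure_threshold=10.0, latency_threshold_ms=500.0,
                 min_consecutive=3):
    """Fire only when pressure is sustained AND an application symptom exists.

    pressure_samples: recent PSI percentages, oldest first.
    latency_ms: current request latency for the affected service.
    All thresholds are hypothetical defaults for this sketch.
    """
    recent = pressure_samples[-min_consecutive:]
    sustained = (
        len(recent) == min_consecutive
        and all(sample >= pressure_threshold for sample in recent)
    )
    return sustained and latency_ms >= latency_threshold_ms


print(should_alert([2.0, 45.0, 3.0], latency_ms=800.0))    # False: isolated spike
print(should_alert([12.0, 14.0, 13.0], latency_ms=120.0))  # False: no symptom
print(should_alert([12.0, 14.0, 13.0], latency_ms=800.0))  # True
```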

What PSI Cannot Tell You

PSI detects contention, but it’s not a root-cause detector.

High CPU pressure doesn’t explain whether the problem is inefficient code, excessive concurrency, an undersized CPU limit or poor workload placement. High memory pressure doesn’t distinguish a memory leak from an overly small allocation or cache churn. High I/O pressure doesn’t identify a slow device, query or file operation.

Additional telemetry is still required:

  • Application metrics reveal user-visible effects
  • Distributed traces locate slow requests and dependencies
  • Profiling identifies expensive code
  • cgroup and scheduler metrics explain resource constraints
  • Storage metrics reveal device-level behavior
  • Logs and events provide operational context

PSI narrows the investigation by showing which resource is preventing progress, and where the impact is occurring.

From Resource Monitoring to Progress Monitoring

Infrastructure monitoring has traditionally focused on consumption: how much CPU is used, how much memory is occupied and how much data storage devices transfer.

These measurements remain essential, but they are incomplete.

Applications do not experience utilization percentages. They experience waiting. They wait to be scheduled on a CPU, they wait while the kernel reclaims memory, and they wait for I/O operations to complete. Those delays are where resource contention becomes latency and lost throughput.

Pressure Stall Information makes that lost time measurable.

Its main contribution is not simply another dashboard panel. It introduces a different way to think about resource problems: a resource shortage is not defined only by how much capacity has been consumed. It is defined by whether useful work can continue.

Utilization tells us that a resource is busy; PSI tells us when workloads are paying the price.

Luigi Miazzo

Software Developer - IT System & Service Management Solutions at Würth IT Italy
