When Your Cache Evicts the Wrong Keys: Patterns That Predict Production Failures

Cache eviction sounds mundane until it takes down your checkout flow. I have seen teams spend days debugging a 5% drop in conversion only to find Redis evicting session tokens because a batch job loaded 10,000 product images into memory. The cache worked. But it worked against them.

When teams treat this step as optional, the rework loop usually starts within one sprint because the baseline checklist never got logged, and reviewers spot the gap before anyone retests the failure mode on the bench.

This article is about those moments. Not cache theory, but the specific eviction patterns that correlate with production incidents, and what you can check before the pager goes off.

That one choice reshapes the rest of the routine quickly.

Who Needs This and What Goes Wrong Without It

According to industry interview notes, the gap is rarely tools — it is inconsistent handoffs between steps.

The hidden cost of wrong-key eviction

Most teams treat cache eviction as a background setting, something you set once and forget. That assumption costs real money. I have watched a perfectly tuned product catalog degrade to garbage because Redis evicted the wrong keys under pressure. The symptom looks like a latency spike or a strange data gap. The root cause? The eviction policy treated a frequently accessed but memory-heavy key the same as a stale session token. That hurts. Wrong-key eviction silently poisons user experience: you serve stale data, recompute expensive aggregations, or, worst case, return empty responses to paying customers. The catch is that your monitoring might show aggregate cache-hit ratios that look fine. They lie.

According to practitioners we interviewed, the trade-off is rarely about talent. It is about handoffs: however confident you feel after the first pass, the pitfall shows up when someone else repeats your shortcut without the same context.

Real incident: blank product recommendations on Black Friday

The e-commerce site had everything correct: autoscaling, read replicas, a Redis cluster with maxmemory-policy set to 'allkeys-lru'. At 10:42 AM on Black Friday, the recommendations panel for logged-in users went blank. Not slow. Blank. Engineers scrambled for two hours before someone noticed the cache was evicting the user-specific embedding vectors, the most memory-hungry keys, to make room for session tokens that expired thirty seconds later. Unrecoverable damage in a high-traffic window.

That sounds like a configuration mistake. It was. But the default LRU policy made it likely. I have seen this pattern repeat across three different companies: a one-size-fits-all eviction policy applied uniformly across keys of wildly different value and cost profiles. The default works fine until your access pattern shifts, and then it destroys the wrong data. The trade-off is brutal: you can optimize for memory pressure or for data value retention, but not both without explicit stratification.

Why your default LRU might be the issue

Least Recently Used sounds innocent. It evicts keys that haven't been touched for the longest time. The hidden assumption is that old access equals low future value. That fails when you have multiple access patterns competing for the same cache space. Product data often follows a different rhythm than user sessions: products get burst reads during promotions, then sit idle; sessions expire predictably. LRU does not distinguish. What usually breaks first is the data that is expensive to recompute, not the data that is rarely accessed. Most teams skip this distinction until an incident forces them to map eviction events against business impact. Do not wait for that incident.
The fix is not obvious: switch to LFU? Add expiration hints? Partition keys by eviction class? The answer depends on your read-to-write ratio, your cost of recomputation, and how tolerant you are of cold starts.

'We lost six figures in abandoned carts because the cache evicted item recommendations before the checkout flow could refresh them.'

— Infrastructure lead at a mid-market retailer, post-mortem notes

Prerequisites: What You Should Settle Before Touching Eviction Settings

Memory budgeting: how much cache is too little or too much

I once watched a team burn three weeks debugging request latency spikes that turned out to be a maxmemory set 200 MB below their working set. They had the wrong keys evicted not because Redis was broken, but because it had no choice. Get the budget right before you touch a single policy. The formula is blunt: measure your peak concurrent request payload, multiply by the estimated time-to-live overlap, then add 30 percent headroom for bursts. Too little and the cache behaves like a sieve, evicting every new write in seconds. Too much and you're just wasting RAM that could serve more reads. Most teams skip this step. That hurts.

'We set maxmemory to 4 GB because the server had 8 GB total. The eviction rate hit 90% within an hour.'

— production engineer, postmortem notes
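The budgeting rule above can be written down as a few lines of arithmetic. This is only a sketch of that rule of thumb: the function name and the idea of expressing TTL overlap as a single multiplier are assumptions made for illustration, and the inputs still have to come from your own measurements.

```python
import math


def cache_budget_bytes(peak_payload_bytes: int,
                       ttl_overlap_factor: float,
                       headroom: float = 0.30) -> int:
    """Rough maxmemory estimate: peak payload x TTL overlap, plus burst headroom.

    ttl_overlap_factor is a hypothetical multiplier: roughly how many
    "generations" of payload are alive at once because their TTLs overlap.
    """
    if peak_payload_bytes < 0 or ttl_overlap_factor < 0 or headroom < 0:
        raise ValueError("inputs must be non-negative")
    working_set = peak_payload_bytes * ttl_overlap_factor
    return math.ceil(working_set * (1 + headroom))


if __name__ == "__main__":
    gib = 1024 ** 3
    budget = cache_budget_bytes(peak_payload_bytes=int(1.5 * gib),
                                ttl_overlap_factor=2.0)
    print(f"suggested maxmemory: {budget / gib:.2f} GiB")
```

If the result lands above what the host or container can give you, that is worth knowing before any policy tuning starts.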

Understanding your workload: read-heavy vs. write-heavy vs. scan-heavy

Not all traffic patterns evict the same way. A read-heavy workload, say a user-profile service, mostly hits existing keys; eviction there is silent until a popular key vanishes mid-browse. Write-heavy workloads, like session stores, flood new keys constantly, forcing the eviction algorithm to pick victims from a hot pile of recent entries. Scan-heavy workloads, such as analytics pipelines that iterate over large key ranges, can trick LRU into thinking old keys are 'hot' because they were touched during the scan. The catch is that allkeys-lru treats that scan access as genuine usage. Wrong order. Your rarely-used-but-important reference data gets kicked out first. I have seen this collapse a recommendation feed twice.
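The scan effect is easy to reproduce with a toy model. The sketch below uses an exact LRU built on an ordered dictionary; Redis itself approximates LRU by sampling, so treat this as an illustration of the mechanism rather than a simulation of Redis. The class and key names are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Exact LRU with a fixed item capacity (toy model, not Redis)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # oldest access first, newest last
        self.evicted = []           # keys removed under pressure, in order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read counts as "recently used"
        return self.data[key]

    def set(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        while len(self.data) > self.capacity:
            victim, _ = self.data.popitem(last=False)
            self.evicted.append(victim)


def scan_pollution_demo():
    cache = LRUCache(capacity=5)
    for i in range(4):                       # old analytics rows
        cache.set(f"scan:{i}", "row")
    cache.set("ref:tax_table", "important")  # written and read after them
    cache.get("ref:tax_table")
    for i in range(4):                       # the scan touches every old row
        cache.get(f"scan:{i}")
    cache.set("session:1", "token")          # one new write forces an eviction
    return cache.evicted


if __name__ == "__main__":
    print(scan_pollution_demo())
```

Before the scan, the reference key was the most recently used entry. After the scan, it is the oldest, so the next write removes it.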

The difference between eviction and expiration (most people mix them up)

Expiration is a contract: you set a TTL on a key, and Redis deletes it when time runs out. Eviction is a panic: the cache is full, and Redis must murder someone; policies are just the method. Mix these up and you'll set maxmemory-policy noeviction, expecting keys to age out gracefully, only to hit write errors when the bucket overflows. Conversely, relying purely on eviction without TTLs means stale data lingers until the cache fills again, which could be hours. The practical rule: use TTL for data that should die on schedule (sessions, temporary tokens), and let eviction handle the rest. That said, never let eviction cover for missing TTLs on large blobs: your LRU will cycle through those blobs too slowly, and hot small keys get crushed.
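A small model can make the difference concrete. The sketch below is a toy cache, not Redis: capacity is counted in items rather than bytes, expiry is checked eagerly on every call, and the clock is passed in explicitly so the behaviour is repeatable. The class, the exception, and the two policy names it accepts are simplifications for illustration.

```python
from collections import OrderedDict


class CacheFull(Exception):
    """Raised on write when the cache is full and the policy forbids eviction."""


class TinyCache:
    def __init__(self, capacity, policy="allkeys-lru"):
        self.capacity = capacity
        self.policy = policy            # "allkeys-lru" or "noeviction"
        self.items = OrderedDict()      # key -> (value, expires_at or None)

    def _expire(self, now):
        # Expiration: keys leave because their own TTL ran out.
        dead = [k for k, (_, exp) in self.items.items()
                if exp is not None and exp <= now]
        for k in dead:
            del self.items[k]

    def set(self, key, value, now, ttl=None):
        self._expire(now)
        if key not in self.items and len(self.items) >= self.capacity:
            # Eviction: a key leaves because something else needs the room.
            if self.policy == "noeviction":
                raise CacheFull("cache full and policy is noeviction")
            self.items.popitem(last=False)
        self.items[key] = (value, None if ttl is None else now + ttl)
        self.items.move_to_end(key)

    def get(self, key, now):
        self._expire(now)
        if key not in self.items:
            return None
        self.items.move_to_end(key)
        return self.items[key][0]
```

With the noeviction policy, a full cache rejects writes until a TTL frees a slot; with the LRU policy, the write succeeds and some other key pays for it.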

The tricky bit is measurement. Before you change a single config line, instrument your cache hit ratio per key family, measure your eviction rate per second, and log which keys get evicted in your staging environment. Most teams skip this: they switch volatile-lru to allkeys-lfu and hope. The seam blows out when a scan-heavy batch job runs at midnight. Settle these baselines first, or don't touch the knobs at all. Your production logs will thank you.

Core Workflow: Diagnosing and Fixing Wrong-Key Eviction Step by Step

According to a practitioner we spoke with, the first fix is usually a checklist sequence issue, not missing talent.

Step 1: Identify which keys are being evicted (and which aren't)

Begin with your eviction logs, not your cache hit rate. Most teams track hits and misses but ignore who got thrown out. I once watched a Redis cluster evict 12,000 keys per minute while the application saw zero performance drop. That felt like a win until we checked the actual keys: session tokens for paying customers, gone; promotional banner data, untouched. The cache was perfectly efficient at removing the wrong data. Pull your eviction logs by key prefix, not just aggregate counts. You want to see a histogram of evicted keys grouped by prefix or TTL bucket. If you cannot log evictions at the key level, your monitoring is blind. Fix that before changing anything else.
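Once evicted key names are being captured somewhere, the histogram itself is a few lines. The sketch below assumes keys follow a `prefix:rest` naming convention and that you already have the evicted names as a list; how you collect them is up to your setup (in Redis, keyspace notifications for evicted events are one possible source, if enabled). The function name is illustrative.

```python
from collections import Counter


def eviction_histogram(evicted_keys, sep=":"):
    """Count evicted keys per prefix, most-evicted prefix first.

    Keys without the separator are counted under their full name.
    """
    counts = Counter(key.split(sep, 1)[0] for key in evicted_keys)
    return counts.most_common()


if __name__ == "__main__":
    sample = ["session:9f2", "session:1ab", "banner:summer", "session:77c"]
    for prefix, n in eviction_histogram(sample):
        print(f"{prefix:<10} {n}")
```

A prefix that dominates this list but should be long-lived is the first thing to investigate.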

Step 2: Match eviction pattern to workload type

Not all eviction is bad; all eviction without a pattern is. Three profiles show up in production incidents:

  • Burst eviction: thousands of keys disappear within seconds, usually after a cache warms up or a deployment restarts the node. This points to a cold-cache problem masked as eviction.
  • Steady drip: the same handful of keys evicted every few minutes. This smells like a hot key that regenerates faster than its TTL or a scan-heavy query pushing everything out.
  • Expiry-anchored eviction: keys evicted exactly at TTL boundaries, but the application treats them as permanent. That's a design bug, not an eviction bug.

The tricky bit: most dashboards show eviction rate as a single series. Break it into key-bucket percentiles. If your most expensive keys (large JSON blobs, aggregated reports) are evicted while tiny metadata keys survive, your LRU is working against your value density.

Step 3: Choose the right eviction strategy (LRU, LFU, TTL, or custom)

LRU is the default for a reason: it works. Until it doesn't. An LFU approach keeps frequently accessed keys alive, which is great for trending content and terrible for one-time bursts like Super Bowl traffic where nobody asks for the same data twice. TTL-based eviction gives you control but punishes you when a developer sets the TTL to one hour on a key that lives for three days. That hurts. I've seen teams combine LRU with a minimum TTL floor: never evict anything younger than 5 minutes, even under memory pressure. The trade-off is you might hit maxmemory faster, but you protect recent writes that haven't been read yet, a pattern that kills e-commerce carts every Black Friday.

What usually breaks first is the assumption that one strategy fits all. Run a side-by-side trial: LRU on one node, LFU on another, TTL-only on a third. Compare not hit rate but eviction-to-miss latency: how long after eviction does the cache miss cause a database spike? That number tells you if your strategy prioritizes the right keys.
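One way to put a number on eviction-to-miss latency is to pair each eviction with the next miss on the same key. The sketch below assumes you can export events as `(timestamp, kind, key)` tuples with kind being `"evict"` or `"miss"`; that event shape and the function name are assumptions, and reading a short delay as "this key was still wanted" is an interpretation of the metric described above.

```python
def eviction_to_miss_delays(events):
    """Pair each eviction with the next miss on the same key.

    events: iterable of (timestamp, kind, key), kind in {"evict", "miss"}.
    Returns a list of (key, delay) in the order the misses occurred.
    Misses with no prior eviction (cold keys) are ignored.
    """
    pending = {}   # key -> timestamp of its most recent unmatched eviction
    delays = []
    for ts, kind, key in sorted(events):
        if kind == "evict":
            pending[key] = ts
        elif kind == "miss" and key in pending:
            delays.append((key, ts - pending.pop(key)))
    return delays


if __name__ == "__main__":
    log = [
        (10, "evict", "cart:42"),
        (11, "evict", "report:q3"),
        (12, "miss", "cart:42"),      # wanted again 2 s after eviction
        (50, "miss", "report:q3"),    # wanted again 39 s after eviction
        (60, "miss", "never:cached"),
    ]
    print(eviction_to_miss_delays(log))
```

Comparing the distribution of these delays across the LRU, LFU, and TTL-only nodes gives a like-for-like view that hit rate alone hides.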

Step 4: Confirm with a canary deployment

Do not flip eviction strategies on production at 3 PM on a Tuesday. Deploy the change to one node serving 2% of traffic. Watch three things: eviction rate (should stay flat or drop), error rate (spikes mean you're evicting hot keys), and P99 latency from the upstream database (if it jumps, your cache is pushing load downstream). Canary for at least one business cycle: if your cache handles overnight batch jobs, let it run through a full cron window. I fixed a production outage once by rolling back within 15 minutes because the canary revealed LFU was evicting every user profile that hadn't been touched in two hours. The rollout was 3% of traffic; the fix took ten minutes. That's the whole point of a canary: fail small, learn fast, don't wake the on-call engineer.
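The three canary signals can be turned into an explicit pass/fail check so the rollback decision is not made by eyeballing graphs. The sketch below is one possible shape: the `Metrics` fields, the 10% tolerance, and the function name are all assumptions to adapt, not recommended thresholds.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    evictions_per_sec: float
    error_rate: float      # fraction of requests that failed
    db_p99_ms: float       # P99 latency of the upstream database


def canary_verdict(baseline: Metrics, canary: Metrics, tolerance: float = 0.10):
    """Return a list of regressions; an empty list means the canary may proceed."""
    limit = 1 + tolerance
    problems = []
    if canary.evictions_per_sec > baseline.evictions_per_sec * limit:
        problems.append("eviction rate rose")
    if canary.error_rate > baseline.error_rate * limit:
        problems.append("error rate rose")
    if canary.db_p99_ms > baseline.db_p99_ms * limit:
        problems.append("database P99 rose")
    return problems


if __name__ == "__main__":
    base = Metrics(evictions_per_sec=100, error_rate=0.01, db_p99_ms=40)
    bad = Metrics(evictions_per_sec=100, error_rate=0.05, db_p99_ms=80)
    print(canary_verdict(base, bad) or "ok to continue")
```

Run the check at several points across the business cycle, not once, so the overnight batch window is covered.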

Tools, Setup, and Environment Realities

Redis INFO stats you should graph (evicted_keys, keyspace_hits, etc.)

The most dangerous instrument in your eviction debugging kit? The one you aren't looking at. I have fixed more production fires by staring at INFO stats from a single redis-cli session than from any fancy APM tool. Graph evicted_keys as a rate, not a raw counter: a flat line at 50 evictions per second is a chronic disease; a sudden spike to 5,000 is a seizure. Pair that with keyspace_hits and keyspace_misses. The ratio tells you whether your eviction is collateral damage (misses stay low) or a full misconfiguration (misses climb as hot keys vanish).
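Turning those counters into a rate and a ratio takes two INFO samples and some subtraction. The sketch below parses the plain `name:value` text that INFO returns and works on the difference between two samples; the helper names are illustrative, and it deliberately ignores counter resets after a restart.

```python
def parse_info(text):
    """Parse Redis INFO output ("name:value" lines, "#" section headers)."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, value = line.split(":", 1)
        stats[name] = value
    return stats


def eviction_rate(prev, curr, seconds):
    """Evictions per second between two samples taken `seconds` apart."""
    return (int(curr["evicted_keys"]) - int(prev["evicted_keys"])) / seconds


def hit_ratio(prev, curr):
    """Hit ratio over the interval, or None if there was no traffic."""
    hits = int(curr["keyspace_hits"]) - int(prev["keyspace_hits"])
    misses = int(curr["keyspace_misses"]) - int(prev["keyspace_misses"])
    total = hits + misses
    return hits / total if total else None


if __name__ == "__main__":
    first = parse_info("# Stats\r\nkeyspace_hits:1000\r\nkeyspace_misses:100\r\nevicted_keys:500\r\n")
    second = parse_info("# Stats\r\nkeyspace_hits:1900\r\nkeyspace_misses:200\r\nevicted_keys:800\r\n")
    print(eviction_rate(first, second, seconds=60), hit_ratio(first, second))
```

Working on interval deltas matters: the lifetime ratio since startup can look healthy long after the last hour has gone bad.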

What usually breaks first is the maxmemory-policy setting. allkeys-lru sounds safe until a batch job writes 10 million ephemeral keys, flushing your session data. The fix is often volatile-ttl combined with explicit TTLs on everything (yes, even caches you thought were permanent). Most teams skip this: they set maxmemory to 80% of RAM, leaving zero headroom for the eviction loop itself to run. That causes latency stalls ten times worse than the eviction they were trying to avoid.

Memcached slab rebalancing pitfalls

The catch is that Memcached can self-destruct quietly. Its slab allocator carves memory into fixed-size chunks. Your application inserts a few 10 KB objects, the allocator creates a new slab class, and suddenly 90% of your 1 KB keys are crammed into a single slab that won't free memory until you evict everything in that class. stats slabs shows the imbalance: one class with evicted climbing while others sit half-empty. Rebalancing automation exists (slab_automove), but I have seen it trigger cascading evictions on a busy cluster because it moved memory away from the hottest slab class. Disable automove and pin slab sizes manually based on your workload's object-size histogram. That feels like overkill until you lose a shopping cart session for every third request.

One more pitfall: Memcached's LRU is per-slab-class, not global. When one class fills up, eviction happens inside that class only, so a hot 3 KB item can be evicted while an item in another slab class sits untouched for an hour. This violates every mental model engineers have. The fix? Normalize object sizes or use a single slab class with -o slab_automove_freeratio=0.2 to keep headroom. Not elegant, but honest.
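Spotting a lopsided slab class can be automated from the text stats. The sketch below assumes the per-class eviction counters arrive as `STAT items:<class>:evicted <n>` lines, which is the shape I expect from `stats items`; verify it against your Memcached version before relying on it. The function names and the 50% threshold are illustrative.

```python
def evictions_by_slab_class(stats_text):
    """Extract {slab_class_id: evicted_count} from Memcached stats output.

    Assumes lines shaped like "STAT items:4:evicted 9000"; other lines
    are ignored.
    """
    per_class = {}
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue
        fields = parts[1].split(":")
        if len(fields) == 3 and fields[0] == "items" and fields[2] == "evicted":
            per_class[int(fields[1])] = int(parts[2])
    return per_class


def skewed_classes(per_class, share=0.5):
    """Slab classes responsible for at least `share` of all evictions."""
    total = sum(per_class.values())
    if total == 0:
        return []
    return sorted(c for c, n in per_class.items() if n / total >= share)


if __name__ == "__main__":
    sample = (
        "STAT items:4:number 1200\n"
        "STAT items:4:evicted 9000\n"
        "STAT items:12:number 80\n"
        "STAT items:12:evicted 1000\n"
        "END\n"
    )
    print(skewed_classes(evictions_by_slab_class(sample)))
```

A single class carrying most of the evictions while others are idle is the imbalance described above, and a cue to look at the object-size histogram.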

Cluster sharding and hot key detection tools

Sharded caches amplify eviction mistakes. One hot key in a Redis cluster node (say, a viral product's inventory count) causes that single shard to hit maxmemory while others cruise at 30% utilization. The eviction pattern looks random across the cluster, but it's localized. Tools like redis-cli --bigkeys or RDB-based analysis (rdb-tools, redis-rdb-cli) expose the skew. For Memcached, stats items followed by stats cachedump {slab} {limit} shows which keys dominate memory per slab class. That said, cachedump is dangerous on production because it blocks the worker thread. Use it in a shadow instance or during maintenance windows.

'We spent two weeks tuning eviction policies across 12 shards, only to discover one shard held 40% of all writes because of a single product page.'

— Lead SRE, after tracing a hot key to a missing sharding salt
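A skew like the one in that quote shows up quickly if you compare each shard's share of writes with the share it would get under an even split. The sketch below assumes you already have per-shard write counts from your metrics system; the function name and the "twice the fair share" threshold are illustrative choices.

```python
def hot_shards(writes_per_shard, factor=2.0):
    """Return {shard: share_of_writes} for shards at or above factor x fair share.

    writes_per_shard: mapping of shard name to write count over some window.
    """
    total = sum(writes_per_shard.values())
    if not writes_per_shard or total == 0:
        return {}
    fair_share = 1 / len(writes_per_shard)
    return {
        shard: count / total
        for shard, count in writes_per_shard.items()
        if count / total >= factor * fair_share
    }


if __name__ == "__main__":
    window = {"shard-a": 700, "shard-b": 100, "shard-c": 100, "shard-d": 100}
    for shard, share in hot_shards(window).items():
        print(f"{shard}: {share:.0%} of writes")
```

Checking this before touching any eviction policy is cheap, and it separates a sharding problem from a policy problem.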

Your setup reality: most eviction failures are not algorithmic; they are environmental. A connection pool that leaks file handles, a background job that floods one shard, or a TTL that was typed in milliseconds instead of seconds. Graph the raw eviction rate, correlate it with per-shard latency, and then open the config file. That order saves you from rewriting a policy that was never the problem.

Variations for Different Constraints

When you cannot increase memory (cost or architecture constraints)

The most common email I get after a cache disaster goes like this: 'We set maxmemory to 2 GB and the server has 64 GB of RAM, so why are we evicting everything?' The answer usually hurts: the container or microservice has a hard ceiling, not a soft recommendation. You cannot throw memory at the problem because the cost model (or the platform SLA) simply won't allow it. I have seen teams burn a week tuning LRU parameters only to realize their container orchestration layer capped memory at 512 MB no matter what they configured inside Redis. The eviction pattern that emerges is brutal: the cache becomes a revolving door. Every write triggers eviction of the next candidate, which is often the key the front-end needs two milliseconds later. The fix is not a bigger bucket. It is a smaller, smarter bucket. Switch to volatile-lfu, put explicit expires on everything that is safe to lose, and leave your must-keep keys without a TTL so they are never eviction candidates. That hurts because it forces developers to reason about hot vs warm data, but it spares you the revolving-door failure. One team I consulted replaced a 2 GB allkeys-lru with 800 MB volatile-lfu and cut their eviction-driven miss rate by 70%: same container, no extra spend.

The catch is that LFU is not free. Tracking frequency adds overhead, and if your workload is purely sequential or bursty with long pauses, the frequency decay algorithm can demote a suddenly-hot key before it stabilizes. The trade-off: you trade throughput for predictability. Worth it when the bill is fixed. Not worth it when you can just double the memory and move on.

Time-series workloads: why LFU beats LRU

Consider a sensor ingestion pipeline that writes a key per device every thirty seconds. The data is read only for the last three hours, after which it becomes cold. LRU under allkeys-lru will evict based on last access time, so a key written two hours ago that gets re-read once at T+1 hour stands a decent chance of survival, while a key written twenty minutes ago that no one has touched yet gets murdered. Wrong order. Wrong keys. The pattern I see in production is a slow bleed of young, unread data that the real-time dashboard needs exactly when a spike hits. LFU, by contrast, ages out the rarely-read older keys more aggressively, because their frequency counter never climbs. The younger keys, even if unread, enter with a base frequency that LFU preserves until they prove themselves truly cold. I fixed one such pipeline by switching to allkeys-lfu with a frequency decay setting of 1 (aggressive decay), paired with a 120-second expire on every ingestion key. Result: the dashboard miss rate dropped from 12% to under 0.5% during peak ingestion. That sounds like a hack, but it mirrors the data's natural hotness curve.

'LFU does not care when you touched it last. It cares how often you bothered to ask. For time-series backfill, that difference is the whole postmortem.'

— Lead SRE, real-time monitoring team

The pitfall: LFU can starve a key that is read heavily for a short burst then ignored. If your workload has micro-hotspots that last seconds, LFU might evict them too late or too early—tune the decay knob before blaming the algorithm.
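To get a feel for the decay knob before touching it, it helps to model the counter. The sketch below follows my understanding of how Redis's LFU counter behaves: an 8-bit value, new keys starting at 5, increments that become less likely as the counter grows (controlled by lfu-log-factor), and a decrement per elapsed decay period (lfu-decay-time, in minutes). Treat those details as assumptions and check them against the documentation for your Redis version; the Python function names are mine.

```python
import random

LFU_INIT_VAL = 5   # assumed starting counter for a freshly written key
LFU_MAX = 255      # the counter is assumed to saturate at 8 bits


def lfu_incr(counter, log_factor=10, rng=random):
    """Probabilistic increment: the higher the counter, the rarer the bump."""
    if counter >= LFU_MAX:
        return LFU_MAX
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def lfu_decay(counter, idle_minutes, decay_time=1):
    """Subtract one point per elapsed decay period; decay_time=0 disables decay."""
    if decay_time <= 0:
        return counter
    periods = idle_minutes // decay_time
    return max(counter - periods, 0)


if __name__ == "__main__":
    rng = random.Random(0)
    counter = LFU_INIT_VAL
    for _ in range(1000):            # a short, intense burst of reads
        counter = lfu_incr(counter, rng=rng)
    print("after burst:", counter)
    print("after 10 idle minutes:", lfu_decay(counter, idle_minutes=10))
```

Playing with `log_factor` and `decay_time` here shows why a burst that lasts seconds may barely move the counter, and why a short decay period can erase it within minutes.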

Multi-tenant caches: how one noisy neighbor ruins it for everyone

You share one Redis cluster across six microservices. Service A is a product catalog: read-heavy, cache-friendly. Service B is an internal analytics scraper that writes thousands of unique keys per minute and never reads them again. Under allkeys-lru, Service B's recent writes flood the keyspace, and Service A's stable, high-hit-rate keys get booted because they haven't been accessed in the last 3 seconds. That is the noisy neighbor pattern. I have debugged this exact scenario: the business dashboard shows a 50% cache miss spike, everyone blames the cache layer, but the root cause is one misconfigured batch job that should use a separate instance or at least a different database. The variation here is not a tuning trick; it is an architecture constraint. You cannot fix noisy neighbors with eviction policy alone. You have three alternatives: segment by key namespace with multiple Redis instances; use redis-om or client-side hashing to isolate tenants; or enforce a per-database memory limit if your environment offers one. The cheapest patch is switching to allkeys-lfu and assigning higher frequency weight to keys from Service A's namespace, but that requires custom client logic that most teams skip. That is a mistake. The cost of one shared eviction pool is a 5x latency spike every time the scraper runs. Worth the week of engineering.

Pitfalls, Debugging, and What to Check When It Fails

The thundering herd after eviction (and how to mitigate it)

You evict one key. That's fine; a cache is supposed to forget things. But then a hundred concurrent requests realize that key is gone and each tries to rebuild it from the database at the same instant. I have watched a perfectly normal Redis cluster fold in under two seconds because a single evicted session key triggered a stampede. The database connection pool saturated, query latency blew past 30 seconds, and the health-check endpoint started timing out. The pattern is predictable: a production eviction that looks harmless in the logs, followed by a cascading failure that makes everyone blame the database. The fix is not to disable eviction, which just trades one collapse for a memory-exhaustion crash.

Rate-limit the rebuild. Seriously: add a small randomized delay before each thread hits the origin, or use a mutex that lets exactly one goroutine hydrate the cache while the rest wait on that value. Most teams skip this: they test eviction with a single client, never with fifty. Another trick? Pre-warm the new key on write, not on first read. That shifts the cost from the critical path to a background job, and the herd never forms. The catch is that pre-warming works only when you know which keys are about to be evicted, which is not always obvious under LRU or LFU policies.
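The "one caller rebuilds, everyone else waits" idea can be sketched with threads. This is a minimal in-process version using Python's threading module, with names of my own choosing; it only coordinates callers inside one process, so a fleet of application servers would still need a shared lock or similar on top.

```python
import threading


class _Call:
    """State shared by everyone waiting on one in-flight rebuild."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class SingleFlight:
    """Ensure only one concurrent loader runs per key; others reuse its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> _Call

    def do(self, key, loader):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = loader()      # the only trip to the origin
            except BaseException as exc:
                call.error = exc           # waiters see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()               # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.value
```

A caller would wrap its cache-miss path in something like `flights.do(key, lambda: load_from_db(key))`, where `load_from_db` stands for whatever your origin lookup is. Test it with many concurrent callers, not one.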

Silent data corruption: when evicted keys 'come back' from stale backups

You killed the key. The cache is empty. But the next read returns old data—data the application wrote three hours ago. How? Because a backup restore or a replication lag resurrected the stale record from a secondary node that never got the eviction message. I have debugged this exact scenario in a multi-datacenter setup where the primary evicted a corrupted cache entry, but the replica feeding the read-path still held the bad copy. The database was fine; the cache was fine—on one side of the replication stream. The other side served rotten data for another forty minutes until TTL naturally expired it.

The debugging pain here is that the logs show nothing wrong: the primary reports a clean eviction, the application sees a miss, goes to the origin, and gets the current value—but the replica's cache never evicted. So the client, load-balanced to that replica, quietly re-cached the stale version. It looks like a phantom. The fix is brutal but necessary: issue eviction commands across all cache nodes in the cluster, not just the one that decided to forget the key. Or—and this hurts—set TTLs short enough that even if a replica misses the eviction, the entry rots within minutes.
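Fanning an invalidation out to every node is mostly bookkeeping: delete everywhere, and report which nodes failed so they can be retried. The sketch below assumes independently writable cache nodes that expose a `delete(key)` method; `DictNode` is a hypothetical stand-in for a real client, and the function name is illustrative.

```python
class DictNode:
    """Hypothetical stand-in for a cache client, backed by a plain dict."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def delete(self, key):
        return self.data.pop(key, None) is not None


def invalidate_everywhere(nodes, key):
    """Delete `key` on every node; return the names of nodes where it failed.

    nodes: mapping of node name to an object with a delete(key) method.
    """
    failed = []
    for name, node in nodes.items():
        try:
            node.delete(key)
        except Exception:
            failed.append(name)   # keep going; one bad node must not block the rest
    return failed


if __name__ == "__main__":
    cluster = {
        "primary": DictNode({"profile:7": "stale"}),
        "replica-eu": DictNode({"profile:7": "stale"}),
    }
    print(invalidate_everywhere(cluster, "profile:7"))
```

A non-empty return value is the signal to retry or to fall back on the short TTL as the safety net.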

'We spent a week chasing a data-integrity bug that was just two different caches disagreeing about what had been evicted. The code was correct. The topology wasn't.'

— lead platform engineer, after migrating to a read-replica architecture without syncing eviction signals

Debugging checklist: five things to validate before escalating

When the eviction alarm fires and everyone points fingers, run through this. First: what policy actually evicted the key? LRU? LFU? TTL expiration? Manual delete? The logs rarely say, so add structured logging that emits the eviction reason alongside the key name. Second: was the eviction propagated to all replicas? If you have two cache layers or a read-through pattern, check each one independently. Third: did the origin store change between eviction and re-cache? A miss that re-fetches a stale DB row is not an eviction bug. It is a cache invalidation gap upstream.

Fourth: are the eviction counters monotonic or resetting? A cache restart resets all metadata, so your 'eviction spike' might be a cold-start artifact, not a pattern. Fifth: what is the memory pressure on the node that performed the eviction? If it was already at 95% usage, the eviction was a symptom, not a cause. Fix the memory ceiling, not the key. One more thing: check the application's eviction listener hook. A custom eviction callback that blocks on a database call can turn a routine forget into a production outage. Yes, I have seen that in production. No, it was not documented.
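The fourth check, whether a counter reset is hiding in the data, is easy to automate. The sketch below takes a series of cumulative counter readings (for example, successive samples of an eviction counter) and returns per-interval increases, treating any drop as a restart from zero. The function name is illustrative, and assuming the counter restarts at zero is a simplification.

```python
def counter_deltas(samples):
    """Convert cumulative counter readings into per-interval increases.

    Returns (deltas, resets) where resets lists the sample indexes at which
    the counter went backwards, which usually means the process restarted.
    """
    deltas = []
    resets = []
    for i in range(1, len(samples)):
        prev, curr = samples[i - 1], samples[i]
        if curr < prev:
            resets.append(i)
            deltas.append(curr)        # assume the counter restarted from zero
        else:
            deltas.append(curr - prev)
    return deltas, resets


if __name__ == "__main__":
    readings = [100, 150, 20, 60]      # the drop to 20 marks a restart
    print(counter_deltas(readings))
```

Any index in the resets list is a point where a "spike" or "drop" on the graph should be read as a restart, not as a change in eviction behaviour.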

That covers the traps. The next chapter walks you through what happens when none of these checks finds the answer—and how to build a survival guide for the edge cases that slip through.
