22:18 – The lights in the SOC are dimmed to “barely legal for human vision.” My third coffee of the night is mostly sugar syrup at this point. The floor hums at 47.3 dB—someone measured it last month during an argument about whether the white noise was “soothing” or “psychologically damaging.” We still haven’t settled it.
The first alert of the graveyard shift arrives like clockwork:
P1 – Multiple nodes reporting thermal shutdown – Zone 6 Cold Aisle 14–18
I grab the thermal camera (affectionately nicknamed “Predator Vision”) and head out. The data hall feels colder than usual, which is never a good sign. Cold aisles should be cold. When they feel extra cold, something is usually very wrong upstream.
23:04 – I find it almost immediately. One of the CRAH (Computer Room Air Handler) units on the roof has decided tonight is the night it stops spinning. The giant belt that drives the blower fan has snapped clean in half and is now wrapped around the pulley like modern art gone wrong.
There’s already a thin sheet of frost forming on the supply grille.
I radio the facilities lead:
“Unit 6-3 is toast. Belt disintegrated. We’re pulling warm air from the hot aisle now.”
His reply comes through static: “I’m 40 minutes out. Can you baby it?”
I look at the dead fan the size of a small car. “Baby it with what? CPR?”
23:37 – Improvisation hour. I find an ancient roll of high-strength Kevlar rigging tape (the kind they use to strap down shipping containers) in the back of a maintenance cart labeled “DO NOT TOUCH – PROPERTY OF 2017.” Close enough.
I spend twenty sweaty minutes jury-rigging the two belt halves together in the world’s most cursed timing-belt transplant. It’s ugly. It’s temporary. It’s probably violating seventeen safety codes.
I hit the start button.
The motor groans, the Frankenstein belt twitches… and then it catches. The fan slowly wakes up like a bear coming out of hibernation. Air starts moving again. The supply temp begins dropping from 28.1 °C back toward the promised land of 19–21 °C.
I text the on-call Slack channel one word:
“Survived”
Three laughing emojis and a raccoon sticker appear instantly. Kevin’s legacy lives on.
01:12 – False sense of security destroyed.
New alert:
CRITICAL – All access switches in Pod C rebooting in cascade – suspected bad code push
I sprint (well, fast-walk aggressively) to the network closet. Sure enough, every leaf switch in the pod is cycling through POST like they’re auditioning for a reboot montage. The upstream spines are starting to black-hole traffic.
I jam the console cable into my laptop and pray the bastard who pushed the code left commit messages.
They did not.
The last message in the changelog simply reads:
“should be fine now :)”
I hate that smiley face more than I’ve ever hated anything.
01:44 – I roll back to the previous image while simultaneously fielding three panicked DMs from the network team, who are all remote and useless right now. The switches come back online in waves. BGP sessions re-establish. The customer dashboards stop looking like heart monitors flatlining.
Someone in Slack posts a gif of a dumpster fire being extinguished by another, smaller dumpster fire. Poetic.
02:55 – Quiet returns. Too quiet.
I’m doing my third perimeter walk when I hear it: a very distinct drip… drip… drip…
I follow the sound to the far corner of Zone 8. There, in the ceiling tile grid directly above a row of shiny new NVMe arrays, is a growing dark spot. Water. Clear, cold, expensive-to-mitigate water.
I climb the ladder, push the tile aside, and shine my flashlight up into the abyss.
A sprinkler head has developed a tiny weep. Not enough to trigger a full discharge (thank every deity in the NOC pantheon), but enough to slowly turn our multi-million-dollar storage shelf into a very wet terrarium.
I grab a five-gallon bucket from the spill kit, balance it on top of a spare rack post, and position it under the drip like the world’s saddest chandelier.
Then I open what is now the most important ticket of the night:
P1 – Active sprinkler leak over production storage. Mitigation in place: one (1) yellow bucket. Escalating to facilities + plumbing emergency contractor.
I add a photo of the bucket catching water like it’s collecting tips.
03:40 – The leak has slowed to a nervous tic instead of a steady drip. The bucket is maybe one-quarter full. I’ve nicknamed it “Lake Michigan Jr.”
04:22 – Facilities finally arrives with tarps, plastic sheeting, and the look of someone who has already accepted that sleep is no longer part of tonight’s storyline.
We seal the offending sprinkler head with plumber’s tape, plastic, and liberal amounts of hope. The plumbing contractor will be here at first light.
05:10 – Shift handoff.
I hand off to the incoming admin:
- One resurrected CRAH belt
- One recovered network pod
- One active-but-contained sprinkler leak
- A bucket that now has its own incident number
He looks at the bucket, then at me.
“Kevin send a cousin?”
I shrug. “Maybe the raccoons unionized after all.”
He snorts. “Get out of here before something else explodes.”
I walk through the mantrap as the sky outside turns the color of bad coffee. Another night survived. No FM-200 dumped. No Tier-1 customer completely lost. No new wildlife ambassadors adopted.
I start the car and mutter to the dashboard:
“You beautiful, cursed concrete box. See you in twelve hours.”
Then I drive toward sunrise, already mentally writing tomorrow’s incident report titled:
“How We Saved the Internet with Duct Tape, Rage, and a Five-Gallon Bucket”
Because that’s just Tuesday around here.
(Or Wednesday. Or 3 a.m. Thursday. Time is fake in the datacenter.)
