When Everything That Can Happen, Does
05:47 AM – My phone vibrates itself off the nightstand. The on-call alert tone is the same four-bar chiptune that has haunted my dreams since 2019. Today’s message reads:
CRITICAL – Rack 17 PDU B tripped – 42 servers offline – Customer: Tier-1 hyperscaler
I’m already pulling on yesterday’s jeans before I finish reading. Coffee can wait. Sleep definitely can.
06:12 AM – Walking the “quiet corridor” between the security mantrap and the white-noise roar of the main floor. Thirty thousand servers hum like a continuous thunderstorm trapped in a concrete box. The temperature is exactly 22.0 °C. Someone, somewhere, will still open a ticket saying “it feels hot.”
I reach Row 17. The PDU (power distribution unit) that died is making the special kind of clicking noise that means “I’m not just dead, I’m angrily dead.” Red lights strobe like an 80s arcade game having a seizure.
06:28 AM – While I’m on the phone with the vendor (“Yes I already power-cycled it twice… yes I held the reset button for ten Mississippis…”) I hear the unmistakable POP followed by the smell of hot electronics from two aisles over.
I glance at my phone. New alert:
CRITICAL – Smoke detected – Zone 4-11
Of course.
06:41 AM – I arrive to find two very confused night-shift security guards staring at a server that’s gently smoking from the back like it just tried to hot-box itself. The machine is a six-year-old GPU node that was scheduled for decommissioning… next quarter. Apparently it decided to go out with fireworks instead.
The fire-suppression system is still in “pre-discharge” warning mode. Everyone is silently begging it not to dump FM-200 and ruin the morning.
I kill power to the entire half-rack, grab the nearest CO₂ extinguisher just in case, and start triage. Turns out a capacitor in the power supply finally gave up and turned into a tiny Roman candle.
07:19 AM – The hyperscaler account manager has already called. Three times. It’s somehow my fault that their machine-learning training run got interrupted 14 hours in. I explain – politely – that physics doesn’t accept change requests.
08:03 AM – Back in the break room. The coffee machine is broken again (third time this month). We now have a Post-it note on it that reads:
OUT OF ORDER – FIGHT ME
I drink Red Bull #1. It tastes like regret and battery acid.
09:40 AM – The morning change window starts. Today’s big ticket: swapping forty spine switches to new 400G models. Everything is scripted, rehearsed, triple-checked.
At 09:47 someone fat-fingers a BGP update and announces our entire /16 prefix with a next-hop of 127.0.0.1.
The internet briefly forgets we exist.
For eleven glorious minutes our ASN becomes the black hole of North America.
The network engineer who did it is already rage-typing “WITHDRAW WITHDRAW WITHDRAW” into the terminal. I’m just standing behind him holding moral-support Red Bull #2.
10:03 AM – Routing restored. Collective blood pressure drops from “stroke imminent” to “mild panic attack.”
11:22 AM – I find a raccoon living inside the cable tray in Zone 2.
Yes. A full-grown raccoon. Eating insulation off a fiber bundle like it’s cotton candy.
We stare at each other for a solid five seconds. He hisses. I hiss back (instinct). Then I radio facilities:
“Uh… we have a furry asset in the overhead. Repeat, live asset in cable tray.”
Facilities guy’s reply: “Is it paying rent?”
12:50 PM – Raccoon has been humanely relocated. We lost six strands of single-mode fiber. The replacement will take 48 hours. The storage array that was using those links is now running on a single degraded path.
Someone has already named the raccoon “Kevin” and started a Slack channel #justice-for-kevin.
02:30 PM – Lightning strikes a utility pole 0.8 miles away. The transfer switch thinks about switching to generator… then changes its mind. Lights flicker once. Every UPS in the building goes to battery for 3.7 seconds.
Fourteen servers decide this is the perfect moment to corrupt their RAID arrays.
I start opening incident tickets faster than I can read them.
04:10 PM – I’m sitting on the data-hall floor eating a gas-station burrito next to a rolling tool chest. My phone buzzes again.
P2 – Customer reports elevated latency – please advise
Latency is 0.8 ms higher than usual because we’re down to one fiber path thanks to Kevin.
I type the most diplomatic sentence of my career:
“We are experiencing a brief fiber-related impairment due to an earlier wildlife ingress event. Restoration ETA 48 hours.”
They reply with one word:
“wildlife?”
I send them a photo of Kevin sitting on top of a cable tray looking smug.
They reply with 😂😂😂 and “please keep us updated on Kevin”
05:40 PM – Shift relief arrives. I hand over the twelve active incidents, the half-eaten burrito, and Kevin’s unofficial Instagram account.
As I walk out the mantrap, the sun is already setting. I’ve been awake nearly twelve hours and solved exactly zero root causes.
But forty-two servers are back online, the prefix leak is contained, no FM-200 was dumped, and Kevin is (probably) eating someone else’s trash instead of ours.
That’s a win in this line of work.
I start the car, turn the AC to arctic, and whisper to no one in particular:
“See you tomorrow, you beautiful disaster.”
Then I drive home wondering if raccoons can unionize.
(They probably already have better benefits than me.)
