Names changed to protect the dead and innocent.

How I learned to stop worrying and love Kubernetes

The bright afternoon sun shines into my open office. It's 2:05 PM, I notice, as I stare blankly at my laptop.

"It's gone," says Fred, one of the senior DevOps engineers.

"What's gone?" I ask.

"ALL OF IT," Fred replies.

I fire up Chrome and type `https://player.bigcorp.tv`. Nothing. A white background in small Monaco font reads: "Server Error". I press F12 and check developer tools. The status code column reads a singular 502. Our front-end is served from one service; the backend API, for which I am responsible, is served from another. From a terminal I curl the backend to see if I can hit publicly accessible data. Another 502. I type `https://internal.bigcorp.tv/status` into the location bar. Our internal status page is throwing the now familiar "Server Error" too.

"Where is Logan?" I ask.

"He left for lunch," replies Fred.

We log in to the root AWS account and start checking our super resilient, uncrashable Mesos infrastructure, which was costing up to $100,000 a month. Mesos worker nodes require a connection to the control plane nodes to operate. Fred explains that the control plane nodes, running on virtual machines in AWS, had been corrupted and rebuilt that morning. The reason our infrastructure was uncrashable was that our lead DevOps engineer, Ryan, ran extra control plane nodes inside Mesos itself. This self-hosted model meant we could always spin up new Mesos masters in exactly this kind of event.

The control plane running in Mesos had split-brained and was unable to take over the leader role when the AWS nodes were rebuilt. Fred had thought restarting those nodes would quickly resynchronize the control plane, and that if anything went wrong we had backups of the control plane data. Unbeknownst to Fred, the restart wiped the only good copy of the master data: Docker containers without a volume mounted from their host have only ephemeral disk.

"The last backup was 4 months ago," Fred informs me.

"Does Will know?" I ask. Will is my boss and the Director of our Digital TV Product.

"IF THIS IS NOT UP IMMEDIATELY THEY ARE GOING TO FIRE ME AND ALL OF YOU," yells Will. As usual, Will's first concern is the prospect of not earning income while his high-profile attorney wife cucks him. "WHERE THE FUCK ARE LOGAN AND BRADEN?" Braden is the mastermind of this Mesos architecture and could likely fix it; unfortunately, Braden is spending his vacation at Burning Man for the first time. His boss, Logan, the DevOps Manager, is MIA at lunch. Will gives Logan a call and informs him the entire platform is down for customers and for every internal media management employee at the company.

2:45 PM. Logan enters the office and scowls at me. Surprisingly, he walks straight to where the action is. I half-expected him to retreat to the DevOps room and tinker with his modular synthesizer, letting everyone else clean up the mess. I've had multiple public fights with him about the cost of running this Mesos Frankenstein. My budget partially pays for the $100,000-a-month infrastructure. "The point of a container orchestration platform is to scale down so we can spend LESS money," I would scream after noting the bill had gone up more than 10x a month after migrating to his Mesos infrastructure.
Now, despite how awesome the Mesos platform is, and how much all my teams would love it, it is down.

By 4:00 PM, afternoon bar patrons, I mean alcoholics, have been without bar TV for almost 3 hours. The only thing Logan has accomplished is pacing nervously around the office and occasionally breathing stale, hungover breath into Fred's ear. The head of North America is now hovering, along with Will, my boss. Unfortunately, the mobile and TV apps crash every time you tap or click the TV icon. Our platform provides digital TV to a European TV product. Europe is asleep by now, but the customers, and our European execs, will definitely start calling by morning. If you have never been yelled at by a German exec, it's just as terrifying as old WW2 clips would have you believe. Everyone looks increasingly nervous.

"We can just run the services on bare AWS without Mesos," I suggest.

Logan had finally made contact with Braden via text. He told him the Mesostein was in jeopardy and that if they didn't fix it, he would certainly lose his job. Even though Braden had not slept the night before, and may or may not have had pharmaceutical assistance with not sleeping, he decided to hurriedly leave Burning Man and drive the 8 hours back to the office, targeting arrival before 11 PM.

The backend is a simple Golang app which is easy to run with a single command. I demo to Will and the NA lead how I can route the DNS directly to the boxes via auto scaling groups, which gets us scaling out of the box. The database for the backend was not impacted and was running in RDS, so this works, and we see TV show titles and M3U8 playlist URLs in a JSON blob. At 4:15 PM we have a strategy.
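The whole backend, at least in spirit, was not much more than this. A minimal Go sketch from memory, with the handler path and field names invented:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Show stands in for the real records in RDS; the field names are invented.
type Show struct {
	Title       string `json:"title"`
	PlaylistURL string `json:"m3u8_url"`
}

func main() {
	http.HandleFunc("/api/shows", func(w http.ResponseWriter, r *http.Request) {
		// In the real app this came from the untouched RDS database.
		shows := []Show{
			{Title: "Example Show", PlaylistURL: "https://cdn.example.com/example/master.m3u8"},
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(shows)
	})

	// One binary, one port: easy to start on a bare EC2 box and leave running.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

One binary, one port: start it with something like `nohup ./backend &` on each box and point DNS at the auto scaling group.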
It didn't take long to get the front-end application, a simple Node.js app, running again. Hitting the staging URL, the site was back up, though not its internal management tools. At 5:15 PM we shout "HOORAY!" The site is back up. Unfortunately, the service which serves the M3U8 playlists responsible for playing the video is a Java web service whose lead has just left the company. In parallel, the video playout team had been trying to get that service running, but they are not familiar with the dark arts of the Linux CLI and the AWS console. Fred and the DevOps team are still trying to jolt Frankenmesos back to life, so they are of little help.

A new engineer on the video team, frustrated with the verbosity and complexity of the service, had built a skunkworks project which generated M3U8 playlists for video. It was missing the ad-stitching capability, but it would play video, without ads, if you pointed our video player at it. We demoed this to Will. "We can just change the M3U8 URL in the database for the non-working video service to this service," we say. "But it's completely untested," Will says. "Yeah, but no video works, so what do we have to lose?" I reply. "Fuck it," he says.

We spent the next 4 hours spinning up new services by hand on AWS, SSHing into boxes on public IP addresses and running the video service with nohup. Around 7 PM I receive a phone call from Braden. He tells me to check my text messages. I look and see a message he's sent with a picture: a small BMW hatchback hitched to a small U-Haul. Both are totally destroyed, the contents of the U-Haul strewn about the highway, complete with a bunny onesie and California mountains in the background.

Midnight. We were now watching video on our player, streaming from a completely untested service. The database was updated with the new service URL just as our European customers started waking up to watch. The video team was now free to go home, and the DevOps team was directed to help them get the old video service running in the morning.

At 7 AM I arrive back in the office with a $5 pour-over coffee in hand. The video team beat me there. They were still trying to save face and get their application working after being bested by a junior engineer and a weekend project. Despite the panic of the previous day, they were close to getting back up and running.

I take the team and Fred out to a food truck for takeout lunch. We pass Logan as we approach the second office building. "I'm out," he says. "Leaving already after yesterday?" I ask, but shit like this isn't surprising from him. "They fired me," he replies. "I guess we'll always have Mesos," I say.

"Hey, come over into my office." Will catches me off guard holding my butter chicken, which is burning my hand and is soon to burn my asshole. "How would you like to run the DevOps team as well?" he says. "Only if we can delete that hell spawn of an infrastructure," I reply.

Our next month's bill came to $7,000--a savings of $93,000 a month.

[Edited on January 6, 2025 at 9:50 PM. Reason : a]
1/6/2025 9:41:39 PM
Idk, somebody probably should have been fired for insisting on a 100k a month system that crashes. This stuff is really common, especially setting up Mesos (or Nomad).
1/7/2025 7:34:36 AM
I advocated for it for months but didn't have the political capital to make it happen until it blew up.
1/7/2025 8:19:56 AM
I am positive it will make more sense when I fully read it.
1/7/2025 11:30:26 PM
1/9/2025 3:01:58 PM
^Because knowing those technologies is how you get a job in big tech, so you have to trick a small tech company into using them so you can put them on your resume
1/11/2025 11:53:28 PM
^ Very true, that's what people think and why people want to do it. In addition, praying for silver bullets to make their job easier.

In reality, to get the job it's grinding leetcode for 6 months and getting a referral. Alternatively, being one of the 1 or 2 guys in the 100-person data structures class who got an A+ and destroyed everyone's curve also works.
1/12/2025 7:55:44 PM
What exactly is this
1/15/2025 7:39:14 PM
I posted this somewhere several months ago, but you might find it interesting. Kind of a low-stakes bug, not exactly a firefight or war story.

---

One of the applications I work on has an API endpoint for updating customer information. For whatever reason, updates to the name of the customer's company are passed as a header in the HTTP request named CompanyName.

Support escalated a case where a customer could not successfully sync their information to this endpoint. The API sits behind Cloudflare's Web Application Firewall. The WAF was rejecting the request with the reason "Invalid UTF-8 encoding."

Let's say for the sake of example that the name of the company is Télébec LP. We captured a request and saw these bytes for the value of the CompanyName header:
0x54 0xE9 0x6C 0xE9 0x62 0x65 0x63 0x20 0x4C 0x50
0x54 0xC3 0xA9 0x6C 0xC3 0xA9 0x62 0x65 0x63 0x20 0x4C 0x50
0x54 (T)  0xC3 0xA9 (é)  0x6C (l)  0xC3 0xA9 (é)  0x62 (b)  0x65 (e)  0x63 (c)  0x20 (space)  0x4C (L)  0x50 (P)
0x54 0x26 0x23 0x32 0x33 0x33 0x3B 0x6C 0x26 0x23 0x32 0x33 0x33 0x3B 0x62 0x65 0x63 0x20 0x4C 0x50
0x54 (T)  0x26 (&)  0x23 (#)  0x32 (2)  0x33 (3)  0x33 (3)  0x3B (;)  0x6C (l)  0x26 (&)  0x23 (#)  0x32 (2)  0x33 (3)  0x33 (3)  0x3B (;)  0x62 (b)  0x65 (e)  0x63 (c)  0x20 (space)  0x4C (L)  0x50 (P)
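For anyone who wants to poke at it, here's a quick Go check I put together just for illustration (not anything from the actual capture). A bare 0xE9 is é in Latin-1 but is not a valid UTF-8 sequence, which matches the WAF's "Invalid UTF-8 encoding" rejection; 0xC3 0xA9 is what a correctly UTF-8-encoded é looks like.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// "Télébec LP" encoded as Latin-1 (ISO-8859-1): é is the single byte 0xE9.
	latin1 := []byte{0x54, 0xE9, 0x6C, 0xE9, 0x62, 0x65, 0x63, 0x20, 0x4C, 0x50}

	// "Télébec LP" encoded as UTF-8: é is the two-byte sequence 0xC3 0xA9.
	utf8Enc := []byte{0x54, 0xC3, 0xA9, 0x6C, 0xC3, 0xA9, 0x62, 0x65, 0x63, 0x20, 0x4C, 0x50}

	fmt.Println(utf8.Valid(latin1))  // false: 0xE9 starts a multi-byte sequence that never completes
	fmt.Println(utf8.Valid(utf8Enc)) // true
}
```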
1/17/2025 9:28:08 AM
^ Good one. Twitter used to famously ask variations on UTF-8 bugs in their interviews: "Write a function to validate 140 characters."

"In a previous role, what is a major incident or outage that occurred, and what was your role in fixing it? After fixing it, what impact did your contributions make?"

My answer to a common interview question, and an attempt to make it more entertaining.

[Edited on January 17, 2025 at 9:30 AM. Reason : A]
1/17/2025 9:28:42 AM
????😮👍[Edited on January 17, 2025 at 12:37 PM. Reason : ]
1/17/2025 12:35:34 PM
just put it on Red Hat OpenShift Service on AWS
2/1/2025 12:07:54 PM
Gross. None of that existed when this happened.
2/2/2025 11:31:35 PM
when did this happen?
2/5/2025 10:09:18 AM
2016ish. Kubernetes was out but still new, so if you wanted a container orchestration thing it would likely be Mesos.
2/5/2025 10:11:42 AM
word

I've been working on OpenShift since maybe 2018
2/5/2025 12:30:32 PM
How have the yaml mines treated you?
2/5/2025 12:38:03 PM
I'm in marketing (but I also create content for RHEL and Ansible)

I've used OpenShift before, though, and it is fun to talk about

one of the things we're quite proud of is defining and explaining technology concepts

like if you Google "yaml" you'll get this article: https://www.redhat.com/en/topics/automation/what-is-yaml
2/5/2025 11:56:28 PM
Propaganda Minister for the YAML mines. Sweet. That's some good SEO if you make it to the top of that query.
2/6/2025 9:34:15 AM
our content team takes an SEO-first approach

but white hat SEO - we answer real questions that real humans ask

no keyword stuffing and no Mr Beast-style bending the knee to the algorithm

just marketing that strives to be on the same tier of quality as our documentation and support

I've been on this team for a decade and I've been a manager of technical marketing content for 3 years

my article "What is middleware?" is a footnote in an amicus brief to the Supreme Court - I'm still very proud of that
2/7/2025 9:44:55 AM
idk how I missed this thread, but good reads all around
7/24/2025 11:46:55 AM
more like memorials of themselves
7/24/2025 6:15:13 PM
lol u rite
7/25/2025 11:53:24 AM
Hm, something fucked up the character entities in my post. I had very carefully encoded them so the entity references wouldn't be rendered as the characters, but something seems to have changed them.
7/29/2025 9:16:49 AM
When money was free and interest rates were low...

"What do you mean we don't need a CDN for a video platform?" I reply to my boss. I'm currently the lead on another technical project at the company, but I've been pulled in to advise our new streaming platform.

Currently, the V1 video platform is a Ruby on Rails app. It just generates a plain HTML website with images for the video content. Cards, as they say in the biz. The cards are cached in a CDN, Akamai, so our computers are not overloaded when people are browsing the website. The video files, which the web players stream, are also stored in Akamai as the editors upload them. It's a very CDN-centric use case, but my boss has grander ambitions for a V2. The company is currently spending a couple of engineers' salaries per month on Akamai, and my boss, who has a PhD in physics and worked at CERN, is convinced we can save that Akamai money.

The company has a global movie release they want to stream like a regular linear TV channel. "TV ON THE INTERNET." A live stream of a movie release, aired around the world sequentially by time zone. So at 9 PM Eastern time the movie airs, then at 9 PM Central for Chicago, and so on. They're basically recreating linear TV, but digitally, on the Internet. The marketing for this is in the tens of millions of dollars and they expect up to 100 million viewers. And instead of just using Akamai, they want to build an untested new version of the TV platform from the ground up. This is not to say my boss is dumb, but he is perhaps suffering from a small case of hubris. This is what happens when you put a CERN physicist in charge of a web application a state school dropout could build.
1/29/2026 8:41:39 AM
The super duper high-scale TV dynamic microservices platform was ready for its first major load test with three months left on the clock. We were going to simulate just 10,000 concurrent viewers on the system. My boss, the architect and visionary, was beaming ear to ear. 100 high-memory, high-CPU instances in AWS were ready to go. Traffic ramps up: 10 simulated users, then 100. We see the screen start to turn red, showing 500 errors and failures in the backend. Mr. CERN, who was beaming, now looks like someone has shot his mother in front of him.

"TWO requests per second," a DevOps engineer says.

"That can't be right, we have 100 instances. It must be a configuration issue."

"If it's slow maybe we can just use Akamai," the department head replies.

The knife twists.

It was a pattern. The database no one questioned. The 25-megabyte payload product demanded. The microservices architecture built on network calls. Given the disaster, leadership shifted my boss from technical and people leader to just architect. They hired a director to handle the people side. He deferred to Mr. CERN on everything technical and seemed content to let the clock run down.

We rescheduled the demo to help my boss work out the kinks. Strolling by his office daily, $5 cold brew in hand, I'd see a look of exasperation and bewilderment. One day I stopped by to ask if he was okay. He tells me he still can't get the site to scale past 4 requests per second, even after spending 16 hours a day trying to debug it. He claims to have made 100 changes that should have helped, all futile. I repeat my concerns about network latency, and he shows me a Java profiler he has hooked up trying to find the problem. Nothing jumps out as a smoking gun. I tell him profilers do not show network bottlenecks; you need to instrument the code and time how long each call takes. He asks me to help him look into it.

The first thing I do is use a parser generator, ANTLR, to automatically add print statements to the code base logging how long each call takes. This is a fairly crude and hacky way to do performance profiling, but it works. When I run the application and simulate some traffic, I notice it is still slow even locally, and the application uses all available CPUs on my machine. If it were just a network issue, it would not be turning my laptop into the surface of the sun. Maybe I was wrong. When I sum the timings, the largest culprit is the toString method on the objects containing the layout and title descriptions for the movie cards. When I bring this to my boss's attention, he just points to his sacrosanct profiler output. He was going to need more convincing.

After a few hours of Googling, I discovered that Java performance issues sometimes don't show up in Java profilers. The CPU gets so busy that the profiler drops information and the picture gets cloudy. Brendan Gregg became famous diagnosing these kinds of issues with flamegraphs; it landed him a job at Netflix. His tool of choice was Linux `perf`, which showed what the CPU was actually doing. He also had techniques for measuring off-CPU time, those network calls I kept thinking about. I was going to find the problem.

My research suggested that if the CPU is hot and spinning, it's probably not network calls; network bottlenecks show low CPU utilization. I fired up `perf record` while running the test and got some cool visualizations. They corroborated my print-statement report: the bulk of the time was spent in toString calls, specifically JSON serialization and deserialization. Those 25 MB payloads were causing the CPU to do more work than the network transfer itself.
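The instrumentation itself was nothing clever. Stripped of the ANTLR plumbing and translated out of Java into Go for brevity (the type and field names here are invented), the idea was roughly this sketch: wrap each call in a wall-clock timer and see where a payload in the 25 MB range actually spends its time.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// Card stands in for the layout/title objects on the real (Java) service;
// the type and field names are invented for illustration.
type Card struct {
	Title       string   `json:"title"`
	Description string   `json:"description"`
	Layout      []string `json:"layout"`
}

// timed is the crude print-statement instrumentation: run fn and log wall-clock time.
func timed(name string, fn func()) {
	start := time.Now()
	fn()
	fmt.Printf("%-10s %v\n", name, time.Since(start))
}

func main() {
	// Build a payload roughly in the 25 MB range described above.
	cards := make([]Card, 50_000)
	for i := range cards {
		cards[i] = Card{
			Title:       fmt.Sprintf("Movie %d", i),
			Description: strings.Repeat("synopsis ", 50),
			Layout:      []string{"hero", "row", "grid"},
		}
	}

	var blob []byte
	timed("marshal", func() { blob, _ = json.Marshal(cards) })

	var out []Card
	timed("unmarshal", func() { _ = json.Unmarshal(blob, &out) })

	// Every hop between microservices repeats both steps on the same blob,
	// so the CPU burns on serialization rather than waiting on the network.
	fmt.Printf("payload size: %.1f MB\n", float64(len(blob))/1e6)
}
```

Crude next to `perf`, but summing these prints per request was enough to point at serialization rather than the network.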
This surprised me. Each method converted into a service call caused that huge payload to be serialized and deserialized. In dynamic microservice mode, the system just spun cycles moving responses between boxes. Adding recommendations for each movie? Another call. Timezone-specific information? Another call. The whole idea of serializing a full response so a webapp could read a record from a database was flawed.

The clock now shows 45 days to launch. Millions of dollars of marketing budget for the movie are on the line. People could lose their $10,000-a-month luxury apartments. My boss, the architect, was forever obstinate. I presented my new perf record graphs. They showed that JSON serialization accounted for 99.999% of the work done by the application. We needed a way to make the payloads smaller and to stop passing them from machine to machine to augment them with metadata. He was convinced it was something else, since each machine would be available to do the work.

I talked to the new director. He didn't want to rock the boat; he has a family to support and just wanted to let it play out. Luckily, he did relay my concerns to leadership and the department head. They asked for my opinion. I showed them the data. I explained we needed a major architectural change, but the architect who'd made the decisions wouldn't budge. They made a suggestion: what about going back to the CDN model with Akamai? "Since we can't rebuild the entire platform again, that's probably a good fallback," I say.

Over the next two months we finally convinced our boss to turn off the dynamic microservices feature. With it off, the system works like a webapp that does the work on the server that receives the request. This increases throughput to a whopping 100 requests per second across 100 nodes. With caching in front, that will let us survive; we effectively run the next load tests against Akamai. The site stays up and survives the tests this time. My boss is gutted.

The night of the huge movie release is finally here. All 120 employees are hanging around to watch what happens. Some because they have nothing else to do, others the way you watch a NASCAR race: for the accidents. The first timezone rolls around, and we have it enabled in the office so we can watch the movie as if we were in that timezone. Shouts of hooray overtake the movie's audio as the opening credits roll. My boss was staring blankly at his shoes while everyone cheered. This was supposed to be his moment. Little did I know a conversation had already happened; he was really attending his own wake as an observer, not a celebration of his success.

It's now the all-hands after the movie launch. The head of the company addresses us: "Remember last year when I said WE ARE TV." That was a dark time in my life.
1/29/2026 8:42:35 AM
1/29/2026 7:28:06 PM
^^ the entire stack sounds like a garbage heap of the absolute worst decisions. also, if it has to live any amount of time further, consider OpenTelemetry
1/30/2026 1:30:22 PM
This was 10 years ago. The architect was fired and it was rebuilt as V3, a small Golang app that only took 25 MB of RAM to run, which launched right after the Kubernetes story above.
1/30/2026 2:35:19 PM
Good stories
2/1/2026 7:54:33 PM
^^ ahhh makes way more sense
2/2/2026 2:14:46 PM