Names changed to protect the dead and innocent.

How I learned to stop worrying and love Kubernetes

The bright afternoon sun shines into my open office. It's 2:05 PM, I notice, as I stare blankly at my laptop.

"It's gone," says Fred, one of the senior DevOps engineers.

"What's gone?" I ask.

"ALL OF IT," Fred replies.

I fire up Chrome and type `https://player.bigcorp.tv`. Nothing. A white background in small Monaco font reads: "Server Error". I press F12 and check developer tools. The status code column shows a single 502. Our front-end is served from one service and the backend API, for which I am responsible, is served from another. From a terminal I hit the backend with a curl command to see if I can reach publicly accessible data. Another 502. I type `https://internal.bigcorp.tv/status` into the location bar. Our internal status page is also throwing the now familiar "Server Error."

"Where is Logan?" I ask.

"He left for lunch," replies Fred.

We log in to the root AWS account and start checking our super resilient, uncrashable Mesos infrastructure that was costing up to $100,000 a month. Mesos has worker nodes that need a connection to the control plane nodes to operate. Fred explains that the control plane nodes, running on virtual machines in AWS, had been corrupted and rebuilt that morning. The reason our infrastructure was uncrashable was that our lead DevOps engineer, Ryan, had extra control plane nodes running in Mesos itself. This self-hosted model allowed us to always spin up Mesos masters in such an event. The DevOps Manager, Logan, was away at lunch.

The control plane running in Mesos had split-brained and was unable to take over the leader role when the AWS nodes were rebuilt. Fred had thought restarting them would quickly synchronize all the control plane nodes, and if anything went wrong we had backups of the control plane data.
When they restarted, unbeknownst to Fred, they lost the only good copy of the master data. Docker containers have ephemeral disk space unless a volume from the host is mounted into them.

"The last backup was 4 months ago," Fred informs me.

"Does Will know?" I ask. Will is my boss and the Director of our Digital TV Product.

"IF THIS IS NOT UP IMMEDIATELY THEY ARE GOING TO FIRE ME AND ALL OF YOU," yells Will. As usual, Will's first concern is not earning income while his high-profile attorney wife cucks him. "WHERE THE FUCK ARE LOGAN AND BRADEN?" Braden is the mastermind of this Mesos architecture and could likely fix it; however, Braden is spending his vacation visiting Burning Man for the first time. His boss Logan is MIA at lunch. Will gives Logan a call and informs him the entire platform is down for customers and for every internal media management employee at the company.

2:45 PM. Logan enters the office and scowls at me. Surprisingly, he walks straight to where the action is. I half-expected him to retreat to the DevOps room and tinker with his modular synthesizer, letting everyone else clean up the mess. I've had multiple public fights with him about the cost of running this Mesos Frankenstein. My budget partially pays for the $100,000 a month infrastructure. "The point of a container orchestration platform is to scale down so we can spend LESS money," I would scream after noting the bill had gone up more than 10x a month after migrating to his Mesos infrastructure. Now, despite how awesome the Mesos platform is, and how all my teams would love it, it is down.

By 4:00 PM, afternoon bar patrons, I mean alcoholics, have been without bar TV for almost 3 hours. The only thing Logan has accomplished is pacing nervously around the office and occasionally breathing stale, hungover breath into Fred's ear. The head of North America is now hovering with Will, my boss. Unfortunately, the mobile and TV apps are crashing every time you tap or click the TV icon.
Our platform provides digital TV to a European TV product. Europe is asleep by now, but the customers, and our European execs, will definitely start calling by morning. If you have never been yelled at by a German exec, it's just as terrifying as old WW2 clips would have you believe. Everyone looks increasingly nervous.

"We can just run the services on bare AWS without Mesos," I suggest.

Logan has finally made contact with Braden via text. He stated the Mesostein was in jeopardy and that if they didn't fix it, he would certainly lose his job. Even though Braden had not slept the night before, and may or may not have had pharmaceutical assistance with not sleeping, he decided to hurriedly leave Burning Man and drive the 8 hours back to the office, targeting arrival before 11 PM.

The backend is a simple Golang app which is easy to run with a single command. I demo to Will and the NA lead how I can route the DNS directly to the box via auto scaling groups, which gets us scaling out of the box. The database for the backend was not impacted and is running in RDS, so this works, and we see TV show titles and M3U8 playlist URLs in a JSON blob. At 4:15 PM we have a strategy.

It didn't take long to get the front-end application, a simple Node.js app, running again. Hitting the staging URL, the site was back up, but not its internal management tools. At 5:15 PM we shout "HOORAY!" The site is back up.

Unfortunately, the service which serves the M3U8 playlists responsible for playing the video is a Java web service, whose lead has just left the company. In parallel, the video playout team had been trying to get the service running, but they are not familiar with the dark arts of the Linux CLI and the AWS console. Fred and the DevOps team are still trying to jolt Frankenmesos back to life, so they are of little help. A new engineer on the video team, frustrated with the verbosity and complexity of the service, had built a skunkworks project which generated M3U8 playlists for video.
It was missing the advertising stitching capability, but it would play video without ads if you pointed our video player at it. We demoed this to Will.

"We can just change the M3U8 URL in the database for the non-working video service to this service," we say.

"But it's completely untested," Will says.

"Yeah, but no video works. What do we have to lose?" I reply.

"Fuck it," he says.

We spent the next 4 hours spinning up new services by hand on AWS, SSHing into instances on public IP addresses and running the video service with nohup.

Around 7 PM I receive a phone call from Braden. He tells me to check my text messages. I look and see a message he's sent with a picture. The picture shows a small BMW hatchback connected to a small U-Haul. Both are totally destroyed, and the contents of the U-Haul are strewn about the highway, complete with a bunny onesie, with California mountains in the background.

Midnight.

We were now watching video on our player, streaming from a completely untested service. The database was updated with the new service URL just as our European customers started waking up to watch. The video team was now free to go home, and the DevOps team was directed to help them get the old video service running in the morning.

At 7 AM I arrive back in the office with a $5 pour-over coffee in hand. The video team beat me there. They were still trying to save face and get their application working after being bested by a junior engineer and a weekend project. Despite the panic of the previous day, they were close to getting back up and running.

I take the team and Fred out to a food truck for takeout lunch. We pass Logan as we approach the second office building.

"I'm out," he says.

"Leaving already after yesterday?" I ask, but shit like this isn't surprising for him.

"They fired me," he replies.

"I guess we will always have Mesos," I say.

"Hey, come over into my office." Will catches me off guard holding my butter chicken, which is burning my hand, but soon to burn my asshole.

"How would you like to run the DevOps team as well?" he says.

"Only if we can delete that hellspawn of an infrastructure," I reply.

Our next month's bill came to $7,000, a savings of $93,000 a month.

[Edited on January 6, 2025 at 9:50 PM. Reason : a]
1/6/2025 9:41:39 PM
Idk, somebody probably should have been fired for insisting on a 100k a month system that crashes. This stuff is really common, especially setting up Mesos (or Nomad).
1/7/2025 7:34:36 AM
I advocated for it for months but didn't have the political capital to make it happen until it blew up.
1/7/2025 8:19:56 AM
I am positive it will make more sense when I fully read it.
1/7/2025 11:30:26 PM
1/9/2025 3:01:58 PM
^Because knowing those technologies is how you get a job in big tech, so you have to trick a small tech company into using it so you can put it on your resume
1/11/2025 11:53:28 PM
^ very true, that's what people think and why people want to do it. In addition, they're praying for silver bullets to make their job easier.
In reality, to get the job it's grinding LeetCode for 6 months and getting a referral. Alternatively, being the 1 or 2 guys in the 100 person data structures class that got an A+ and destroyed everyone's curve also works.
1/12/2025 7:55:44 PM
What exactly is this
1/15/2025 7:39:14 PM
I posted this somewhere several months ago, but you might find it interesting. Kind of a low-stakes bug, not exactly a firefight or war story.

---

One of the applications I work on has an API endpoint for updating customer information. For whatever reason, updates to the name of the customer's company are passed in an HTTP request header named CompanyName.

Support escalated a case where a customer could not successfully sync their information to this endpoint. The API sits behind Cloudflare's Web Application Firewall. The WAF was rejecting the request with the reason "Invalid UTF-8 encoding."

Let's say for the sake of example that the name of the company is Télébec LP. We captured a request and saw these bytes for the value of the CompanyName header:
0x54 0xE9 0x6C 0xE9 0x62 0x65 0x63 0x20 0x4C 0x50
0x54 0xC3 0xA9 0x6C 0xC3 0xA9 0x62 0x65 0x63 0x20 0x4C 0x50
0x54        T
0xC3 0xA9   é
0x6C        l
0xC3 0xA9   é
0x62        b
0x65        e
0x63        c
0x20        (space)
0x4C        L
0x50        P
0x54 0x26 0x23 0x32 0x33 0x33 0x3B 0x6C 0x26 0x23 0x32 0x33 0x33 0x3B 0x62 0x65 0x63 0x20 0x4C 0x50
0x54   T
0x26   &
0x23   #
0x32   2
0x33   3
0x33   3
0x3B   ;
0x6C   l
0x26   &
0x23   #
0x32   2
0x33   3
0x33   3
0x3B   ;
0x62   b
0x65   e
0x63   c
0x20   (space)
0x4C   L
0x50   P
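For anyone who wants to poke at the three byte dumps above, here is a minimal Python sketch. The labels are a reading of the bytes themselves, not something stated in the post: the first dump decodes as Latin-1 (ISO-8859-1), the second is the UTF-8 encoding of the same name, and the third is plain ASCII that spells é as the HTML numeric character reference `&#233;`. The variable names and the `is_valid_utf8` helper are hypothetical, and none of this claims to mirror how Cloudflare's WAF performs its check.

```python
import html

# Byte sequences copied from the dumps above (variable names are illustrative).
latin1_bytes = bytes([0x54, 0xE9, 0x6C, 0xE9, 0x62, 0x65, 0x63, 0x20, 0x4C, 0x50])
utf8_bytes = bytes([0x54, 0xC3, 0xA9, 0x6C, 0xC3, 0xA9, 0x62, 0x65, 0x63, 0x20,
                    0x4C, 0x50])
entity_bytes = bytes([0x54, 0x26, 0x23, 0x32, 0x33, 0x33, 0x3B, 0x6C, 0x26, 0x23,
                      0x32, 0x33, 0x33, 0x3B, 0x62, 0x65, 0x63, 0x20, 0x4C, 0x50])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# 0xE9 is a UTF-8 lead byte for a 3-byte sequence, but the next byte (0x6C)
# is not a continuation byte, so the first dump is not valid UTF-8.
print(is_valid_utf8(latin1_bytes))
print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(entity_bytes))

# Decoding the first dump as Latin-1 and re-encoding as UTF-8 gives the second dump.
name = latin1_bytes.decode("latin-1")
print(name.encode("utf-8") == utf8_bytes)

# The third dump is pure ASCII; unescaping the character references recovers the name.
print(html.unescape(entity_bytes.decode("ascii")) == name)
```

Only the first dump fails strict UTF-8 decoding. The other two are both valid UTF-8, but the third only reproduces the name if the receiving side unescapes the character references.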
1/17/2025 9:28:08 AM
^ good one. Twitter used to famously ask variations on UTF-8 bugs in their interviews: "Write a function to validate 140 characters."

"In a previous role, what is a major incident or outage that occurred, and what was your role in fixing it? After fixing it, what are some impacts your contributions made?"

My answer to a common interview question and an attempt to make it more entertaining.

[Edited on January 17, 2025 at 9:30 AM. Reason : A]
1/17/2025 9:28:42 AM
????😮👍[Edited on January 17, 2025 at 12:37 PM. Reason : ]
1/17/2025 12:35:34 PM
just put it on Red Hat OpenShift Service on AWS
2/1/2025 12:07:54 PM
Gross. None of that existed when this happened.
2/2/2025 11:31:35 PM
when did this happen?
2/5/2025 10:09:18 AM
2016ish. Kubernetes was out but still new, so if you wanted a container orchestration thing it would likely be mesos.
2/5/2025 10:11:42 AM
word
I've been working on OpenShift since maybe 2018
2/5/2025 12:30:32 PM
How have the yaml mines treated you?
2/5/2025 12:38:03 PM
I'm in marketing (but I also create content for RHEL and Ansible)
I've used OpenShift before, though, and it is fun to talk about
one of the things we're quite proud of is defining and explaining technology concepts, like if you Google "yaml" you'll get this article: https://www.redhat.com/en/topics/automation/what-is-yaml
2/5/2025 11:56:28 PM
Propaganda Minister for the YAML mines. Sweet. That's some good SEO if you make it to the top of that query.
2/6/2025 9:34:15 AM
our content team takes an SEO-first approach
but white hat SEO - we answer real questions that real humans ask
no keyword stuffing and no Mr Beast-style bending the knee to the algorithm
just marketing that strives to be on the same tier of quality as our documentation and support
I've been on this team for a decade and I've been a manager of technical marketing content for 3 years
my article "What is middleware?" is a footnote in an amicus brief to the Supreme Court - I'm still very proud of that
2/7/2025 9:44:55 AM