Message Boards » Tech Talk » Fire Fights and War Stories
CaelNCSU
All American
7855 Posts

Names changed to protect the dead and innocent.

How I learned to stop worrying and love Kubernetes

The bright afternoon sun shines into my open office. It's 2:05 PM, I notice, as I stare blankly at my laptop.

"It's gone," says Fred, one of the senior DevOps engineers.

"What's gone?" I ask.

"ALL OF IT," Fred replies.

I fire up Chrome and type `https://player.bigcorp.tv`. Nothing. A white background in small Monaco font reads: "Server Error". I press F12 and check developer tools. The status code column reads a singular 502. Our front-end is served from one service; the backend API, for which I am responsible, is served from another. From a terminal I curl the backend to see if I can hit publicly accessible data. Another 502. I type `https://internal.bigcorp.tv/status` into the location bar; our internal status page is throwing the now familiar "Server Error" too. "Where is Logan?" I ask. "He left for lunch," replies Fred.

We log in to the root AWS account and start checking our super resilient, uncrashable Mesos infrastructure, which was costing up to $100,000 a month. Mesos worker nodes need a connection to the control plane nodes to operate. Fred explains that the control plane nodes, running on virtual machines in AWS, had been corrupted and rebuilt that morning. The reason our infrastructure was uncrashable was that our lead DevOps engineer, Ryan, kept extra control plane nodes running inside Mesos itself. This self-hosted model meant we could always spin up new Mesos masters in exactly this kind of event. The DevOps manager, Logan, was away at lunch.

The control plane running inside Mesos had split-brained and was unable to take over the leader role when the AWS nodes were rebuilt. Fred had assumed restarting them would quickly resynchronize all the control plane nodes, and that if anything went wrong we had backups of the control plane data. In doing so, unbeknownst to Fred, we lost the only good copy of the master data: Docker containers have ephemeral disk unless they mount a volume from their host. "The last backup was 4 months ago," Fred informs me. "Does Will know?" I ask. Will is my boss and the Director of our Digital TV product.

"IF THIS IS NOT UP IMMEDIATELY THEY ARE GOING TO FIRE ME AND ALL OF YOU," yells Will. As usual the concern of Will not earning income while his high profile attorney wife cucks him is first of his concerns. "WHERE THE FUCK ARE LOGAN AND BRADEN?" Braden is the mastermind of this Mesos Architecture, likely he could fix it, however, Braden is spending his vacation by visiting Burning Man for the first time. His boss Logan is MIA at lunch. Will gives Logan a call and informs him the entire platform is down for customers and every internal media management employee at the company.

2:45 PM.

Logan enters the office and scowls at me. Surprisingly, he walks straight to where the action is. I half-expected him to retreat to the DevOps room and tinker with his modular synthesizer, letting everyone else clean up the mess. I've had multiple public fights with him about the cost of running this Mesos Frankenstein. My budget partially pays for the $100,000-a-month infrastructure. "The point of a container orchestration platform is to scale down so we can spend LESS money," I would scream after noting the monthly bill had gone up more than 10x after migrating to his Mesos infrastructure. Now, despite how awesome the Mesos platform is, and how much all my teams would love it, it is down.

By 4:00 PM the afternoon bar patrons, I mean alcoholics, have been without bar TV for almost three hours. The only thing Logan has accomplished is pacing nervously around the office and occasionally breathing stale, hungover breath into Fred's ear. The head of North America is now hovering, along with Will, my boss. Unfortunately, the mobile and TV apps crash every time you tap or click the TV icon. Our platform provides digital TV for a European TV product. Europe is asleep by now, but the customers, and our European execs, will definitely start calling by morning. If you have never been yelled at by a German exec, it's just as terrifying as old WW2 clips would have you believe. Everyone looks increasingly nervous. "We can just run the services on bare AWS without Mesos," I suggest.

Logan had finally made contact with Braden via text. He told him the Mesostein was in jeopardy and that if they didn't fix it, he would certainly lose his job. Even though Braden had not slept the night before, and may or may not have had pharmaceutical assistance in not sleeping, he decided to hurriedly leave Burning Man and drive the eight hours back to the office, targeting arrival before 11 PM.

The backend is a simple Golang app that's easy to run with a single command. I demo to Will and the NA lead how we can point DNS directly at the boxes and use auto scaling groups, which gets us scaling out of the box. The backend's database was not impacted and is running in RDS, so this works, and we see TV show titles and M3U8 playlist URLs in a JSON blob. At 4:15 PM we have a strategy.

It didn't take long to get the front-end application running again; it was a simple Node.js app. Hitting the staging URL, the site was back up, though not its internal management tools. At 5:15 PM we shout "HOORAY!" The site is back up. Unfortunately, the service that serves the M3U8 playlists responsible for playing video is a Java web service whose lead had just left the company. In parallel, the video playout team had been trying to get that service running, but they are not familiar with the dark arts of the Linux CLI and the AWS console. Fred and the DevOps team are still trying to jolt Frankenmesos back to life, so they are of little help.

A new engineer on the video team, frustrated with the verbosity and complexity of the service, had built a skunkworks project that generated M3U8 playlists for video. It was missing the ad-stitching capability, but it would play video, without ads, if you pointed our video player at it. We demo this to Will. "We can just change the M3U8 URL in the database from the non-working video service to this one," we say. "But it's completely untested," Will says. "Yeah, but no video works at all. What do we have to lose?" I reply. "Fuck it," he says.

We spend the next four hours spinning up new services by hand on AWS, SSHing into boxes on public IP addresses and running the video service with nohup. Around 7 PM I receive a phone call from Braden. He tells me to check my text messages. I look and see a message he's sent with a picture. The picture shows a small BMW hatchback hitched to a small U-Haul trailer. Both are totally destroyed, the contents of the U-Haul strewn about the highway, complete with a bunny onesie and California mountains in the background.

Midnight.

We were now watching video on our player, streaming from a completely untested service. The database was updated with the new service URL just as our European customers started waking up to watch. The video team was now free to go home, and the DevOps team was directed to help them get the old video service running in the morning.

At 7 AM I arrive back in the office with a $5 pour-over coffee in hand. The video team beat me there, still trying to save face and get their application working after being bested by a junior engineer's weekend project. Despite the panic of the previous day, they were close to getting back up and running.

I take the team and Fred out for takeout from a food truck. We pass Logan as we approach the second office building. "I'm out," he says. "Leaving already after yesterday?" I ask, though shit like this isn't surprising for him. "They fired me," he replies. "I guess we will always have Mesos," I say. "Hey, come over to my office." Will catches me off guard, holding my butter chicken, which is burning my hand and soon to burn my asshole. "How would you like to run the DevOps team as well?" he says. "Only if we can delete that hell spawn of an infrastructure," I reply.

Our next month's bill came to $7,000, a savings of $93,000 a month.


[Edited on January 6, 2025 at 9:50 PM. Reason : a]

1/6/2025 9:41:39 PM

qntmfred
retired
42129 Posts

Idk, somebody probably should have been fired for insisting on a 100k a month system that crashes. This stuff is really common, especially setting up Mesos (or Nomad).

1/7/2025 7:34:36 AM

CaelNCSU
All American
7855 Posts

I advocated for it for months but didn't have the political capital to make it happen until it blew up.

1/7/2025 8:19:56 AM

StTexan
God bless the USA!
12138 Posts

I am positive it will make more sense when I fully read it.

Quote :
"The story recounts a chaotic day at a tech company when their entire platform, built on a costly and complex Mesos infrastructure, crashes. The outage disrupts both customer services and internal tools, sparking panic and desperation.

Key events:
• The Problem: A corrupted control plane caused a cascade of failures, exacerbated by outdated backups and Mesos’ overcomplicated setup.
• Attempts to Fix: The DevOps team struggles to revive the platform while key personnel, including the system architect (at Burning Man) and the DevOps manager (at lunch), are unavailable.
• Creative Solutions: The protagonist proposes running services directly on AWS. This works for some systems, but a critical video service remains down. A junior engineer’s untested project is used as a stopgap to restore video streaming without ads.
• The Fallout: Despite some successes, the DevOps manager is fired, and the protagonist is offered the chance to lead the team. They accept on the condition that the bloated infrastructure is scrapped.

Outcome: The team rebuilds a simpler, more cost-effective setup on AWS, reducing monthly costs from $100,000 to $7,000."


[Edited on January 7, 2025 at 11:37 PM. Reason : Yep]

1/7/2025 11:30:26 PM

smoothcrim
Universal Magnetic!
19002 Posts

Quote :
" "We can just run the services on bare AWS without Mesos," I suggest. "

I have fought this fight multiple times. I have no idea why these people insist on the leaning tower of abstractions. k8s and Mesos are pretty trash and offer zero value in public cloud, especially if you're only using one cloud.

1/9/2025 3:01:58 PM

moron
All American
35599 Posts

^
Because knowing those technologies is how you get a job in big tech, so you have to trick a small tech company into using them so you can put it on your resume

1/11/2025 11:53:28 PM

CaelNCSU
All American
7855 Posts

^ Very true, that's what people think and why they want to do it. That, and praying for silver bullets to make their job easier.

In reality, getting the job means grinding LeetCode for six months and getting a referral. Alternatively, being one of the one or two people in the 100-person data structures class who got an A+ and destroyed everyone's curve also works.

1/12/2025 7:55:44 PM

emnsk
All American
3439 Posts

What exactly is this

1/15/2025 7:39:14 PM

FroshKiller
All American
51970 Posts

I posted this somewhere several months ago, but you might find it interesting. Kind of a low-stakes bug, not exactly a firefight or war story.

---

One of the applications I work on has an API endpoint for updating customer information. For whatever reason, updates to the name of the customer's company are passed as a header in the HTTP request named CompanyName.

Support escalated a case where a customer could not successfully sync their information to this endpoint. The API sits behind Cloudflare's Web Application Firewall. The WAF was rejecting the request with the reason "Invalid UTF-8 encoding."

Let's say for the sake of example that the name of the company is Télébec LP. We captured a request and saw these bytes for the value of the CompanyName header:

0x54 0xE9 0x6C 0xE9 0x62 0x65 0x63 0x20 0x4C 0x50


In UTF-8, the character é is two bytes: 0xC3 0xA9. We don't see those bytes here. So whatever the heck this is, it isn't UTF-8, and the WAF was right to block it according to that ruleset.

The expected UTF-8 encoding is this:

0x54 0xC3 0xA9 0x6C 0xC3 0xA9 0x62 0x65 0x63 0x20 0x4C 0x50


Or to align it with the characters:


0x54 T
0xC3 0xA9 é
0x6C l
0xC3 0xA9 é
0x62 b
0x65 e
0x63 c
0x20 (space)
0x4C L
0x50 P


What's interesting here is that all the bytes apart from the ones representing the character é are identical. Whatever this encoding is, it seems to be a fixed-length encoding where each character is a single byte and which has some overlap with UTF-8 when it comes to representing your common Latin alphabet characters and the space character.

There's a pretty well known fixed-length encoding that overlaps with UTF-8 like that: ASCII, or US-ASCII if you prefer. US-ASCII doesn't have a representation of the character é, but there are a few extended ASCII encodings that do, namely ISO-8859-1 and its bastard cousin, Windows-1252. Both of these encode the character é as the single byte 0xE9, which is not a valid UTF-8 byte sequence on its own, hence Cloudflare's objection.
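
If you want to check this yourself, a few lines of Python reproduce both byte sequences (illustrative only; the actual client is .NET):

name = "Télébec LP"
print(name.encode("windows-1252"))   # b'T\xe9l\xe9bec LP'        -- the bytes we captured on the wire
print(name.encode("utf-8"))          # b'T\xc3\xa9l\xc3\xa9bec LP' -- what the WAF will accept

wire = bytes([0x54, 0xE9, 0x6C, 0xE9, 0x62, 0x65, 0x63, 0x20, 0x4C, 0x50])
try:
    wire.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")         # the WAF's "Invalid UTF-8 encoding", in effect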

But we're in a pickle here. You might think it'd be simple to serialize the string with UTF-8 rather than what I suspect is Windows-1252 since the request is coming from an older .NET Framework application hosted on Windows Server, but it's not that simple. We are stuck with a particular HTTP client implementation. We are not at liberty to change the dependency or add a new one. This serialization behavior is a bug in the HTTP client library that we can't change or fix ourselves.

My teammate working on the problem suggested that we could Base64-encode the string. This is appealing at first. Base64 only uses printable ASCII characters, so it's safe according to the strict HTTP header spec and overlaps completely with UTF-8, so the WAF will be satisfied. But it would mean that the API endpoint itself would have to decode the header value, which means a code change in two places. Worse, either all other client implementations would have to be updated to encode the value or the server would have to detect whether the value requires decoding. If we got that part wrong, "Télébec LP" would become "VMOpbMOpYmVjIExQ" instead.
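
For what it's worth, the Base64 value is easy to verify, along with why a decoding mismatch is so ugly (Python here purely as a scratchpad; the real code is .NET):

import base64

name = "Télébec LP"
b64 = base64.b64encode(name.encode("utf-8")).decode("ascii")
print(b64)                                    # VMOpbMOpYmVjIExQ -- header-safe, but meaningless to a human
print(base64.b64decode(b64).decode("utf-8"))  # Télébec LP -- only if every server knows to decode it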

One thing we could do is try to transcode the string to ASCII using some kind of a replacement strategy for the unsupported characters. By default in .NET, that'd result in "Télébec LP" becoming "T?l?bec LP" (note the question marks), and that sucks. We could approximate it with a transliteration, something like "Telebec LP" or even "Te'le'bec LP" if you like, but these also suck in my opinion. Would you know what the question mark is supposed to be? Would you know whether e' was literally e' or meant to be é (or something else completely)?

What we need is a way to escape the string so that it consists entirely of valid US-ASCII characters. If the string doesn't contain any characters outside the US-ASCII set, it shouldn't even change. And if there are escaped characters, we should still be able to recognize the non-escaped characters. And ideally, we should be able to tell the escaped characters are escaped and what they should actually be.

It turns out there's a good encoding for this already: numeric character references, specifically the type you've probably seen in HTML and XML before. The character é can be represented as "&#233;" this way. The 233 is the decimal value of 0xE9, the extended ASCII code point of the character, and corresponds to its code point in the Universal Coded Character Set (UCS). In .NET, we can use the HttpUtility class's HtmlEncode and HtmlDecode methods to handle encoding and decoding strings with values outside the US-ASCII range.
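
For anyone who wants to see the escape/unescape round trip without a .NET project handy, the rough Python equivalent looks like this (HtmlEncode also escapes HTML-significant characters like & and <, so this is an analogue, not an exact match):

import html

name = "Télébec LP"

# Anything outside US-ASCII becomes a decimal numeric character reference.
escaped = name.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)                  # T&#233;l&#233;bec LP
print(escaped.encode("utf-8"))  # pure ASCII bytes, which are also valid UTF-8 as far as the WAF cares

# On the API side, resolving the references restores the original string.
print(html.unescape(escaped))   # Télébec LP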

That means "Télébec LP" becomes "Télébec LP" in the header, or in terms of specific bytes:

0x54 0x26 0x23 0x32 0x33 0x33 0x3B 0x6C 0x26 0x23 0x32 0x33 0x33 0x3B 0x62 0x65 0x63 0x20 0x4C 0x50


Aligned with the characters:


0x54 T
0x26 &
0x23 #
0x32 2
0x33 3
0x33 3
0x3B ;
0x6C l
0x26 &
0x23 #
0x32 2
0x33 3
0x33 3
0x3B ;
0x62 b
0x65 e
0x63 c
0x20 (space)
0x4C L
0x50 P


Breaking this down in sequence:

1. We start with a string which may (or may not!) contain characters that cannot be represented in printable ASCII.
2. We use HttpUtility.HtmlEncode to escape any characters with equivalent numeric character references in the UCS, i.e. replace those characters with escape sequences composed of US-ASCII characters.
3. We give that string to the buggy HTTP client implementation when we set the CompanyName header. It still serializes the string to bytes using the wrong encoding (Windows-1252), but now the output is ASCII-compatible, because Windows-1252 and ASCII overlap in the range of characters used here.
4. The request sails through the Cloudflare WAF with ease, because the header's value is now a valid UTF-8 sequence--again, thanks to overlap between UTF-8 and US-ASCII in this range.

This preserves the intended name of the company yet sets the header in a way that satisfies the WAF's ruleset. The only downside is that we do technically need to make a change in our API to decode the header value, but even if we didn't do that, we would still get a valid string from the header that a human could recognize and correct. And while that human might not know off the top of their head that "&#233;" represents the character é, it's still better than seeing a question mark and having no idea what character was intended.

So that's the hack. Now, what's the right solution? Well, we shouldn't be putting stuff like this in HTTP headers. It belongs in the request's body, where we can use whatever bytes we want, and we should specify the encoding to be used in the Content-Type header.

Of course, we can't break anything by changing the existing endpoint. That means it'd be a new endpoint that client applications would have to add support for, which means we'd still be on the hook to maintain the older endpoint until clients moved to the new one. But the hack allows us to move forward with minimal changes on both sides. And if we add the new endpoint right away, then the next client application that runs into this problem has one clear solution: use the new endpoint!

1/17/2025 9:28:08 AM

CaelNCSU
All American
7855 Posts

^ good one

Twitter used to famously ask variations on UTF-8 bugs in their interviews: "Write a function to validate 140 characters."
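
Something along these lines, give or take Twitter's real counting rules (a Python sketch, not their actual validator):

def valid_tweet(data: bytes, limit: int = 140) -> bool:
    """True if data is well-formed UTF-8 and at most `limit` characters (code points)."""
    try:
        text = data.decode("utf-8")  # rejects invalid or overlong byte sequences
    except UnicodeDecodeError:
        return False
    return 0 < len(text) <= limit

print(valid_tweet("Télébec LP".encode("utf-8")))  # True
print(valid_tweet(bytes([0x54, 0xE9, 0x6C])))     # False -- a bare Windows-1252 0xE9 byte isn't UTF-8
print(valid_tweet(("x" * 141).encode("utf-8")))   # False -- too long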

"in a previous role, what is major incident or outage that occurred and your role in fixing it. After fixing it, what are some impacts your contributions made?"

My answer to a common interview question and an attempt to make it more entertaining.

[Edited on January 17, 2025 at 9:30 AM. Reason : A]

1/17/2025 9:28:42 AM

moron
All American
35599 Posts

????
😮�👍�

[Edited on January 17, 2025 at 12:37 PM. Reason : ]

1/17/2025 12:35:34 PM

Snewf
All American
64204 Posts

just put it on Red Hat OpenShift Service on AWS

2/1/2025 12:07:54 PM

CaelNCSU
All American
7855 Posts

Gross. None of that existed when this happened.

2/2/2025 11:31:35 PM

Snewf
All American
64204 Posts

when did this happen?

2/5/2025 10:09:18 AM

CaelNCSU
All American
7855 Posts

2016ish. Kubernetes was out but still new, so if you wanted a container orchestration thing it would likely be Mesos.

2/5/2025 10:11:42 AM

Snewf
All American
64204 Posts

word

I've been working on OpenShift since maybe 2018

2/5/2025 12:30:32 PM

CaelNCSU
All American
7855 Posts

How have the yaml mines treated you?

2/5/2025 12:38:03 PM

Snewf
All American
64204 Posts

I'm in marketing (but I also create content for RHEL and Ansible)

I've used OpenShift before, though, and it is fun to talk about

one of the things we're quite proud of is defining and explaining technology concepts

like if you Google "yaml" you'll get this article: https://www.redhat.com/en/topics/automation/what-is-yaml

2/5/2025 11:56:28 PM

CaelNCSU
All American
7855 Posts

Propaganda Minister for the YAML mines. Sweet. That's some good SEO if you make it to the top of that query.

2/6/2025 9:34:15 AM

Snewf
All American
64204 Posts

our content team takes an SEO-first approach
but white hat SEO - we answer real questions that real humans ask

no keyword stuffing and no Mr Beast-style bending the knee to the algorithm
just marketing that strives to be on the same tier of quality as our documentation and support

I've been on this team for a decade and I've been a manager of technical marketing content for 3 years

my article "What is middleware?" is a footnote in an amicus brief to the Supreme Court - I'm still very proud of that

2/7/2025 9:44:55 AM

kiljadn
All American
44709 Posts

idk how I missed this thread, but good reads all around


Quote :
"I have no idea why these people insist on the leaning tower of abstractions."


i do!

motherfuckers love building monuments to themselves

7/24/2025 11:46:55 AM

smoothcrim
Universal Magnetic!
19002 Posts

more like memorials of themselves

7/24/2025 6:15:13 PM

kiljadn
All American
44709 Posts

lol u rite

7/25/2025 11:53:24 AM

FroshKiller
All American
51970 Posts

Hm, something fucked up the character entities in my post. I had very carefully encoded them so the entity references wouldn't be rendered as the characters, but something seems to have changed them.

7/29/2025 9:16:49 AM

CaelNCSU
All American
7855 Posts

When money was free and interest rates were low...

"What do you mean we don't need a CDN for a video platform?" I reply to my boss. I'm currently the lead on another technical project at the company, but I've been pulled in to advise our new streaming platform. The V1 video platform is a Ruby on Rails app. It just generates a plain HTML website with images for the video content, "cards" as they say in the biz. The cards are cached in a CDN, Akamai, so our servers are not overloaded when people are browsing the website. The video files, which the web players stream, are also stored in Akamai as the editors upload them. It's a very CDN-centric use case, but my boss has grander ambitions for a V2.

The company is currently spending a couple of engineers' salaries per month on Akamai, and my boss, who has a PhD in physics and worked at CERN, is convinced we can save that Akamai money. The company has a global movie release they want to stream like a regular linear TV channel.

"TV ON THE INTERNET."  

A live stream of a movie that will be released around the world sequentially by time zone. At 9 PM Eastern the movie airs, then at 9 PM Central, and so on. They're basically recreating linear TV, digitally, on the Internet. The marketing for this runs into the tens of millions of dollars and they expect up to 100 million viewers. And instead of just using Akamai, they want to build an untested new version of the TV platform from the ground up. This is not to say my boss is dumb, but he is perhaps suffering from a small case of hubris. This is what happens when you put a CERN physicist in charge of a web application a state school dropout could build.

Quote :
"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system." -- Gall's Law


A warm, fluffy cookie fills my mouth as I chomp on one from a local high-end grocery known for its chocolate chip cookies. "$1,000,000 a year," says the sales consultant who brought the cookies. The consultant is selling a database, Couchbase, that keeps everything in memory so it's lightning fast. So fast it can serve our 100,000,000 theoretical web users. My boss only wants the fastest components for our V2 TV platform. When I ask basic questions like "what is the largest file the database supports?" or "what happens when a new database node gets added?", the sales consultant glances at Mr. CERN and everyone seems kind of annoyed. I stopped asking. I wasn't really on the project anyway, just helping where I could. So I focused on hiring.

"You look like you'd make a good communist," said the Romanian we were interviewing for the backend API team lead position.  He was solid on his coding and passionate, "Code is like poetry to me," but incredibly eccentric. He was clearly a good hire technically.  Hire I guess.

"You're asking me this?!" replies a guy in collared button down shirt with the top button undone revealing a tattoo of a nude woman moaning and the phrase "DTF?" which we assume did not mean Din Tai Fung, our favorite dumpling restaurant.  The question was a basic JavaScript interview question covering variable hoisting.  First hit on Google.  We asked him to answer, and he never did. Maybe this was some kind of social media bit.  No Hire.

Hiring is now complete. Our new TV team and organization has about 120 people: product, marketing, and engineering. A few PhD-level hires, like in natural language processing; they were over in a corner working on a search box. Seriously high-level shit. The experience of the team was egregious. All manner of tech companies were represented: Apple, Netflix, Hulu. We would go on to spend somewhere around $100 million on headcount trying to revamp this new TV platform. "WE ARE TV," the North American CEO would say excitedly, awkwardly, at the company all-hands in front of all 2,000 employees. They put up a literal countdown clock, visible like the morning sun from most of the open-plan office, ticking down second by second. We were 10 months away from TV ON THE INTERNET.

A warm morning, like every morning in sunny Southern California, walking from our office to our local coffee shop. The Master's-in-Statistics hire, our data science guy, dodged needles in his Vibram toe shoes, grinning. About 15 of us had just passed the second-in-command under the department head. No time to worry about the optics of a 15-person coffee break and what it costs per hour; we were hot on the heels of a $5 cold brew. Besides, the engineering head who was the visionary behind this new V2 platform was more intent on hiding in his office than on managing and delegating work. There was little for the team to do other than wait for him to get comfortable enough to share his work.

When the CERN physicist graced the team with his brilliance, they were allowed to write integrations: the data science group, toe shoes and all, consuming a stream to figure out how many people were watching; the front-end team rendering the cards to replicate what the current HTML-based site did. The sheer genius of the new V2 was that the markup and rendering were munged together with the data, mixed concerns as they say in the business. Instead of using HTML like most sites, the site used a JSON-based markup. That way a discerning department head could change the layout and where things showed up on the site. This required a totally in-house rendering engine built on top of regular browser technology like CSS.

Luckily for Mr. CERN, he had a totally dispassionate lackey to help him build his vision. One of my best coworkers got pulled into the project to help build TV on the INTERNET. He loathed software development, yet needed something to pay for his daughter's violin lessons. Whatever monstrosity he was told to build, he would build it. We gather around as tv-v2.bigcorp.com is typed into the browser. After six months of cold brew coffee runs, we wait patiently for the page to load. After a few seconds the movie cards appear. A dude biking down a building and a lady ollieing a skateboard over some bored office workers show prominently on the page. I kind of wonder why the page loaded so slowly if this is the fastest internet site on the planet.

"Dynamic Microservices."  CERN Physicist says.
"What?" I reply?

We had realized, he explains, that all the JSON markup with the large sets of recommendation pages could cause some performance issues, so we created a way to distribute the load. Everything will be a service call. When a web request comes in, the database queries are handed off to other services; those services run the queries and pass the data back. The beauty of it is that it uses Vert.x, a reactive Java framework, so we just write the code as normal and the services handle it.

"Isn't that just adding network latency for no reason?" I ask.
"Well no because everything is broken down in small units of work the machines will handle."

In my career, every slow page load or performance issue had been related to making network calls. Think of it like building a table where, every time you need to hit a nail, you have to ship the whole table to another shop three hours away to pound it in. Each extra network hop is an eternity for a computer; adding thousands of them needlessly really slows things down. I've seen this before. At a previous job, users waited seven minutes for a page to load because the backend was querying thousands of order items one network request at a time.

When I measured some of the payloads coming back from the site, they were almost 25 MB, the size of a whole video game like Doom, graphics, audio, and code, in every single request just to get a video's title and description. The metadata alone, not counting the video, would cost almost $2 million a month in bandwidth. When I brought this up to leadership they just replied, "If we get that many people it will be a good problem to have."
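
Back-of-envelope, with made-up traffic assumptions clearly labeled (the only real number here is the 25 MB payload):

payload_mb      = 25             # measured metadata response size
viewers         = 100_000_000    # the marketing projection
loads_per_month = 8              # assumption: page loads per viewer per month
egress_per_gb   = 0.09           # assumption: rough public-cloud egress price at the time, $/GB

gb = payload_mb * viewers * loads_per_month / 1024
print(f"{gb:,.0f} GB/month -> ${gb * egress_per_gb:,.0f}/month, metadata only")
# roughly 19.5 million GB/month -> about $1.8M/month before a single video byte moves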

[Edited on January 29, 2026 at 8:47 AM. Reason : b]

1/29/2026 8:41:39 AM

CaelNCSU
All American
7855 Posts

The super-duper high-scale dynamic microservices TV platform was ready for its first major load test with three months left on the clock. We were going to simulate just 10,000 concurrent viewers. My boss, the architect and visionary, was beaming ear to ear. 100 high-memory, high-CPU instances in AWS were ready to go. Traffic ramps up: 10 simulated users, then 100. The screen starts to turn red, showing 500 errors and failures from the backend. Mr. CERN, who was beaming, now looks like someone has shot his mother in front of him.

"TWO requests per second."  A DevOps engineer says.
"That can't be right, we have 100 instances. It must be a configuration issue."
"If it's slow maybe we can just use Akamai," the department head replies.
The knife twists.

It was a pattern: the database no one questioned, the 25-megabyte payloads product demanded, the microservices architecture built on network calls. Given the disaster, leadership shifted my boss from technical-and-people leader to just architect. They hired a director to handle the people side. He deferred to Mr. CERN on everything technical and seemed content to let the clock run down.

We rescheduled the demo to help my boss get the kinks worked out. Strolling by his office daily, $5 cold brew in hand, I'd see a look of exasperation and bewilderment. One day I stop by to ask if he's okay. He tells me he still can't get the site to scale past 4 requests per second even after spending 16 hours a day trying to debug it. He claims to have made 100 changes that should help, all futile. I repeat my concerns about network latency and he shows me a Java profiler he has hooked up trying to find the problem. Nothing jumps out as a smoking gun. I tell him profilers don't show network bottlenecks; you need to instrument the code and time how long each call takes. He asks me to help him look into it.

The first thing I do is use a parser generator, ANTLR, to automatically add print statements logging how long each call takes across the code base. This is a crude and hacky way to do performance profiling, but it works. When I run the application and simulate some traffic I notice it is still slow, even locally, and that it uses every available CPU on my machine. If it were just a network issue it would not be turning my laptop into the surface of the sun. Maybe I was wrong. When I sum the timings, the largest culprit is the toString method on the objects containing the layout and title descriptions for the movie cards. When I bring this to my boss's attention he just points to his sacrosanct profiler output. He is going to need more convincing.
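
The idea, boiled down to a sketch: a timing wrapper instead of the ANTLR source rewrite we actually ran over the Java code (Python, names made up):

import time
from collections import defaultdict
from functools import wraps

total_secs = defaultdict(float)  # cumulative wall-clock time per function
call_count = defaultdict(int)

def timed(fn):
    """Accumulate how long every call to fn takes, keyed by the function's name."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            total_secs[fn.__qualname__] += time.perf_counter() - start
            call_count[fn.__qualname__] += 1
    return wrapper

def report():
    """Print functions by total time; the toString equivalent floats straight to the top."""
    for name, secs in sorted(total_secs.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:40s} {call_count[name]:8d} calls {secs:10.3f}s")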

After a few hours of Googling, I discovered that Java performance issues sometimes don't show up in Java profilers. The CPU gets so busy that the profiler drops information—the picture gets cloudy. Brendan Gregg became famous diagnosing these kinds of issues with flamegraphs; it landed him a job at Netflix. His tool of choice was Linux `perf`, which showed what the CPU was actually doing. He also had techniques for measuring off-CPU time—those network calls I kept thinking about. I was going to find the problem.

My research suggested that if the CPU is hot and spinning, it's probably not network calls. Network bottlenecks show low CPU utilization. I fired up `perf record` while running the test and got some cool visualizations. They corroborated my print statement report: the bulk of time was spent in toString calls—specifically JSON serialization and deserialization. Those 25MB payloads were causing the CPU to do more work than the network transfer itself. This surprised me. Each method converted into a service call caused that huge payload to be serialized and deserialized. In dynamic microservice mode, the system just spun cycles moving responses between boxes. Adding recommendations for each movie? Another call. Timezone-specific information? Another call. The whole idea of serializing responses for a webapp to read a record from a database was flawed.
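
A toy version of the finding: shipping a big JSON blob between services means paying serialize/deserialize on every hop (numbers invented to land near the 25 MB ballpark):

import json, time

card = {"title": "Movie", "description": "x" * 500, "recommendations": list(range(50))}
payload = {"cards": [dict(card, id=i) for i in range(35_000)]}  # roughly 25 MB once serialized

start = time.perf_counter()
for _ in range(3):                # three "dynamic microservice" hops
    blob = json.dumps(payload)    # serialize on the way out...
    payload = json.loads(blob)    # ...deserialize on the way back in
elapsed = time.perf_counter() - start

print(f"{len(blob) / 1e6:.1f} MB payload, {elapsed:.2f}s of pure JSON churn for three hops")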

The clock is now showing 45 days remaining to launch. Millions of dollars of marketing budget for the movie are on the line. People could lose their $10,000-a-month luxury apartments. My boss, the architect, was forever obstinate. I presented my new perf record graphs. They showed JSON serialization accounting for 99.999% of the work done by the application. We needed a way to make the payloads smaller and to stop passing them from machine to machine, augmenting them with metadata along the way. He was convinced it was something else, since each machine would be available to do the work. I talked to the new director. He didn't want to rock the boat; he has a family to support and just wants to let it play out. Luckily he did relay my concerns to leadership and the department head. They asked for my opinion. I showed them the data. I explained we needed a major architectural change, but the architect who'd made the decisions wouldn't budge. They made a suggestion.

What about going back to the CDN model with Akamai? "Since we cannot rebuild the entire platform again, that's probably a good fallback," I say. Over the next two months we finally convince our boss to turn off the dynamic microservices feature. With it off, the system works like a normal webapp that does the work on the server receiving the request. This raises throughput to a whopping 100 requests per second across 100 nodes. With caching in front, that will let us survive. We effectively run the next load tests against Akamai. The site stays up and survives the tests this time. My boss is gutted.

The night of the huge movie release is finally here. All 120 employees are hanging around to watch what happens, some because they have nothing else to do, others the way you watch a NASCAR race, for the accidents. The first time zone rolls around; we have it enabled in the office so we can watch the movie as if we were in that time zone. Shouts of hooray as the opening credits roll drown out the movie's audio. My boss stares blankly at his shoes while everyone cheers. This was supposed to be his moment. Little did I know a conversation had already happened. He was really attending his own wake as an observer, not a celebration of his success.

It's now the all-hands after the movie launch. The head of the company addresses us. "Remember last year when I said WE ARE TV." That was a dark time in my life.

1/29/2026 8:42:35 AM

The Coz
Tempus Fugitive
29410 Posts





1/29/2026 7:28:06 PM

smoothcrim
Universal Magnetic!
19002 Posts

^^ the entire stack sounds like a garbage heap of the absolute worst decisions. also, if it has to live any amount of time further, consider OpenTelemetry

1/30/2026 1:30:22 PM

CaelNCSU
All American
7855 Posts

This was 10 years ago. The architect was fired and it was rebuilt as V3, a small Golang app that only took 25 MB of RAM to run, which launched right after the Kubernetes story above.

1/30/2026 2:35:19 PM

DonMega
Save TWW
4249 Posts

Good stories

2/1/2026 7:54:33 PM

smoothcrim
Universal Magnetic!
19002 Posts

^^ ahhh makes way more sense

2/2/2026 2:14:46 PM
