How not to manage production systems


I’ve been at Brex for one week and the onboarding is going great.

One of the onboarding sessions was about the company’s build & deployment systems. It got me thinking about how my previous employers managed their production systems. And how cavalier some of them were.

No two companies will have identical internal controls, but you develop a sense of what’s reasonable after you’ve been around for a while. (You can bin companies by employee headcount, revenue, Daily Active Users, etc. It doesn’t matter. You’ll still get a sense of what processes are appropriate for a particular level.) And you’ll get good at noting when a company significantly deviates from the norm.

I’ve been around the block a few times and I’ve seen some shit. Here’s a story that ought to amaze you.

Once upon a time, there was a company. It was around 10 years old, had around 50 employees, and had more than 600,000 daily active users. So it was a non-trivial startup that had good marketplace traction and a substantial workforce. I worked there for a few years.

It operated like this:

Employees frequently ssh’d into production with carte blanche to do anything. They mucked with databases using psql or Python scripts, interactively poked at Redis clusters, and deployed random branches to servers. We had one dev admin account for all the servers, with no monitoring of employees’ actions. There was zero monitoring of CRUD operations on customer accounts.

Any engineering employee could do any of this and backend engineers often did. I could read any field/attribute/chat from your account, or modify any attribute on your account, and it’d be nigh impossible for the company to know who did it or when it happened.
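For contrast, here’s roughly what the missing piece looks like. This is a minimal sketch, not anything that company (or any other) actually runs: `audited_update`, `AUDIT_LOG`, and the dict-of-dicts `accounts` are all hypothetical names I made up, and a real system would write to an append-only store that engineers can’t edit, with per-person credentials instead of a shared admin account.

```python
import datetime
import getpass

# Hypothetical stand-in for an append-only audit store.
AUDIT_LOG = []


def audited_update(accounts, account_id, field, new_value, actor=None):
    """Change one field on an account and record who did it and when."""
    # Assumption: fall back to the OS login name if no actor is given.
    actor = actor or getpass.getuser()
    old_value = accounts[account_id].get(field)
    accounts[account_id][field] = new_value
    entry = {
        "actor": actor,
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "account_id": account_id,
        "field": field,
        "old": old_value,
        "new": new_value,
    }
    AUDIT_LOG.append(entry)
    return entry


accounts = {42: {"email": "a@example.com"}}
audited_update(accounts, 42, "email", "b@example.com", actor="alice")
```

Even something this crude answers “who did it and when.” With a shared account and raw psql sessions, nothing does.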

These things didn’t happen during a single red-alert, all-hands-on-deck disaster. They weren’t done in a once-a-year OMG FIX THIS P0 BUG NOW DAMMIT event. They happened every day.

Informally, this company says it handles users’ data with confidentiality and respect.

Hahahahaha.

Formally, one of its public documents says:

We take steps to ensure that your information is treated securely and in accordance with this Privacy Policy. Unfortunately, the Internet cannot be guaranteed to be 100% secure, and we cannot ensure or warrant the security of any information you provide to us.

I’d submit “the Internet” isn’t the only or biggest problem here. These words camouflage an absolutely horrific reality.

The engineers knew these were bad practices and wanted better controls on production. Whenever the topic was raised, management would say there were no resources or schedule time available to make the requisite changes. “We can’t do it now, but maybe in the next sprint or the sprint after that.” OK, sure.

The engineers had to operate this way to make progress on their tickets and projects. Without a disaster, management was sanguine about the existing approach and the company merrily skipped along.

And after a while it all starts to feel normal. “Ssh into a server on the production VPN and use Python interactively to modify a user’s XXX to be YYY? On my own, without any formal review? OK, sure.” Totes commonplace.

The company’s management always wanted more product features, and prioritized features far above reducing technical debt. Or automating manual processes… Or documenting how things worked… Or improving the dev or ops tools… Or anything else.

And it was insanely understaffed while I was there. From what I hear, it still is.
