Wednesday, June 14, 2023

Lessons Learned from 11 Years of Hosting a SaaS

Tanda will be turning 11 pretty soon. A reader suggested it would be fun to reflect on things I’ve learned in running the app on the internet during that time.

I sat on this post for ages because deployment, hosting, and infrastructure management in general were possibly the most challenging and frustrating parts of my job for a decade. Mostly that’s because I constantly dove into the deep end and didn’t know what I was doing a lot of the time. Unfortunately, when you have a production app that lots of people use, you don’t always have the time to learn things properly.

This post is the story of some of the phases we went through, written in the hope that if you see yourself on the same path, you can skip a few of the worst bits.

We started on Heroku, because in 2012 if you did any Ruby on Rails tutorial that included deploying your app, you ended up with a Heroku account.

Heroku’s ease of use was unparalleled. But this didn’t mean much to somebody who’d never deployed a web app before. I knew that internet guide authors thought it was the simplest way to deploy an app, but I didn’t understand how big an improvement it was over what came before it.

What I did understand were its weaknesses:

Prescriptiveness: Heroku worked very well, as long as you used it exactly as intended. We were pretty close to this: a web app with a database, some background workers, and a cache.

For us there was one slight difference, which is that our app needed to occasionally handle long requests (file uploads) from slow clients (phones or tablets running in places with bad signal). We didn’t pioneer mobile file uploads, but the right way to configure Unicorn to handle them was just different enough from the defaults that it caused me a lot of grief (there’s a sketch of the kind of tweak I mean below).

Because I knew nothing about deploying web applications, I decided this was Heroku’s fault.

These days I know a little better and can appreciate the complexity of what they were trying to do, but I doubt I was the only person to take their elevator pitch of being everything you need to deploy an app a bit too literally.
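
To give a flavour of the tweak I mentioned above: Unicorn’s defaults assume requests finish quickly, so slow uploads need more timeout headroom, among other things. This is a rough illustrative sketch with made-up numbers, not our actual config:

    # config/unicorn.rb - an illustrative sketch; worker count and timeout are made up
    worker_processes 3
    preload_app true

    # Unicorn's default 60-second worker timeout assumes fast clients. A big
    # upload over a flaky mobile connection can easily hold a worker for longer
    # than that, so the timeout needs headroom - at the cost of tying up one of
    # a small pool of workers for the duration of the upload.
    timeout 120

    before_fork do |_server, _worker|
      # Close the master's database connection so each forked worker
      # opens its own rather than sharing a socket with its siblings.
      ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
    end

    after_fork do |_server, _worker|
      ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
    end

The deeper issue, which I only appreciated later, is that Unicorn is designed to sit behind a buffering reverse proxy (usually nginx) so that slow clients never occupy a worker directly - which is harder to arrange when you don’t control the routing layer yourself.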

Costs: Heroku cost a lot more than alternatives like running your own VPS. Of course - it did a lot more! But because this was my first time deploying anything myself I didn’t appreciate the second part of that, and just saw dollar signs when comparing it to alternatives.

I imagine this is similar to the experience of newcomers to Rails today, versus those who discovered it (as DHH did) after coming from Java or something else in the bad old days. If you never lived through what came before, you can’t appreciate how much better this is. Luckily, these days you can easily get that same appreciation by trying to build a full-stack JavaScript app and then coming back to Rails.

Anyway, cost is what eventually led to us migrating off Heroku. Our last Heroku invoice was… $104.95. 🤦‍♂️

About a year into Tanda I had an intern from my university who was super interested in infrastructure and cost optimisation. He basically convinced me that paying for Heroku was like setting money on fire. He was a lovely guy, and I really appreciated his help at the time, but 10 years later I can honestly say this was awful advice (the buck stops with me for listening to it).

Moving off Heroku meant replacing all the bits that Heroku did for us. We didn’t do it in a fully automated way, because we were sub-scale! We were so small, there would have been no point. Instead, we just pointed and clicked servers into existence in Digital Ocean’s UI, then set up some Capistrano scripts to deploy to them. Over a weekend, we took the site offline for some ridiculously short period of time, downloaded the database from Heroku, uploaded it to a Digital Ocean “droplet” (aka server), and changed the DNS records. We had migrated over!
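
The Capistrano side was nothing fancy. It looked roughly like this (Capistrano 3 syntax, with made-up names and example IPs rather than our actual config):

    # config/deploy.rb - a minimal sketch; app name, repo, and IPs are made up.
    # (The server lines would normally live in config/deploy/production.rb;
    # they're inlined here for brevity.)
    set :application, "myapp"
    set :repo_url,    "git@github.com:example/myapp.git"
    set :deploy_to,   "/var/www/myapp"
    set :keep_releases, 5

    # Every droplet we clicked into existence got added to this list by hand -
    # the manual, finicky part that bit us later.
    server "203.0.113.10", user: "deploy", roles: %w[web app]
    server "203.0.113.11", user: "deploy", roles: %w[web app]
    server "203.0.113.20", user: "deploy", roles: %w[db]

    # Files and directories that survive between releases
    append :linked_files, "config/database.yml"
    append :linked_dirs,  "log", "tmp/pids", "tmp/sockets", "public/system"

After that, a deploy is just running cap production deploy from a machine with SSH access to those servers.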

Our first Digital Ocean invoice was for $28.93, and our second (the first full month) was for $39.23. I thought I was so smart saving $2 a day. For a while it worked okay; it turns out $40/month bought a lot more servers than we actually needed to run our very small app.

The cracks started to show when we started to grow faster. We were doubling the size of our customer base every 9 months, and pretty soon this meant we needed more servers. The process of adding them was manual, finicky, and easy to get wrong. I worked out how to do it but I always had a bad feeling in my stomach when adding extra “hardware”.

The cracks really started to show when our database server started getting overloaded. Pretty consistently, if we didn’t deploy the site for more than a day, Postgres would run out of memory and get killed by the operating system. Sometimes it would fix itself, but more often it would require someone to SSH into the app servers and restart them all. This was an annoying part of the workday during business hours, but I have more than a few memories of restarting servers from my phone in the toilets of bars during this time.

But the worst Digital Ocean incident we ever had was when they turned all our droplets off all at once. The credit card entered into the account had expired, there was no backup card, and the contact email on the account went to a shared inbox that was not monitored. So for probably a month we were getting and ignoring billing alerts, until we really paid attention when everything was offline and not responding to SSH. This wasn’t totally their fault, but at the time it just all felt like a dodgy, shaky setup.

Writing all this nearly 10 years later feels very cringe. It’s shocking how little we knew, and a bit of a miracle that we got away with it. If I had a time machine I’d go back and tell myself to spend 10x more on Digital Ocean ($500/month really wouldn’t have broken the bank) and sleep properly.

After about 3 years on Digital Ocean, we decided that the platform was too simple for our growing needs. We were starting to sign on bigger customers, and we thought we needed a more enterprisey approach to hosting our app. We wanted a managed database instead of managing our own Postgres on our own server. We wanted less platform downtime.

We needed to be able to autoscale in response to fluctuations in demand, and we needed to be able to load balance different routes to different groups of resources (… of our monorepo). We thought we needed all these things to be legitimate.

In hindsight, most of this logic was backwards. Auto scaling is a technique, not a product monopolised by AWS. Instead of seeking more challenges, we should have found a platform that was simple enough that we could actually master it. (The managed database thing was a good idea though.)

The only genuinely good reason to move off DO was that they didn’t have an Australian data center, and we had some customers who really cared about that. At the time, one was supposedly just around the corner; it ended up launching in late 2022. So it’s good we didn’t wait for it.

Anyway. We needed to level up. And if you want to level up your hosting, who you gonna call?

We needed to be a real business, and real businesses hosted their apps on AWS. So that’s what we did. Specifically, we ported our exact Digital Ocean infrastructure onto AWS EC2. We didn’t take advantage of any other platform features; we just treated AWS like any other VPS provider.

A few months later I learned that we were entitled to an AWS account manager. I learned this from a customer, who also did an intro. I was pretty excited - I thought an account manager would be able to help us grow very quickly and get to a nirvana where we didn’t have constant fear about scaling.

At our first meeting our account manager brought along his solutions architect. I had never met a solutions architect, so I didn’t really know what they did. All this guy did was answer every question we asked, about anything, with “how would that work in a world without servers?”. I didn’t really understand how AWS Lambda would help us (still don’t), and he didn’t have anything useful to contribute beyond reminding us that it existed.

I had been so excited about having an account manager, so for a while I felt dumb for not understanding Lambda and not being smart enough to make AWS work. Eventually I realised that I wasn’t the problem.

Another fun incident, about a year later, was that we ran out of integers. Our Rails app was pretty old, and almost every table used integer as its primary key type. Newer versions of Rails create new tables with bigint primary keys, but nobody on our team realised this was a problem until one Friday (you can’t make this stuff up - it was Friday the 13th!) we couldn’t insert any new rows into the most commonly written table. Luckily everyone was still at the office drinking, so we were able to respond to the incident pretty quickly. This story really resonates.
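
The fix is conceptually simple - widen the primary key, and anything that references it, to bigint - but the naive version below takes an exclusive lock while Postgres rewrites the whole table. A rough sketch with hypothetical table names, not our actual schema:

    # A naive sketch of the fix; "shifts" and "timesheets" are hypothetical.
    # change_column holds an ACCESS EXCLUSIVE lock while Postgres rewrites the
    # table, so on a large, hot table you'd really do this as
    # add-new-column / backfill / swap rather than in one shot.
    class WidenPrimaryKeysToBigint < ActiveRecord::Migration[5.2]
      def up
        change_column :shifts, :id, :bigint
        # Foreign keys pointing at shifts.id will overflow just the same,
        # so they need widening too.
        change_column :timesheets, :shift_id, :bigint
      end

      def down
        change_column :timesheets, :shift_id, :integer
        change_column :shifts, :id, :integer
      end
    end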

This incident prompted us to put a lot more effort into monitoring so we could respond more quickly when things break (this was a silver lining). It also gave me a lifelong paranoia about other hidden gotchas in PostgreSQL that I have never been able to fully shake (I’m not sure if this was a silver lining).

In more recent times, major projects in AWS land have mostly been compliance-related. Making sure we tick every box for GDPR and its equivalents in other countries led to getting SOC 2 certified. For all these things, being able to point to the Amazon logo made things a little bit easier, but it’s not the case that anything we wanted to do was made possible or impossible by being on a specific cloud.

A few years into AWS we started to feel stable on it, infrastructure-wise. We hadn’t re-architected our stack for a while, and we didn’t see a big need to - two big achievements! The next major challenge we faced was institutional knowledge, or the lack thereof. Over Tanda’s history, fewer than 10 people had worked in “DevOps” (very broadly defined). But people come and go: two were around at that point, and one was finishing up soon, so the idea of having only a single Site Reliability Engineer on the team was not very appealing.

Not that the SREs were working entirely alone. We’d had an oncall rotation for engineers for a while too, but we weren’t very good at training people on the tricky parts of the stack beyond the Rails app. So oncall folks spent a lot of time acknowledging and watching alerts, but only on a few occasions did they successfully get into the weeds and fix issues or significantly improve systems.

Basically, the system was being held together by string and random bursts of individual brilliance. That’s a bad long term strategy. We needed a proper team structure so that we never depended on one person being able to debug any issue.

To do this, about a year ago we created a Platform Infrastructure Team, reporting to the CTO. The team had people in several time zones, so we had 24-hour coverage for Ops, Infrastructure, and related work.

This was a big highlight personally - I finally stopped being on call!

It also was the first time I really felt like we had a team that was building expertise. After a decade of worrying we didn’t know enough, having things break in embarrassing ways, and changing platforms a lot, it felt great to have a roadmap to stability and professionalism.

The first thing the PIT did was end a bunch of half-done, ongoing infrastructure projects and trim as much unused infrastructure as possible. Between that and documenting the oncall process properly, they got rid of a lot of complexity very quickly. This made everyone in the team more productive right away, and also gave them ownership over the system.

[Photo: the official Platform Infrastructure Team hat, on the head of our CTO, Leon.]

It’s still a work in progress, because building expertise in complex domains takes a long time. But for the first time ever I’m really confident in the team, and really proud of what they’ve achieved in a year.

By the way, we’re still on AWS, but that doesn’t mean we’ll never change platforms again. It’s always good to explore what’s out there, and we’ve spent a bit of time learning about moving off the cloud to a managed data center. But the nice thing is not feeling like we need to.

If I had a time machine to go back to 2012 and give myself a few pointers, what would I say?

Lots of little tips, and three big ones. They all boil down to spending a bit more money to avoid a lot of headaches.

Use managed services for as long as possible. We did ourselves a big disservice by leaving Heroku after only a few months. We should have stayed on it for years - we wasted so much time managing servers during the critical early days, work that could have been done for us.

Set up a PIT sooner. I should have set up a team of professionals who wanted to work in this space much, much earlier. Not in the Heroku days, but as soon as running things ourselves became untenable at real scale.

Look after yourself just a bit more. For some reason I always found it really hard to prioritise projects that would decrease alerts, simplify oncall, or help me get more sleep - until suddenly one day I snapped and reallocated a lot of budget to set up the PIT. Getting decent sleep has many commercial benefits, and it’s not selfish to prioritise that over other things the team could work on.


Thanks to Austin and Dave for reading drafts of this. Bigger thanks to those two, and to everyone else who’s worked on keeping us online and clocking over the years. I only take credit for the stuff we got wrong.


