A Goldilocks Approach to Reliability and Scalability Projects

Ronda, Spain

Fast-growing SaaS companies face some first-class problems. As transactions and concurrent users accelerate, your application will often struggle to maintain reliable service. Without scaling the infrastructure and software, the four golden signals of SRE (latency, traffic, errors, and saturation) are all going to head in the wrong direction. Pretty soon your Service Level Objectives (SLOs) are going to be in the red more and more often. All of this means operational struggles, fire drills, and upset customers. You’ll be spending more and more of your R&D budget on fixes and scale band-aids and less and less on the features your new customers need.

Using the error budget approach recommended by Google and others can be an effective way to make sure you have a release valve to move more resources towards reliability as issues arise. But I do think there are situations and organizations that need an additional process or two to manage Reliability At Scale. Error budgets are a lagging measure: once you are seeing errors and red SLOs on your board, it’s very likely customers are already feeling considerable pain and your staff is in fire-fighting mode. If staying sustainably within your error budget requires significant software or infrastructure changes, it may take a long time before you are meeting your reliability goals. Don’t forget you are growing rapidly! Whether or not things are bad now, are you one big growth spurt away from an unreliable system?

How can you and your stakeholders gain confidence that you are likely to meet your Service Level Agreements (SLAs) and SLOs now and in the future? How can you feel good about your investment in product development versus tech debt, reliability, and scalability work? Is the fancy horizontally scalable database your architect wants to swap in worth the multiple developer-months it will take to pull off? Do you really need to do that now, or could you move it to the next quarter to get more features done?

Arcos De La Frontera, Spain

What you need is a “Goldilocks Approach” to investing in reliability: not too little and not too much, not too soon and not too late.

To summarize the problems we want to solve:

  • Are we doing the right work?
  • Are we doing enough of it?
  • Will we stay ahead of our growth?

I’ll editorialize that I think any solution should also:

  • Define success
  • Measure reliability against realistic projected load on sensible time horizons
  • Engage the full team in ownership and accountability for system reliability
  • Provide a context to prioritize reliability and scalability projects against other work

Enter “Evidence Based Reliability At Scale”

I call the solution to these problems Evidence Based Reliability At Scale or EBR@S.

EBR@S
1. Create transaction projections based on the best information you have.
  • First do your research:
    - Talk to sales and marketing: are we likely to close a big customer, or are we planning a set of splashy commercials?
    - Talk to product and your architects: are we adding complex new processes that will compete for resources on our infrastructure?
    - Look at the company’s goals: if we hit them, what would the load on our system look like?
  • Determine the core transactions you need to track:
    - You know your system well (right?!?), so you should be able to identify these intuitively. If you are, say, an online ordering company, submitted orders is clearly one of them. But menu views, menu changes, or reports run by administrators could also be important. A core transaction probably isn’t “add to basket” or changing store hours (though I do have a story about that!).
    - You are looking for the types of transactions that roll up a bunch of tasks that happen on your system at once and grow together for the most part.
    - It’s important to not tie these too closely to the architecture of your current system. You’ll likely be changing a lot under the hood and you want these transaction projections to continue to have intrinsic meaning.
  • Now use your research, data, and best judgment to project transactions into the future. You might say you expect user signups to double in the next 12 months, or that because of the seasonality of your business you’ll see a winter spike in transactions 25% higher than last year’s. The further out you go, the less certainty you have.
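To make the projection step concrete, here is a minimal Python sketch. Everything in it is my own illustration rather than a prescribed model: the function name `project_transactions`, the compound-growth assumption, the seasonal multipliers, and the 2% per month widening of the uncertainty band are all assumptions you should replace with what your research tells you.

```python
from dataclasses import dataclass


@dataclass
class Projection:
    month: int        # months from today
    expected: float   # projected transactions per day
    low: float        # pessimistic bound
    high: float       # optimistic bound


def project_transactions(baseline, annual_growth, months,
                         seasonal=None, uncertainty_per_month=0.02):
    """Project a core transaction forward, assuming compound monthly growth.

    seasonal maps a month offset to a multiplier (hypothetical, e.g. {12: 1.25}).
    The band widens each month because certainty drops the further out you go.
    """
    seasonal = seasonal or {}
    monthly_rate = (1 + annual_growth) ** (1 / 12)
    projections = []
    for m in range(1, months + 1):
        expected = baseline * monthly_rate ** m * seasonal.get(m, 1.0)
        spread = uncertainty_per_month * m
        projections.append(
            Projection(m, expected, expected * (1 - spread), expected * (1 + spread))
        )
    return projections


# Hypothetical example: 10,000 orders per day today, doubling over 12 months.
orders = project_transactions(baseline=10_000, annual_growth=1.0, months=12)
for p in orders[2::3]:  # print every third month
    print(f"month {p.month:2d}: {p.low:,.0f} to {p.high:,.0f} (expected {p.expected:,.0f})")
```

The point is less the math than having a number, with an honest range around it, that you can later compare against load test results.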

2. Consider whether you need to do the above for your subsystems. If you have a fairly complex system, you likely need to create localized projections that take in the larger company projections and add any domain- or system-specific projections as needed.

  • For example, if you have a user subsystem, you could create projections for user signups and logins. If overall system transactions are going to double over 12 months, and you know that user signups grow linearly with the overall system, you’ve got your number.
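As a rough sketch of that subsystem arithmetic in Python: the helper `subsystem_projection`, the linear ratio, and all of the numbers are hypothetical. The optional `extra` argument stands in for any domain-specific load you expect on top of what the company-wide projection implies.

```python
def subsystem_projection(system_projection, ratio, extra=None):
    """Derive a subsystem projection from the system-wide one.

    Assumes the subsystem grows linearly with the overall system (ratio),
    plus optional per-period additions for domain-specific load (extra).
    """
    extra = extra or [0] * len(system_projection)
    return [round(total * ratio + add, 2)
            for total, add in zip(system_projection, extra)]


# Hypothetical numbers: 400 signups per day against 10,000 orders per day today.
ratio = 400 / 10_000

# System-wide orders per day: today, then +4, +8, and +12 months.
system_orders = [10_000, 12_000, 15_000, 20_000]

signups = subsystem_projection(system_orders, ratio)
print(signups)

# Same projection, plus a hypothetical marketing campaign adding 200 signups
# per day in the final period.
signups_with_campaign = subsystem_projection(system_orders, ratio, extra=[0, 0, 0, 200])
print(signups_with_campaign)
```

If the relationship is not linear for your subsystem, swap in whatever relationship your data supports; the structure of the calculation stays the same.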

3. Time to test! So how are your load tests (you do have load tests, right)? Time for a checkup!

  • We want as authentic a load as we can easily get here (there are diminishing returns).
  • You also want a load test environment as close to production as possible. If you can’t sustain the expense, at least try to determine the capacity ratio between your test environment and production so you can scale your results.
  • You want load tests that hit all of your core transactions. Transactions that share resources should be load tested together.
  • As your system changes, be sure to keep your load tests current. Run them as often as practical, ideally before every release.
  • The goal of any good load test is to find your system’s breaking points. At what load do your SLIs and SLOs go sideways? At what point do you have total system failure?
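Here is one way you might pull a breaking point out of a stepped load test, sketched in Python. The `LoadStep` record, the SLO thresholds, and the linear `scale_to_production` estimate are all assumptions for illustration; real systems rarely scale perfectly linearly, so treat the scaled number as a rough guide.

```python
from dataclasses import dataclass


@dataclass
class LoadStep:
    tps: float             # offered load, transactions per second
    p95_latency_ms: float  # 95th percentile latency at that load
    error_rate: float      # fraction of failed requests, 0 to 1


def breaking_point(steps, max_p95_ms, max_error_rate):
    """Return the highest load that met the SLOs before the first violation.

    Returns None if even the lowest step violated the SLOs.
    """
    last_good = None
    for step in sorted(steps, key=lambda s: s.tps):
        if step.p95_latency_ms > max_p95_ms or step.error_rate > max_error_rate:
            break
        last_good = step.tps
    return last_good


def scale_to_production(test_tps, capacity_ratio):
    """Estimate production capacity from a smaller test environment.

    capacity_ratio is production capacity divided by test capacity.
    Assumes linear scaling, which is a simplification.
    """
    return test_tps * capacity_ratio


# Hypothetical results from a stepped load test.
results = [
    LoadStep(100, 120, 0.001),
    LoadStep(200, 150, 0.001),
    LoadStep(300, 280, 0.004),
    LoadStep(400, 900, 0.080),  # latency and errors go sideways here
]

test_limit = breaking_point(results, max_p95_ms=300, max_error_rate=0.01)
prod_estimate = scale_to_production(test_limit, capacity_ratio=4)
print(test_limit, prod_estimate)
```

Keeping the raw step data around also helps later, when you retest and want to show how far the breaking point has moved.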

4. Compare the results of your load testing to your load projections both at the system and subsystem level. There are a few possible outcomes here:

  • You are good for the foreseeable future: your load tests toppled the system at levels far higher than projected load. Congrats, that’s great. You can focus on features, improving lead times, or whatever you want. Just don’t rest fully on these results. Continue testing and adjusting projections as you learn more.
  • There are some issues in the near future, and there is less headroom than you are comfortable with. Don’t forget that these are just projections: they could be wrong, and your company could grow faster than expected.
  • Things did not go well. You are rapidly approaching your system limits. You can see the fire drills, executive escalations, and irate customers in your future. Best to do something about it now, hopefully before you hit a tipping point.
  • In all likelihood, you won’t fit neatly into these groups. Parts of your system are probably fine while others need some love and soon.
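The comparison itself can be sketched in a few lines of Python. The function names, the three labels, and the 50% "comfortable headroom" threshold are my own placeholders; pick thresholds that match your risk tolerance and how much you trust your projections.

```python
def headroom(capacity, projected_peak):
    """Spare capacity as a fraction of projected peak load."""
    return (capacity - projected_peak) / projected_peak


def months_of_runway(capacity, projections):
    """Return the first month whose projected load exceeds capacity, else None.

    projections is a list of (month, projected_peak) pairs in time order.
    """
    for month, load in projections:
        if load > capacity:
            return month
    return None


def classify(capacity, projections, comfortable_headroom=0.5):
    """Bucket a system or subsystem into one of three rough outcomes."""
    worst = max(load for _, load in projections)
    spare = headroom(capacity, worst)
    if spare >= comfortable_headroom:
        return "healthy"
    if spare >= 0:
        return "tight"
    return "at risk"


# Hypothetical projected peaks (transactions per second) at 3, 6, and 12 months.
projected = [(3, 500), (6, 700), (12, 1000)]

# Hypothetical measured capacities per subsystem, from load tests.
capacities = {"orders": 2000, "users": 1200, "reports": 900}

for name, capacity in capacities.items():
    print(name, classify(capacity, projected), months_of_runway(capacity, projected))
```

Running this per subsystem gives you the mixed picture described above: some parts comfortably ahead of growth, others with a date attached.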

5. Time to problem solve. You know you need to increase headroom.

  • Look at the limiting factors preventing higher transaction volumes or green SLOs. Maybe the CPU spikes, the thread pool is exhausted, or the database hits latch contention.
  • Try to figure out why, using instrumentation, logs, traces, and other diagnostics from your load tests. You may have an N+1 query problem at the database, an O(n³) algorithm, or a memory leak.
  • This is a good place to talk about the budget. Will you be allowed to simply “throw money at the problem” by scaling infrastructure? Can your infrastructure even scale or do you have an architectural limit you are going to hit soon?
  • Based on this analysis, you can propose projects to address the issues you’ve found.
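If the N+1 pattern mentioned above is unfamiliar, here is a small self-contained Python illustration using the standard library’s `sqlite3` module. The schema, the `CountingConnection` wrapper, and the data are invented for the example; in a real system you would spot this in query logs or traces rather than with a hand-rolled counter.

```python
import sqlite3


class CountingConnection:
    """Thin wrapper that counts how many queries are sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def run(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


# Build a tiny in-memory database: 50 orders with one item each.
raw = sqlite3.connect(":memory:")
raw.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
""")
for order_id in range(1, 51):
    raw.execute("INSERT INTO orders VALUES (?, ?)", (order_id, f"customer-{order_id}"))
    raw.execute("INSERT INTO items (order_id, name) VALUES (?, ?)", (order_id, "widget"))


def items_n_plus_one(db):
    """One query for the orders, then one more query per order."""
    orders = db.run("SELECT id FROM orders")
    return {oid: [name for (name,) in
                  db.run("SELECT name FROM items WHERE order_id = ?", (oid,))]
            for (oid,) in orders}


def items_joined(db):
    """The same data fetched with a single JOIN."""
    result = {}
    for oid, name in db.run(
            "SELECT o.id, i.name FROM orders o JOIN items i ON i.order_id = o.id"):
        result.setdefault(oid, []).append(name)
    return result


slow_db, fast_db = CountingConnection(raw), CountingConnection(raw)
slow_result = items_n_plus_one(slow_db)
fast_result = items_joined(fast_db)
print(slow_db.queries, fast_db.queries)
```

Both functions return the same data, but the query count of the first grows with the number of orders, which is exactly the kind of thing that looks fine today and topples at projected load.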

6. You should now have what you need to prioritize reliability projects with your stakeholders. You have the evidence you need to justify the investments required. Instead of “I’m worried about system stability” or “We’re doomed!” you can come to a roadmapping meeting with “Based on projections gathered from the business, we have 6 months to complete these 3 projects. If we don’t do them, we’ll fail our SLOs.” I’d favor smaller projects, done quickly.

7. As you deliver these improvements, large and small, retest your system. Hopefully you are getting more runway, giving you more confidence in your future reliability and earning trust with your stakeholders.

EBR@S

And that’s Evidence Based Reliability At Scale. You should repeat this process on some cadence or as conditions change. This is a time-intensive process, so getting buy-in to start may prove difficult. I would recommend trying it first on a couple of subsystems: one you think will likely pass and one that won’t. This can help you get traction for the process in your organization.

Let me know what you think of this process and if you try it out in your organization!

20+ years in software. I write about leadership and managing managers. I add in travel photos for fun.

Andrew Sidesinger
