A Goldilocks Approach to Reliability and Scalability Projects

Rondo, Spain
Arcos De La Frontera, Spain

What you need is a “Goldilocks Approach” to investing in reliability: not too little, not too much, not too soon, and not too late.

  • Are we doing the right work?
  • Are we doing enough of it?
  • Will we stay ahead of our growth?
  • Define success
  • Measure reliability against realistic projected load on sensible time horizons
  • Engage the full team in ownership and accountability for system reliability
  • Provide a context to prioritize reliability and scalability projects against other work

Enter “Evidence Based Reliability At Scale”

  1. Create transaction projections based on the best information you have.
  • First do your research:
    - Talk to sales and marketing: are we likely to close a big customer, or are we planning a set of splashy commercials?
    - Talk to Product and your architects: are we adding complex new processes that will compete for resources on your infrastructure?
    - What are the company’s goals? If we hit them, what would load on our system look like?
  • Determine the core transactions you need to track:
    - You know your system well (right?!?), so you should be able to identify these intuitively. If you are, say, an online ordering company, submitted orders are clearly one of them. But menu views, menu changes, or reports run by administrators could also be important. A core transaction probably isn’t “add to basket” or changing store hours (though I do have a story about that!)
    - You are looking for the types of transactions that roll up a bunch of tasks that happen on your system at once and grow together for the most part.
    - It’s important to not tie these too closely to the architecture of your current system. You’ll likely be changing a lot under the hood and you want these transaction projections to continue to have intrinsic meaning.
  • Now use your research, your data, and your best judgment to project transactions into the future. You might say you expect user signups to double in the next 12 months, or that because of the seasonality of your business you’ll see a winter spike in transactions 25% higher than last year’s. The further out you go, the less certainty you have.
  • For example, if you have a user sub-system, you could create projections for user signups and logins. If overall system transactions are going to double over 12 months, and you know that user signups grow in proportion to the overall system, you’ve got your number: signups double too.
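The projection arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a forecasting method: it assumes smooth compound growth between today and the 12-month target, and the baseline of 100,000 orders per month and the ratio of 0.05 signups per order are hypothetical numbers chosen only for the example.

```python
def project_volume(baseline: float, annual_growth: float, months: int) -> list[float]:
    """Project monthly volume, assuming smooth compound growth.

    annual_growth is a multiplier over 12 months (2.0 means "doubles in a year").
    """
    monthly_factor = annual_growth ** (1 / 12)
    return [baseline * monthly_factor ** m for m in range(months + 1)]


def derive_subsystem(projection: list[float], ratio: float) -> list[float]:
    """Scale an overall projection for a transaction that grows in proportion to it."""
    return [volume * ratio for volume in projection]


# Hypothetical inputs: 100,000 orders per month today, expected to double in
# 12 months, and roughly 0.05 user signups per order.
orders = project_volume(baseline=100_000, annual_growth=2.0, months=12)
signups = derive_subsystem(orders, ratio=0.05)

for month in (0, 6, 12):
    print(f"month {month:2d}: orders={orders[month]:,.0f} signups={signups[month]:,.0f}")
```

A seasonal spike could be layered on by multiplying the affected months by an uplift factor. Whatever the model, keep it simple enough that you can revise it as soon as sales or marketing tells you something new.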
  • We want as authentic a load as we can easily get here (there are diminishing returns).
  • You also want a load test environment as close to production as possible. If you can’t sustain the expense, at least try to determine the ratio of your test environment’s capacity to production’s, so you can translate test results into production terms.
  • You want load tests that hit all of your core transactions. Transactions that share resources should be load tested together.
  • As your system changes, be sure to keep your load tests current. Run them as often as practical, ideally before every release.
  • The goal of any good load test is to find your system’s breaking points. At what load do your SLIs and SLOs go sideways? At what point do you have total system failure?
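One way to pin down a breaking point is to ramp load in steps and record where your SLO thresholds are first violated. The sketch below assumes you already have per-step results exported from whatever load testing tool you use; the `StepResult` type, the thresholds, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    """Measurements from one step of a stepped load test (hypothetical shape)."""
    requests_per_second: int
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def find_breaking_point(steps: list[StepResult],
                        max_p99_ms: float,
                        max_error_rate: float) -> Optional[int]:
    """Return the lowest load at which either threshold is violated, or None."""
    for step in sorted(steps, key=lambda s: s.requests_per_second):
        if step.p99_latency_ms > max_p99_ms or step.error_rate > max_error_rate:
            return step.requests_per_second
    return None


# Hypothetical results for one core transaction.
steps = [
    StepResult(200, 120.0, 0.000),
    StepResult(400, 180.0, 0.001),
    StepResult(800, 650.0, 0.004),
    StepResult(1600, 2400.0, 0.120),
]

print(find_breaking_point(steps, max_p99_ms=500.0, max_error_rate=0.01))
```

A result of `None` means the test never pushed hard enough to break the SLO, which is a signal to extend the ramp rather than proof of unlimited headroom.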
  • You are good for the foreseeable future: your load tests toppled the system at levels far higher than projected load. Congrats, that’s great. You can focus on features, improving lead times, or whatever you want. Still, don’t rest entirely on these results. Continue testing and adjusting projections as you learn more.
  • There are some issues in the near future, and there is less headroom than you are comfortable with. Don’t forget that these are just projections. They could be wrong, and your company could grow faster than expected.
  • Things did not go well. You are rapidly approaching your system limits. You can see the fire drills, executive escalations, and irate customers in your future. Best to do something about it now, hopefully before you hit a tipping point.
  • In all likelihood, you won’t fit neatly into one of these groups. Parts of your system are probably fine, while others need some love, and soon.
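Sorting each core transaction into one of these three groups comes down to comparing its breaking point with its projected peak load. Here is a minimal sketch; the 2.0x and 1.2x thresholds are assumptions to tune to your own risk tolerance, and the transaction names and numbers are hypothetical.

```python
def classify_headroom(breaking_point: float,
                      projected_peak: float,
                      comfortable: float = 2.0,
                      minimum: float = 1.2) -> str:
    """Classify a transaction by the ratio of breaking point to projected peak."""
    if projected_peak <= 0:
        raise ValueError("projected_peak must be positive")
    headroom = breaking_point / projected_peak
    if headroom >= comfortable:
        return "good"      # toppled far above projected load
    if headroom >= minimum:
        return "tight"     # less headroom than is comfortable
    return "at risk"       # rapidly approaching system limits


# Hypothetical data: (breaking point, projected peak), both in requests per second.
results = {
    "submit_order": (1_100, 1_000),
    "menu_view": (9_000, 3_000),
    "admin_report": (60, 40),
}

for name, (breaking_point, peak) in results.items():
    print(f"{name}: {classify_headroom(breaking_point, peak)}")
```

Running the classification per transaction rather than for the system as a whole makes the mixed picture visible: some parts are fine, others need attention.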
  • Look at the limiting factors that prevent higher transaction volumes or green SLOs. Maybe the CPU spikes, the thread pool is exhausted, or the database hits latch contention.
  • Try to figure out why, using instrumentation, logs, traces, and other diagnostics from your load tests. You may have an N+1 query problem at the database, an O(n³) algorithm, or a memory leak.
  • This is a good place to talk about the budget. Will you be allowed to simply “throw money at the problem” by scaling infrastructure? Can your infrastructure even scale or do you have an architectural limit you are going to hit soon?
  • Based on this analysis, you can propose projects to address the issues you’ve found.


