Alejandro's blog

Understanding Software Dynamics Book: Prelude

An intro to the Understanding Software Dynamics Book, chapters 1 and 2

Phil Eaton (one of my influences) started a mailing-list-style book club in late May for the Understanding Software Dynamics book. At a glance, the book delves into a daunting topic for someone who hasn't been this deep into systems performance, but I have to say that it's an interesting opportunity for multiple things:

  • Reading about a “novel” topic and trying to digest what I understood about it.
  • Reading what other people have to say about the topic, and learning from them.
  • Getting the encouragement to go deeper.

The book club will kick off the discussion for two chapters at a time each week, and I am aiming to put my somewhat messy notes here, along with my open questions. With this format, I hope I can come back later and answer some of them as I learn more.

Having said that, let’s go!

Chapter 1: My program is too slow

The takeaways from my reading of chapter 1 are captured in the following ideas:

  1. What is too slow?
  2. General terminology
  3. Tail latency
  4. Five fundamental resources

What is too slow?

In Chapter 1, Richard starts talking about two types of programmers:

  • Those who can talk, with numbers, about their program and why it is not behaving as expected.
  • Those who can say: “My program is slow, and I don’t know how to make it fast”.

I fall into the second type. It’s interesting to observe how working at higher abstraction layers narrows the scope of what you imagine could go wrong with a program. When you’re writing enterprise software with known tools or frameworks (say, Java with Spring Boot), you think about a library that may be causing undesired behaviour, a misconfiguration in your deployment, a missing flag in the JVM running your program, and so on. You rarely think about a cache miss (unless it’s a cache server 😅) or a SIMD computation. That’s my observation after spending too many years in the consultancy world.

As for the first type of programmers, they know how their program should behave, most of the time by using the standard numbers every programmer should know (you can see a nice visualization of those numbers here), also called order-of-magnitude estimates. Since these numbers are approximations, the author notes that the real learning happens when you check them: your estimate may be right or wrong depending on the heuristics you use. Still, understanding the desired vs. the current behaviour, and running some back-of-the-envelope calculations to set your expectations, greatly helps to level the ground and to communicate your expectations based on numbers, not guesses.

Just to wrap up this idea: knowing the load your service will take, the interactions of the service with other services (a service is an umbrella term here), and the runtime of your program, you can estimate how long your program should take, which also helps you rule out scenarios that don't fit that estimate. Quoting the book:

Knowing the estimates in Table 1.1 will also guide you in identifying the likely source of a performance bug. If some program fragment takes 100 msec more than you expect, the problem is unlikely to be related to branch misprediction, whose effect is 10,000,000 times smaller than 100 msec. It is more likely related to disk or network times, or as we will see in later chapters, to long lock-holding times or to waiting on long RPC sub-requests.

We will design and build observation tools and data displays. As you use them, get in the habit of predicting within an order of magnitude what you expect to see. Once you are practiced at this prediction-observation-comparison loop, you will rapidly start spotting the odd stuff.

This is also referred to as the thought framework in the book: “with the offered load, and a system under stress, what are the observations we can capture, how do we reason about what we see, and then how do we change those observations?”.
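
To make the back-of-the-envelope idea concrete, here is a minimal sketch in C. The latency constants are my own rounded approximations of the well-known order-of-magnitude numbers (not the book's Table 1.1), and the request shape is hypothetical:

```c
#include <stdio.h>

/* Rough order-of-magnitude latencies, in nanoseconds (rounded
 * approximations of the "numbers every programmer should know"). */
#define MAIN_MEMORY_REF_NS       100.0
#define DISK_SEEK_NS        10000000.0   /* ~10 msec */
#define DC_ROUNDTRIP_NS       500000.0   /* ~0.5 msec within a datacenter */

int main(void) {
    /* Hypothetical request: two in-datacenter RPC round trips,
     * one disk seek, and ~10,000 main-memory references. */
    double estimate_ns = 2 * DC_ROUNDTRIP_NS
                       + 1 * DISK_SEEK_NS
                       + 10000 * MAIN_MEMORY_REF_NS;

    printf("estimated request time: %.1f msec\n", estimate_ns / 1e6);
    return 0;
}
```

This prints an estimate of about 12 msec. If the observed time were 100 msec instead, the table-driven reasoning in the quote above immediately rules out micro-architectural suspects and points at disk, network, or lock waits.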

General terminology

There are a lot of terms that I thought I knew but had mostly taken for granted. It was great to see these definitions laid out in a holistic way, not tied to buzzwords or the trending design pattern (microservices, anyone?). Some of the terms that caught my eye were:

  • Offered load: the amount of work presented to a service, measured in requests (or transactions) per unit of time (e.g. 100K reqs/minute).
  • Service: a collection of programs responsible for handling a transaction.
  • Tail latency (explained below)
  • Execution skew: the variation (in time units) in completion time across requests or parallel sub-tasks.

Tail latency

It would be futile to try and explain tail latency all over again; however, I do think the p99 discussion in the book was an eye-opener for me, mostly because of the true impact you get when you manage to cut that latency by a factor of two or more.
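
As a side note, here is a hedged sketch of how one might compute a p99 from raw latency samples, using a nearest-rank percentile (my choice of method, not something the book prescribes):

```c
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile: sort the samples, then index into them. */
static double percentile(double *samples, size_t n, double p) {
    qsort(samples, n, sizeof samples[0], cmp_double);
    size_t idx = (size_t)(p * (double)n);   /* e.g. p = 0.99 for p99 */
    if (idx >= n) idx = n - 1;
    return samples[idx];
}

int main(void) {
    /* Made-up latencies in msec: mostly fast, one slow outlier. */
    double lat[] = {2, 3, 2, 4, 2, 3, 120, 2, 3, 2};
    size_t n = sizeof lat / sizeof lat[0];
    printf("p99 = %.1f msec\n", percentile(lat, n, 0.99));
    return 0;
}
```

With only ten samples the p99 is simply the worst case (120 msec here), which hints at why tail latency only becomes meaningful with lots of samples.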

Five fundamental resources

Those being: CPU, memory, disk, network, and, if your program is composed of a bunch of threads, the software critical section as the fifth. The interesting part about these resources is seeing the number of heuristics that govern each of them, and how those influence the measurement process. If you forget a variable, that mistake may yield flawed measurements.

Learning how to observe these factors is one of the author’s goals in the book.
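
As a tiny illustration of that fifth resource (my own pthreads example, not the book's), the time a thread spends blocked on the mutex below is spent waiting on the shared critical section rather than on CPU, memory, disk, or network:

```c
#include <pthread.h>
#include <stdio.h>

/* The lock guards the critical section: the "fifth resource" that
 * threads contend for, alongside CPU, memory, disk, and network. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* time blocked here is invisible */
        counter++;                    /* to tools that only count CPU   */
        pthread_mutex_unlock(&lock);  /* execution time                 */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* 4,000,000 */
    return 0;
}
```

Compile with -pthread. With four threads hammering one lock, most of the "work" is waiting, which is exactly the kind of cost (long lock-holding times) that the quote above mentions.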

Chapter 2: Measuring CPUs

This chapter was denser, since we started to talk about CPU measurement, with an example. To follow along, a “basic” knowledge of the CPU is fundamental (notice the quotes). My comments here are scarce, beyond a bunch of unanswered questions that I want to tackle first (outlined in the open questions section).

So far, things that I liked about the chapter are:

  • Richard followed the thought framework while running the experiment: starting from a deceptively simple scenario (a program that measures the latency of the add instruction) with an expectation set, he iterates through different attempts until the measurement can be trusted (a rough sketch of such a timing loop follows this list).
  • He delves into the eternal time-measuring dilemma.
  • The historical introduction (and refresher) on the evolution of CPUs over time.
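
Not the book's actual code, but a rough sketch of the kind of timing loop the chapter iterates on. The volatile already hints at the dilemma: defeating the optimizer changes what is being measured.

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    const long kIters = 100 * 1000 * 1000;   /* 100M iterations */
    volatile long sum = 0;  /* volatile keeps the compiler from deleting
                               the loop, but it also forces a load and a
                               store per add, so we measure more than the
                               add itself (the chapter's pitfall) */

    struct timespec start, stop;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < kIters; i++) {
        sum += 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &stop);

    double elapsed_ns = (double)(stop.tv_sec - start.tv_sec) * 1e9
                      + (double)(stop.tv_nsec - start.tv_nsec);
    printf("~%.2f ns per iteration (not per add!)\n",
           elapsed_ns / (double)kIters);
    return 0;
}
```

On a ~3 GHz machine the naive expectation is well under a nanosecond per add; the measured number will come out larger, and reasoning about that gap is exactly the prediction-observation-comparison loop from chapter 1.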

Open questions

Chapter two has an interesting disclaimer:

If you are not very familiar with the terms instruction fetch, pipelining, cache memory, and (for the next chapter) virtual memory, now would be a good time to review a computer architecture textbook, such as Hennessy and Patterson.

Of course, most of these concepts went over my head, and my specific open questions are:

  • How does the instruction fetch process work?
  • How does pipelining work?
  • How do the L1/L2/L3 caches work?
  • A refresher on memory (virtual memory, pages).
  • How does out-of-order execution work?

Closing remarks

This post (and subsequent ones in the series) aims to be my messy notes, hopefully orienting me as I learn and fill in the gaps. If someone besides me is reading this, thanks!

I hope this will be useful in the future.