When you’re developing new functionality on an existing website, especially on a hot new young fresh startup like Redfin, you can get really excited about moving fast and breaking things and really break things. I want to spend a little bit of time reflecting on a time that I broke things, and how load testing with Taurus helped me fix it.
What happened
A couple of months ago, I shipped a new feature to include links on our listing and property pages (an example). The change was mostly intended for SEO, to help search engines understand the relationship between the various pages we have on Redfin.com. Usually, when we ship a new feature, we put it behind a runtime feature toggle we call Bouncer, but I shipped some of my debugging code into production:
```java
if (Bouncer.isOn(Feature.SMART_INTERLINKING) || true) {
    // CRUSH THE DB!!!
}
```
Usually, we would have dialed the feature up gradually and discovered, slowly but surely, that the load this feature adds to our PostgreSQL cluster is untenable. Because of my debugging code, though, we saw the new controller go from not existing to taking a startling 15% of our total database SQL execution time overnight. I pushed a hotfix, and then had to quickly pull together a long-term solution. Because of the bug's high visibility, I had a lot of developers suggesting less expensive ways to query our database, and since the "later" in my plan to Optimize Later had arrived, it was important to come up with the solution that would put the least load on the database and return results blazingly fast.
What I tested
By the end of my initial investigation, I had come up with five different approaches to retrieving the data I needed from the database: some used a pre-computed mapping table, some used GIS queries, some used native SQL, and some used criteria queries. I added five separate methods to my Spring MVC controller:
```java
@Controller
public class ExpensiveComputationController {

    private @Inject ExpensiveComputationHelper helper;

    // Return types were elided in the original snippet; @ResponseBody Object
    // stands in here so the example compiles.
    @RequestMapping(value = "/api/variant1", method = RequestMethod.GET)
    public @ResponseBody Object doExpensiveComputation1() {
        return helper.expensiveVariant1();
    }

    @RequestMapping(value = "/api/variant2", method = RequestMethod.GET)
    public @ResponseBody Object doExpensiveComputation2() {
        return helper.expensiveVariant2();
    }

    // snip
}
```
Then, I needed to load test each of the variants. My first instinct was to use cURL to time the total time taken to return a response. I found this fantastic article on how to time curl requests, and got a reasonable bellwether for how long each variant takes:
```shell
curl -w "\ntime: %{time_total}s\n" https://redfintest.com/api/variant1
```
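The article's fuller recipe uses a format file with more of curl's timing variables. Here's a minimal sketch; every `%{...}` variable is a real curl `--write-out` variable, but the file name `curl-format.txt` is just my own convention:

```shell
# Each %{...} below is a real curl --write-out variable; curl substitutes
# them after the transfer completes.
cat > curl-format.txt <<'EOF'
    time_namelookup:  %{time_namelookup}s
       time_connect:  %{time_connect}s
 time_starttransfer:  %{time_starttransfer}s
         time_total:  %{time_total}s
EOF
# Usage (commented out so this sketch runs without network access):
# curl -w "@curl-format.txt" -o /dev/null -s https://redfintest.com/api/variant1
cat curl-format.txt
```

Breaking the total out into name lookup, connect, and time-to-first-byte makes it easier to see whether the server or the network is eating the time.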
However, sending a single request at a time is not particularly representative of how a given method or query performs, because a real production database is never running just one query; Redfin, for instance, has thousands of active, concurrent users as I write this article. To mimic that behavior, I needed a load generator to fire lots of concurrent requests and simulate real-world traffic. There are a lot of load generation tools available, like Siege, Apache JMeter, The Grinder, and Gatling. At Redfin, we recommend Siege for load testing major architectural changes, since solutions like Apache JMeter just can't generate enough load to replicate real production-like traffic, but I hoped to find a solution that was a little easier to use. Then I saw Taurus, which claimed I could write my tests once and run them using JMeter AND The Grinder AND Gatling, thanks to a DSL layered over the supported tools. Sweet!
How I tested it
Getting started with Taurus couldn't have been easier. I followed this article, but I'll also include the CliffsNotes® version here. I started out by installing the dependency:
```shell
sudo pip install bzt
```
Then I created my test:
```yaml
---
execution:
  concurrency: 25
  hold-for: 5m
  ramp-up: 1m
  scenario:
    requests:
      - url: https://redfintest.com/api/variant1?param1=val1&param2=val2
        method: GET
      - url: https://redfintest.com/api/variant1?param1=val3&param2=val4
        method: GET
      # ...snip...
```
Let’s break that down line-by-line:
```yaml
---
execution:
  concurrency: 25
```
I generated load for 25 concurrent users. This isn’t a particularly realistic number, but since I was generating the load from my laptop, and testing a locally-running process also running on my laptop, I was concerned that setting concurrent users too high would melt my laptop into my desk.
```yaml
  ramp-up: 1m
```
This gives Taurus 1 minute of ramp-up time to bring 25 concurrent users online, so that once the measured test starts, there really are 25 concurrent users making requests. It's much like stock car racing: the race doesn't start from a cold stop; the pack circles the track for a few laps before the green flag drops.
```yaml
  hold-for: 5m
  scenario:
    requests:
      - url: https://redfintest.com/api/variant1?param1=val1&param2=val2
        method: GET
      - url: https://redfintest.com/api/variant1?param1=val3&param2=val4
        method: GET
```
The test then holds at full load for 5 minutes, and, in this example, each user makes the two requests to https://redfintest.com/api/variant1 in succession before Taurus creates a new user and starts the scenario over from the beginning.
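To compare the five variants under identical load, each variant got its own run. A sketch of what a variant-2 config would look like, assuming the same shape as above (only the URLs change; the query parameters here are stand-ins):

```yaml
---
execution:
  concurrency: 25
  hold-for: 5m
  ramp-up: 1m
  scenario:
    requests:
      - url: https://redfintest.com/api/variant2?param1=val1&param2=val2
        method: GET
```

Keeping concurrency, ramp-up, and hold identical across runs is what makes the sample counts in the results table comparable.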
Then, I fired up the load test:
```shell
bzt scenario.yml
```
And I was so very pleased: all in all, going from deciding to use Taurus to having the results of a real load test took less than half an hour, and 6 minutes of that was spent actually running the test. Taurus installed all the JMeter dependencies for me, and after a short wait, I was greeted with a super awesome-looking terminal full of charts and graphs and numbers.
How I grokked the results
Taurus gave me lots of fantastic data about the amount of time it took to return results under load. The console output for one of the variants looks like this:
```
16:26:59 INFO: Samples count: 23740, 0.00% failures
16:26:59 INFO: Average times: total 0.348, latency 0.348, connect 0.000
16:26:59 INFO: Percentile 0.0%: 0.062
16:26:59 INFO: Percentile 50.0%: 0.317
16:26:59 INFO: Percentile 90.0%: 0.508
16:26:59 INFO: Percentile 95.0%: 0.670
16:26:59 INFO: Percentile 99.0%: 0.876
16:26:59 INFO: Percentile 99.9%: 1.227
16:26:59 INFO: Percentile 100.0%: 1.750
```
The top-line number, the samples count, was a reasonable approximation for how efficient a given variant was, since being able to process more requests in a fixed amount of time with a fixed number of concurrent requests means that each request must have been faster on average.
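That intuition checks out on the back of an envelope: with 25 users, a 5-minute hold, a ramp-up that contributes roughly half its duration at full strength, and the 0.348s mean from the console output above, the expected sample count falls out directly (all numbers below are taken from this run; the ramp/2 term assumes load grows linearly during ramp-up):

```shell
# Back-of-the-envelope throughput check.
awk 'BEGIN {
  users = 25      # concurrency
  hold  = 300     # hold-for, in seconds
  ramp  = 60      # ramp-up, in seconds
  mean  = 0.348   # average total response time, in seconds
  printf "expected samples: %.0f\n", users * (hold + ramp / 2) / mean
}'
```

That estimate lands within a few dozen of the 23,740 samples Taurus actually reported, which is why samples count works as a quick efficiency proxy.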
All in all, here is the table of results that came out:
| variant | 1      | 2      | 3      | 4      | 5      |
|---------|--------|--------|--------|--------|--------|
| samples | 35,299 | 35,382 | 37,970 | 4,443  | 37,669 |
| mean    | 0.234s | 0.234s | 0.218s | 1.868s | 0.219s |
| p0      | 0.039s | 0.045s | 0.029s | 0.078s | 0.040s |
| p50     | 0.220s | 0.224s | 0.207s | 1.794s | 0.210s |
| p90     | 0.301s | 0.293s | 0.287s | 2.814s | 0.282s |
| p95     | 0.351s | 0.327s | 0.333s | 3.225s | 0.322s |
| p99.5   | 0.718s | 0.703s | 0.695s | 4.131s | 0.689s |
| p99.9   | 1.182s | 1.683s | 0.849s | 5.577s | 0.961s |
| p100    | 1.722s | 3.296s | 2.049s | 6.544s | 2.041s |
But those numbers are only half of what I needed to know to make the right decision: they tell me how fast each variant is for customers, but not how much load we are putting on the database. Getting at the database statistics was more complicated, but thanks to work by some other teams here at Redfin, most of the heavy lifting had already been done for me. Here at Redfin we use a C port of Etsy's StatsD, collected by a Spring interceptor that sends data to StatsD every time we call the database, stored by Carbon, and displayed by Graphite. Through this magical setup, I was able to get the second half of what I wanted to know without doing any additional setup: I just pointed my browser at the appropriate Graphite dashboard and voilà!
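For the curious, the StatsD side of that pipeline is charmingly low-tech: every metric is a one-line UDP datagram in the form `<metric.name>:<value>|ms` for timings. A sketch, with a made-up metric name (our interceptor picks the real names):

```shell
# StatsD timing samples are plain text; the metric name below is invented
# purely for illustration.
payload='controllers.variant3.db_time:42|ms'
echo "$payload"
# Fire-and-forget UDP send, assuming a statsd daemon on the conventional
# port 8125 (commented out so the sketch runs anywhere):
# printf '%s' "$payload" | nc -u -w0 localhost 8125
```

Because the sends are UDP and fire-and-forget, the interceptor adds essentially no latency to the database calls it measures.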
Note that the order of variants on the graph is 1, 2, 2, 4, 5, 3 because I initially ran variant 2 when I should have run variant 3.
Based on the results we got, it was easy to tell that by deploying variant 3 in place of variant 4 (the version I initially shipped), we could expect 90% of users to see the component render in a tenth of the time it had been taking. Oh, and of course I shipped with automated end-to-end Selenium tests to prevent pushing bad code again, but that's a post for another day.