Engineer-to-Engineer Talk: How and Why Twitter Uses Scala

Published on May 1st, 2010 by Glenn Kelman

Updated on October 5th, 2020

To kick off our San Francisco series of engineer-to-engineer lectures on new technologies and interesting problems in consumer software, we invited in the Great Alex Payne to talk about how Twitter uses Scala, a programming language that combines traits of object-oriented languages and functional languages with an eye toward supporting concurrency better in large-scale software.

Alex started at Twitter in 2007, working remotely in Washington DC, when there were “only one and a half engineers.” Now, Twitter has 170 engineers. “It has been an interesting process,” Alex said. Right after his talk, Alex packed up his cats and headed for Portland, where he’ll still work for Twitter, but ensconced in a smaller, more closely-knit community. Here are his thoughts on Scala (Alex talks fast, and doesn’t waste many words, so my hands were in a rictus of agony from trying to type what he said):

Best, Glenn at Redfin

I started working on the programming interface when we were at this very early stage. Now, it handles a couple billion operations every day. It is being baked into more and more of the Web.

I’ve spent the past year working on Twitter’s infrastructure. For that, we use a weird language called Scala. I worked on a book for O’Reilly about Scala that you could sit down with over a three-day weekend to get up to speed on the language.

Why Use Scala?
Why use Scala when you have Ruby and Ruby on Rails? Well, we still use Rails. It works great for front-end stuff. The productivity is worth the tradeoff for working in a slower-performing dynamic language. When you think about what a web framework is doing under the hood, it’s tons and tons of string concatenation. Ruby on Rails can handle that.

As Twitter grew, what we needed was support for long-running, heavy processes: message queuing, and caching layers doing 20,000 operations a second. Ruby garbage collection is tough, and Ruby doesn’t do really well with long-running processes.

Languages Twitter Considered
We knew we needed another language. How did we pick a language that was really fun for us? We considered Java, C/C++ of course. And we looked at Haskell and OCaml for functional programming, though neither has gotten much commercial use. Erlang developers are doing stuff with a lot of network I/O but not with a lot of disk I/O; the knowledge-base around the language wasn’t great though, and the community seemed inaccessible.

Java is easy to use, but it’s not very fun, especially if you’ve been using Ruby for a while. Java’s productive, but it’s just not sexy anymore. C++ was barely considered as an option. Some guys said, if I have to work in C++ again, I’m going to stab my eyes out with a shrimp fork. JavaScript on the server side via Rhino had performance problems, and it wasn’t quite there yet when we were evaluating it.

So what were our criteria for choosing Scala? Well first we asked, was it fast, and fun, and good for long-running processes? Does it have advanced features? Can you be productive quickly? Developers of the language itself had to be accessible to us, as we’d been burned by Ruby in that respect. Ruby’s developers had been clear about focusing it on fun, even sometimes at the expense of performance. They understood our concerns about enterprise-class support but sometimes had other priorities.

We wanted to be able to talk to the guys building the language, not to steer the language, but at least to have a conversation with them.

Was Scala Fast?
And did Scala turn out to be fast? Well, what’s your definition of fast? About as fast as Java. It doesn’t have to be as fast as C or Assembly. Python is not significantly faster than Ruby. We wanted to do more with fewer machines, taking better advantage of concurrency; we wanted it to be compiled so it’s not burning CPU doing the wrong stuff.

What Alex Likes About Scala
Scala is a lot of fun to work in; yes, you can write staid, Java-like code when you start. Later, you can write Scala code that almost looks like Haskell. It can be very idiomatic, very functional — there’s a lot of flexibility there.
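
To make the contrast concrete, here is a small sketch of the same computation written both ways. The function names and the task (summing the squares of the even numbers) are made up for illustration and are not taken from the talk.

```scala
object StyleExample {
  // Java-like style: a mutable accumulator and an explicit loop.
  def sumOfEvenSquaresImperative(numbers: List[Int]): Int = {
    var total = 0
    for (n <- numbers) {
      if (n % 2 == 0) {
        total += n * n
      }
    }
    total
  }

  // Functional style: the same result as a pipeline of transformations.
  def sumOfEvenSquaresFunctional(numbers: List[Int]): Int =
    numbers.filter(_ % 2 == 0).map(n => n * n).sum

  def main(args: Array[String]): Unit = {
    val numbers = List(1, 2, 3, 4)
    println(sumOfEvenSquaresImperative(numbers))
    println(sumOfEvenSquaresFunctional(numbers))
  }
}
```

Both versions compile to JVM bytecode and return the same value; the choice between them is a matter of style.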

And it’s fast. The principal language developer at Scala worked on the JVM at Sun. When Java started, it was clearly a great language, but the VM was slow. The JVM has been brought to the modern age and we don’t think twice about using it.

Scala can borrow Java libraries; you’re compiling down to Java bytecode, and it’s all calling back and forth in a way that is really efficient. We haven’t run into any library dependencies that cause problems. We can hire people with Java experience and they can do pretty well.
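
As a rough illustration of that interoperability, the sketch below calls two classes from the JDK's own `java.util` package directly from Scala, with no wrapper layer. The `sortedHandles` helper is a hypothetical name chosen for this example.

```scala
import java.util.{ArrayList, Collections}

object InteropExample {
  // Builds and sorts a plain java.util.ArrayList using the JDK's own API.
  def sortedHandles(handles: String*): ArrayList[String] = {
    val list = new ArrayList[String]()
    handles.foreach(h => list.add(h))
    Collections.sort(list) // a static Java method, called like any Scala method
    list
  }

  def main(args: Array[String]): Unit = {
    println(sortedHandles("carol", "alice", "bob"))
  }
}
```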

The community is small but growing, and it’s really accessible. We got to sit down with Martin and ask him and his team about funding for Scala, how problems with Scala will get solved. We’ve never really had to call on that level of access, but it’s really nice to know it’s there.

The Grand Unified Theory of Scala
The grand unified theory of Scala is that it combines object-oriented programming (OOP) and functional programming (FP). Scala’s goal is to essentially say OOP and FP don’t have to be these separate worlds. It’s kind of zen, and you don’t get it when you first start. It’s really, really powerful; it’s nice to have a language with a thesis, rather than trying to appeal to every programmer out there. Scala is trying to solve a specific intellectual problem.

You have methods that take anything between a string and several points away on the inheritance chain from a string. The syntax is more flexible than Java; it’s very human-readable, as you can leave out the period between method calls so it looks like a series of words. Your program can make nice declarative statements about the logic of what you’re trying to do.
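
A short sketch of the "series of words" syntax follows. The `Account` class is hypothetical. Newer Scala versions may steer you toward the dotted form for alphanumeric method names, so treat this as an illustration of the feature rather than a style recommendation.

```scala
object InfixExample {
  // A hypothetical immutable account, used only to demonstrate the syntax.
  case class Account(balance: Int) {
    def deposit(amount: Int): Account = Account(balance + amount)
    def withdraw(amount: Int): Account = Account(balance - amount)
  }

  val start = Account(100)

  // Conventional dotted method calls.
  val dotted = start.deposit(50).withdraw(30)

  // The same calls with periods and parentheses left out.
  val infix = start deposit 50 withdraw 30

  def main(args: Array[String]): Unit = {
    println(dotted)
    println(infix)
  }
}
```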

Traits, Pattern-Matching, Mutability
With Scala, you can also use traits. This is handy because of course you have cross-cutting concerns in your application. For example, every object needs to be able to log stuff, but you don’t want everything extending from a logger class — that’s crazy. With Scala, you can use a trait to shove that right in, and you can add as many traits as you like to a given class or object.
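
The logging case can be sketched as below. All names (`Logging`, `Counting`, `FollowService`) are hypothetical and the in-memory buffer stands in for a real log destination; this is not Twitter's logging code.

```scala
object TraitExample {
  // A cross-cutting concern packaged as a trait.
  trait Logging {
    private val entries = scala.collection.mutable.ListBuffer.empty[String]
    def log(message: String): Unit = { entries += message; () }
    def logged: List[String] = entries.toList
  }

  // A second, unrelated trait, to show that several can be mixed in.
  trait Counting {
    private var calls = 0
    def countCall(): Unit = calls += 1
    def callCount: Int = calls
  }

  // The class gains both behaviors without inheriting from a logger class.
  class FollowService extends Logging with Counting {
    def follow(from: String, to: String): Unit = {
      countCall()
      log(s"$from now follows $to")
    }
  }

  def demo(): (List[String], Int) = {
    val service = new FollowService
    service.follow("alice", "bob")
    (service.logged, service.callCount)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```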

You can choose between mutability and immutability. This can be dangerous. 9 out of 10 times you use immutable variables when you want predictability, especially when you have stuff running concurrently. But Scala trusts the programmer for mutability when he or she needs it.
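
In code, the choice looks roughly like this. The values are invented for illustration.

```scala
object MutabilityExample {
  // Immutable by default: a val cannot be reassigned, and "adding" to an
  // immutable List returns a new list while the original stays unchanged.
  val followers = List("alice", "bob")
  val moreFollowers = "carol" :: followers

  // Mutability is opt-in: a var can be reassigned...
  var retries = 0
  retries += 1

  // ...and mutable collections live in their own package.
  val cache = scala.collection.mutable.Map.empty[String, Int]
  cache("alice") = 42

  def main(args: Array[String]): Unit = {
    println(followers)
    println(moreFollowers)
    println(retries)
    println(cache)
  }
}
```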

Scala has the concept of lazy values – you can say lazy val x = a really complicated function. That isn’t going to be calculated until the last second, when you need that value. This is nice.
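
A minimal sketch of a lazy value, with a counter added purely to make the deferred evaluation observable. `expensive` is a hypothetical stand-in for the "really complicated function".

```scala
object LazyExample {
  // Counts how many times the expensive computation actually runs.
  var evaluations = 0

  def expensive(): Int = {
    evaluations += 1
    (1 to 1000).sum
  }

  // Nothing is computed here; the right-hand side runs on first access
  // and the result is cached for every later access.
  lazy val x: Int = expensive()

  def main(args: Array[String]): Unit = {
    println(evaluations) // the computation has not run yet
    println(x)
    println(x)
    println(evaluations) // still a single evaluation
  }
}
```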

Pattern-matching is nice too. It lets you dive into a data structure so you can, for example, explode out a collection that matches an array with “2” as its third element. You can break out strings and regular expressions, and you can pattern-match groups with regular expressions.
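
Both cases can be sketched as follows. The function names and the "follows" text format are assumptions made up for the example.

```scala
object PatternExample {
  // Matches any array whose third element is 2, whatever its length,
  // and binds the first element along the way.
  def describe(values: Array[Int]): String = values match {
    case Array(first, _, 2, _*) => s"starts with $first and has 2 as its third element"
    case _                      => "no match"
  }

  // A regular expression used as a pattern: each group binds to a name.
  private val Follow = """@(\w+) follows @(\w+)""".r

  def parseFollow(line: String): Option[(String, String)] = line match {
    case Follow(follower, followed) => Some((follower, followed))
    case _                          => None
  }

  def main(args: Array[String]): Unit = {
    println(describe(Array(7, 9, 2, 4)))
    println(describe(Array(7, 9)))
    println(parseFollow("@alice follows @bob"))
  }
}
```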

An oddball feature that is really useful is the ability to use XML literals, so that you can make something equal to an XML literal, as if the XML literal is a string. You don’t have to import Sax or some crazy XML library.

The Concurrency Story
When people read about Scala, it’s almost always in the context of concurrency. Concurrency can be solved by a good programmer in many languages, but it’s a tough problem to solve. Scala has an Actor library that is commonly used to solve concurrency problems, and it makes that problem a lot easier to solve.

An Actor is an object that has a mailbox; it queues messages and deals with them in a loop, and it can leave a message on the floor when it doesn’t know what to do with it.

You can model concurrency as messages, each a unit of work, sent to actors, which is really nice. It’s like using a queuing system. You can also use the java.util.concurrent stuff, Netty and Apache Mina, dropping it right in. You can rewrite the Actor implementation, and some folks have gone so far as rolling their own software transactional memory libraries.
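
The mailbox idea can be sketched without any actor library at all. The code below is not the Scala Actor library's API; it is a hand-rolled stand-in built on `java.util.concurrent`, and `CountingActor`, `Tweet` and `Stop` are hypothetical names. It shows a queue of messages, a loop that handles them one at a time, and an unknown message being dropped.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}

object MailboxExample {
  final case class Tweet(text: String)
  case object Stop

  // One mailbox, one thread draining it in a loop.
  class CountingActor {
    private val mailbox = new LinkedBlockingQueue[Any]()
    private val done = new CountDownLatch(1)
    // Written only by the worker thread; read after the latch is released.
    @volatile private var handled = 0

    private val worker = new Thread(new Runnable {
      def run(): Unit = {
        var running = true
        while (running) {
          mailbox.take() match {
            case Tweet(_) => handled += 1
            case Stop     => running = false
            case _        => () // unknown message: leave it on the floor
          }
        }
        done.countDown()
      }
    })
    worker.setDaemon(true)
    worker.start()

    // Sending never blocks the caller on the work itself, only on enqueueing.
    def !(message: Any): Unit = mailbox.put(message)

    // Waits (up to five seconds) for Stop to be processed, then reports the count.
    def awaitStop(): Int = {
      done.await(5, TimeUnit.SECONDS)
      handled
    }
  }

  def demo(): Int = {
    val actor = new CountingActor
    actor ! Tweet("hello")
    actor ! "something unexpected"
    actor ! Tweet("world")
    actor ! Stop
    actor.awaitStop()
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Because only the worker thread touches the actor's state, the handler needs no locks; the queue is the single point of synchronization.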

Java interoperability is a big, big win. There are ten years of great libraries, things like Joda-Time. We use a lot of Hadoop and it has been easy to wire Scala to the Hadoop libraries. We use Thrift, without having to patch it; we use libraries from the Apache Commons and from Google.

How Twitter Uses Scala
So that’s why we use Scala, but how do we use it?

In the enterprise world, a service-oriented architecture is not new, but in Web 2.0 it is crazy new science. With PHP or Ruby on Rails, when you need more functionality, you just include more plugins and libraries, shoving them all in to the server. The result is a giant ball of mud.

So anything that has to do heavy lifting in our stack is going to be an independent service. We can load-test it independently, it’s a nice way to decompose our architecture.

What services at Twitter are Scala-powered? We have a queuing system called Kestrel. It uses a souped-up version of the memcache protocol. We originally wrote it in Ruby; it got us through a few weeks, but because Ruby is a dynamic language, the service began to show its performance weak spots.

Flock to Store the Social Graph
We use Flock to store our social graph, as a denormalized list of user ids. It’s not a graph database, so you can’t perform random walks along the graph. But it’s great for quickly storing denormalized sets of user ids, and doing intersections. We’re doing 20,000 operations a second right now, backed by a MySQL schema designed to keep as much as possible in memory. It has been very efficient — not many servers are needed.
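
To illustrate the kind of query described here, the sketch below intersects two sets of user ids. It is an in-memory toy with invented data and a hypothetical `mutualFollowers` function, and says nothing about how Flock itself is implemented or stored.

```scala
object GraphSketch {
  // Follower ids keyed by user id; an in-memory stand-in for the real store.
  val followers: Map[Long, Set[Long]] = Map(
    1L -> Set(10L, 11L, 12L),
    2L -> Set(11L, 12L, 13L)
  )

  // Users who follow both a and b: a plain set intersection.
  def mutualFollowers(a: Long, b: Long): Set[Long] =
    followers.getOrElse(a, Set.empty[Long])
      .intersect(followers.getOrElse(b, Set.empty[Long]))

  def main(args: Array[String]): Unit = {
    println(mutualFollowers(1L, 2L))
  }
}
```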

Hawkwind for People Search
Our people-search is powered by a Scala-built service we called Hawkwind. It’s a bunch of user objects dumped out by Hadoop, where the request is fanned out to multiple machines and then pulled back together.
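
The fan-out-and-gather shape can be sketched with futures from today's Scala standard library, which postdates this talk. The shards here are in-memory lists and every name is hypothetical; the real service sends the request to separate machines.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FanOutExample {
  // Hypothetical shards, each holding part of the user index.
  val shards: List[List[String]] = List(
    List("alice", "albert"),
    List("bob", "alina"),
    List("carol")
  )

  // One asynchronous lookup per shard.
  def searchShard(shard: List[String], prefix: String): Future[List[String]] =
    Future(shard.filter(_.startsWith(prefix)))

  // Fan the query out to every shard, then merge the partial results.
  def search(prefix: String): List[String] = {
    val partials = shards.map(shard => searchShard(shard, prefix))
    val merged = Future.sequence(partials).map(_.flatten.sorted)
    Await.result(merged, 5.seconds)
  }

  def main(args: Array[String]): Unit = {
    println(search("al"))
  }
}
```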

Hosebird for Streaming
We stream out tweets to public search engines, using a low-latency, HTTP-based, persistent connection system called Hosebird. We looked at queuing systems that financial-services companies use, but couldn’t find anything that could handle the volume of the load. We built something on top of Jetty using Scala. We have more Scala-powered services in the works that I can’t talk about.

Thrift for Transferring Data
We also use Thrift, built at Facebook then open-sourced at Apache. With Thrift, you can define data structures and methods, and it deals with everything you don’t want to deal with to efficiently represent data and get it from point A to point B. As your system evolves, your method signatures change, and Thrift has a nice system for creating positional arguments and being backwards compatible.

These services make our life a lot easier. We often staff projects with two people who are pair programming, sitting together for six or eight weeks. These guys can build something like people-search in a couple of months.

The only problem with so many different teams is that there is some divergence in terms of operational approaches – we have to work with ops guys to monitor the right stuff, be it disk or memory or what have you — but we can resolve that jitter over time. We’re ok with the tradeoffs.

The Development Environment
OK, now let’s talk about the tools… the IDEs for Scala are not up to snuff, that is true. IntelliJ IDEA is good but it’s shockingly buggy. The solution we’ve settled on is just using a plain text editor. We use Emacs, as there’s a really nice mode for the build tool. That takes compile/test BS out of your workflow. Of course, you can give the IDEs a try. Even though I’m an IDE cynic, maybe they’ve improved; that said, a plain text editor can be really productive.

Simple Build Tool
sbt is our Simple Build Tool, but it’s not simple or limited in any way. It’s Scala’s answer to Ant and Maven, and really it’s a superset of Ant and Maven. It’ll set up a new project, create a nice project structure for you and manage dependencies — you can slap ‘em right in by copying XML.

You can write your own build tasks. We added support for Thrift in an afternoon; it’s got a library for shelling out, as Java is not so great at shell operations because it targets so many platforms. sbt is well documented. And the absolutely coolest feature is that it’s got an interactive console interface where you can type in code and see how it works.

So that means sbt can insert you in an interactive way into your running program. This is great for debugging, great for sketching code out. You have a nice workflow where you don’t have to worry about compilation.

specs
We’re very test-driven, we’re not wedded to behavior-driven development (BDD), but the best library in Scala is BDD-oriented. You can throw in different mocking libraries, and it works just as well in Scala as Java.

Libraries
We’ve built a bunch of libraries. We gather a lot of stats, I mean, A LOT. We spent the first year of Twitter pushing forward on features, but never thinking about what we were building scientifically. That bit us in the ass in a big way.

You’ve probably seen a gradual increase in stability. At conferences, people ask us if it was the switch from Ruby to Scala, or if it was more machines. But really what did it was gathering numbers on everything, setting metrics and trying to improve.

Ostrich helps here. It is an in-process statistics gatherer, with counters, gauges, timers. You can share stats via JMX, JSON-over-HTTP etc. Hopefully it’s pretty simple to use and easy to integrate.
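
To give a feel for what an in-process statistics gatherer does, here is a toy version with counters and timers only. It is not Ostrich's API: `Stats`, `incr` and `time` are hypothetical names, gauges and the JMX/HTTP export are left out, and the real library should be consulted for actual usage.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

object StatsSketch {
  class Stats {
    private val counters = new ConcurrentHashMap[String, AtomicLong]()
    private val timerCalls = new ConcurrentHashMap[String, AtomicLong]()
    private val timerNanos = new ConcurrentHashMap[String, AtomicLong]()

    // Returns the cell for a name, creating it atomically on first use.
    private def cell(map: ConcurrentHashMap[String, AtomicLong], name: String): AtomicLong = {
      val fresh = new AtomicLong(0L)
      val existing = map.putIfAbsent(name, fresh)
      if (existing == null) fresh else existing
    }

    // Counter: bump a named value.
    def incr(name: String, by: Long = 1L): Long = cell(counters, name).addAndGet(by)

    // Timer: run a block, recording how often it ran and for how long.
    def time[T](name: String)(body: => T): T = {
      val start = System.nanoTime()
      try body
      finally {
        cell(timerCalls, name).incrementAndGet()
        cell(timerNanos, name).addAndGet(System.nanoTime() - start)
      }
    }

    def counter(name: String): Long = cell(counters, name).get()
    def calls(name: String): Long = cell(timerCalls, name).get()
    def totalNanos(name: String): Long = cell(timerNanos, name).get()
  }

  def demo(): (Long, Long, Int) = {
    val stats = new Stats
    stats.incr("tweets")
    stats.incr("tweets", 2)
    val result = stats.time("sum") { (1 to 100).sum }
    (stats.counter("tweets"), stats.calls("sum"), result)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```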

Configgy manages configuration files and logging in a really nice, flexible way. You can include config files in one another and you can do inheritance; it throws in a really nice logging wrapper, with lazy evaluation on the values you’re trying to log so you don’t burn machine-time generating log statements. It has a subscription API for pushing out a new config file. It’s a little crazy to have our own config file format, but Scala makes it work.
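
The lazy-evaluation trick for log statements can be sketched with a by-name parameter. This shows the general language feature, not Configgy's actual logging API; `Logger`, `debug` and `expensiveDump` are hypothetical names.

```scala
object LazyLoggingExample {
  class Logger(debugEnabled: Boolean) {
    private val lines = scala.collection.mutable.ListBuffer.empty[String]

    // `message` is a by-name parameter: the expression the caller passes is
    // only evaluated if this method body actually uses it.
    def debug(message: => String): Unit =
      if (debugEnabled) { lines += message; () }

    def output: List[String] = lines.toList
  }

  // Returns how many times the expensive message was built, and what was logged.
  def demo(debugEnabled: Boolean): (Int, List[String]) = {
    var builds = 0
    def expensiveDump(): String = { builds += 1; "expensive dump" }

    val logger = new Logger(debugEnabled)
    logger.debug(expensiveDump())
    (builds, logger.output)
  }

  def main(args: Array[String]): Unit = {
    println(demo(debugEnabled = false))
    println(demo(debugEnabled = true))
  }
}
```

With debug logging off, the message string is never built, so a disabled log statement costs little more than the check itself.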

xrayspecs: this is an extension to specs, because we need a way to test concurrent operations. Some of the extensions in xrayspecs have been merged back into specs. We can freeze and unfreeze time.

scala-json: this is a better Scala JSON codec. We’ve used this really heavily in production for a while. If you need something like this, hopefully it’ll do the job.

Other Twitter Scala libraries: Naggati (protocol builder for Apache Mina), Smile (Actor-powered memcached client), Querulous (a nice SQL database client) and Jackhammer (a load testing framework in its early stages). Check out GitHub for more.

How Do we Teach People?
I think we’re employing at Twitter about half the people in the world who know the Scala language. The other half are academics or at Foursquare. Even though Scala’s getting more and more popular, fundamentally we can’t hire people with experience in the language.

Pair Programming, Code Reviews
To start people out, we pair program. It isn’t mandatory at Twitter, but it’s a great way to learn Scala. We’ve come up with a bunch of style guides. The good and bad thing is that Scala’s going to be C++ in ten years, because there’s just a lot of surface area and it can get complicated. For that reason, we are pretty rigorous about code style.

We do code reviews; it doesn’t go into the master branch if it hasn’t been reviewed by your peers. Right now, I’m working with a guy we hired from Google. He’s an amazing engineer, far better than I am, but at first he didn’t know Scala.

When I looked at his code, there was absolutely nothing wrong under the hood. But we’d go through and say, “Here’s where this line could be a little more idiomatic from a Scala perspective.” I do classes over lunch – but you need a big group to commit to come every week. Then there’s my book, and there are other books: Dave Pollak’s book, the Odersky book (Programming in Scala, aka “the stairway book”). If you learn by example and need a desk reference, grab “the stairway book.” Or search Google for a talk by my co-worker on “The Seductions of Scala” for lots of examples.

What Version of Scala Does Twitter Use?
We use 2.7. It’s got a couple of warts, particularly in the collections classes. Scala 2.8 fixes a lot of those warts, and there’s a bunch of performance work in there too, plus the ability to have named arguments in your functions.
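
For readers who have not seen them, named arguments (available from Scala 2.8, together with default arguments) look roughly like this. The `connect` function and its parameters are hypothetical.

```scala
object NamedArgsExample {
  // Parameters with defaults can be skipped; any parameter can be named.
  def connect(host: String, port: Int = 8080, timeoutMs: Int = 1000): String =
    s"$host:$port (timeout ${timeoutMs}ms)"

  // Positional call, relying on both defaults.
  val a = connect("localhost")

  // Naming an argument lets you skip over `port` and set only the timeout.
  val b = connect(host = "queue1", timeoutMs = 250)

  // Named arguments can also be given in any order.
  val c = connect(port = 11211, host = "cache1")

  def main(args: Array[String]): Unit = {
    println(a)
    println(b)
    println(c)
  }
}
```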

I’m co-organizing a Scala summit at the OSCON conference in Portland this summer; come to that if you want to learn more! There’s a great blog called DailyScala, where an engineer writes about what he’s learning. I learn stuff from that guy all the time…

And that was it! Many thanks to Alex for his magnificent talk, and to all the lovely folks who visited our offices! We had a lot of fun, we learned a ton, and now we’re looking forward on May 20 to hearing from Cloudera’s Jeff Hammerbacher, the man who conceived of and built the data team at Facebook, on Hadoop. Everyone’s invited!

Glenn Kelman

Glenn is the former CEO of Redfin. Prior to joining Redfin, he was a co-founder of Plumtree Software, a Sequoia-backed, publicly traded company that created the enterprise portal software market. In his seven years at Plumtree, Glenn at different times led engineering, marketing, product management, and business development; he also was responsible for financing and general operations in Plumtree’s early days. Prior to starting Plumtree, Glenn worked as one of the first employees at Stanford Technology Group, a Sequoia-backed start-up acquired by IBM. Glenn was raised in Seattle and graduated from the University of California, Berkeley.
