Normal is not something to aspire to, it’s something to get away from.
Scout reads go slow
A few weeks ago, as we were about to launch our iPhone app, we discovered that one of its core features, Scout, frequently took seconds to render.
For some background on Scout: at TheLadders, our mission is to find the right person for the right job. One of the ways we strive to deliver on that promise is to provide jobseekers with information about jobs they’ll find nowhere else. Serving that mission is Scout, which in a nutshell lets jobseekers view anonymized information about the applicants who have applied to the job they are viewing. Salary, education, career history: we present a lot of useful information to jobseekers about their competition for any given job.
Over time, some attractive jobs accumulate on the order of 30 to 60 applicants, yielding response times of over 1 second, because serving a single Scout view request takes multiple synchronous requests made serially. Under higher load, request times sometimes run well beyond that.
That puts Scout in unusably slow territory, as the Graphite chart below indicates:
The graph shows the time it takes to form a response to a view-job request issued by our iPhone app. It plots the 95th percentile, which means that on any given date 5% of requests took at least as long as the lines in the graph show. One in twenty requests took this long or longer. There are many lines because we have a horizontally scalable architecture, so there are many backend app nodes.
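To make the percentile reading concrete, here is a minimal sketch of a 95th-percentile calculation using the nearest-rank method. The sample response times are made up for illustration, and Graphite may compute its percentiles differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds for 20 requests.
response_times_ms = [120, 130, 125, 140, 110, 135, 128, 122, 131, 119,
                     127, 133, 126, 129, 124, 121, 138, 132, 123, 2400]

p95 = percentile(response_times_ms, 95)  # 140 for this sample
```

In this made-up sample, one request in twenty is a 2400 ms outlier. An average would blur it into the rest, whereas a high percentile keeps the slow tail visible.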
We managed to bring those seconds down to milliseconds, roughly a 1000x reduction during times of high load. Below I’ll describe the changes in our architecture that enabled us to make such a huge improvement.
In its initial implementation, Scout’s applicant information was gathered and assembled on the fly for each and every request. Driving the iPhone app, we have a backend app server, which is essentially just a number of RESTful endpoints against which our iPhone app issues requests. Below is a quick rundown of that architecture, after which I’ll trace a request through it.
Below this backend server there are a number of RESTful entity servers with which the app server is interacting via HTTP.
These entity servers in turn query each other and the canonical data store, in our case Clustrix, and that’s that.
So when a user of our iPhone app taps on a job, a request is sent to the backend app server…
…which then issues a request to our job application service for all job applications for that job. The response contains a number of links to where those job applications may be retrieved.
The backend server iterates over those links, requesting the job applications themselves one at a time. Just as before, adhering to hypermedia design, the response contains a link to the jobseeker who applied to the job. For your sanity, I’ve simplified the response to contain only the job seeker link:
Finally, with that result set, the orchestration service then issues a number of requests to the job seeker service for information about the job seekers who have applied to the job being viewed. In its initial implementation all of the requests were synchronous and in series, as I mentioned earlier. We eventually parallelized them, as you can see in the Graphite chart, where the big spikes on the left diminish towards the right.
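The difference between the serial and the parallelized fan-out can be sketched as follows. This is an illustration only: `fetch_job_seeker` is a hypothetical stand-in that sleeps instead of making an HTTP request, the links are made up, and a thread pool is just one way to overlap the requests, not necessarily how our service did it.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_job_seeker(link, latency_s=0.01):
    """Hypothetical stand-in for an HTTP GET against the job seeker service."""
    time.sleep(latency_s)  # simulate network and service time
    return {"link": link, "profile": "profile for " + link}


def fetch_serially(links):
    # One request at a time: total time grows linearly with the applicant count.
    return [fetch_job_seeker(link) for link in links]


def fetch_in_parallel(links, max_workers=16):
    # Requests overlap; pool.map returns results in the order of the input links.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_job_seeker, links))


# Made-up links for a job with 30 applicants.
links = ["/jobseekers/%d" % i for i in range(30)]
```

With 30 applicants the serial version pays the per-request latency 30 times, while the parallel version pays it roughly once per batch of workers. Either way, every view still triggers all of those requests, which is the work the rest of this post removes.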
The iPhone app backend server then extracts the relevant information from those job seekers’ profiles, and returns them as a JSON array of applicants to the mobile app.
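As a rough sketch of that last step, the extraction amounts to projecting each profile down to the fields Scout shows and serializing the result. The field names below are assumptions chosen to match the salary, education and career history mentioned earlier; the real profile schema is not shown in this post.

```python
import json

# Assumed field names; the real profile schema is not shown in this post.
SCOUT_FIELDS = ("salary", "education", "career_history")


def to_applicant(profile):
    """Keep only the anonymized fields that Scout displays."""
    return {field: profile.get(field) for field in SCOUT_FIELDS}


def scout_response(profiles):
    """Build the JSON array of applicants returned to the mobile app."""
    return json.dumps([to_applicant(profile) for profile in profiles])
```

Identifying fields such as a name or email address never make it into the response, because only the whitelisted fields are copied.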
That is not just a lot of words and diagrams, that is a lot of work!
The workflow includes multiple objects serializing and deserializing, HTTP transfers, hitting the canonical store etc. Why does each request need to assemble this data itself? Why bother hitting the database? Is there an alternative? It seems like a natural fit for a document-oriented database, as the data we are passing back to the client is just a JSON object containing an array of applicants. We could stand a Varnish cache in front of the Scout endpoints on the orchestration service, but then we’d be trading freshness for speed. On the platform team we like to deliver data fast and fresh (and furious).
Scout reads go fast
Principal Architect Sean T Allen set Andy Turley and me to improving Scout’s performance. The architecture is surprisingly simple: stick the data in Couchbase and have the iPhone app backend query that instead. How would we keep this data up-to-date? The first step is to have the job application entity service emit a RabbitMQ event when it receives an application from a job seeker to a particular job (a PUT returning a 201). On the other end of that message queue there is a Storm topology that should listen for that message. The RabbitMQ message would be the entry point into the spout.
The message contains a link to the job seeker who applied to the job, as well as the ID for the job to which she applied. The message isn’t actually encoded as JSON and transmitted over the wire, but for clarity I’ve displayed the RabbitMQ message as JSON.
The second step, after having received the RabbitMQ message, fetches the job seeker profile from the jobseeker service, and passes that information to the next step.
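Stripped of the Storm and RabbitMQ plumbing, the first two steps might look like the sketch below. The class and function names are hypothetical, and the profile fetch is passed in as a plain callable so the example stays self-contained; it does not use the Storm or RabbitMQ APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApplicationReceived:
    """The two pieces of information the message carries (names are assumed)."""
    job_id: str
    job_seeker_link: str


def fetch_profile_step(event, fetch_profile):
    """Second step: resolve the job seeker link to a profile and pass the
    job ID and profile on to the next step."""
    profile = fetch_profile(event.job_seeker_link)
    return event.job_id, profile


def fake_fetch_profile(link):
    # Stand-in for the HTTP call to the jobseeker service.
    return {"link": link, "salary": 100000}


event = ApplicationReceived(job_id="job-42", job_seeker_link="/jobseekers/7")
job_id, profile = fetch_profile_step(event, fake_fetch_profile)
```

Each step receives only what it needs from the previous one, so the persistence step that follows only has to deal with a job ID and a profile.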
This third step is responsible for persisting the applicant information to a Couchbase bucket. It uses the job ID as the key, and it does a create or update operation on the document corresponding to that key depending on whether there are applicants already in the bucket for that job.
That last diagram is a bit of a simplification. Although Couchbase is “JSON-aware”, it lacks the ability to perform certain operations on the JSON documents it stores. For example, if the document being stored is an Array, and the client’s append method is called, we hoped that Couchbase would add that element to the end of the Array. Instead, it’s just a blind String.append, resulting in an invalid JSON document. As a result, we had to implement our own append operation by reading the document (if it exists), adding an item to a list if it’s not already there, and then writing the document. So it’s more like two operations than one.
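The read, modify, write append described above can be sketched like this. `InMemoryBucket` is a dict-backed stand-in invented for the example, not the Couchbase client API, and `blind_append` only imitates the string concatenation that tripped us up.

```python
import json


class InMemoryBucket:
    """Dict-backed stand-in for a Couchbase bucket (not the real client API)."""

    def __init__(self):
        self._docs = {}

    def get(self, key):
        return self._docs.get(key)  # None when the key is missing

    def set(self, key, value):
        self._docs[key] = value


def blind_append(bucket, job_id, applicant):
    # Imitates a string-level append: the bytes are simply concatenated.
    bucket.set(job_id, (bucket.get(job_id) or "") + json.dumps(applicant))


def append_applicant(bucket, job_id, applicant):
    """Read the document if it exists, add the applicant unless it is
    already present, then write the whole document back."""
    raw = bucket.get(job_id)
    applicants = json.loads(raw) if raw is not None else []
    if applicant not in applicants:
        applicants.append(applicant)
        bucket.set(job_id, json.dumps(applicants))
    return applicants


def applicants_for(bucket, job_id):
    # The read path: a single key lookup by job ID.
    raw = bucket.get(job_id)
    return json.loads(raw) if raw is not None else []
```

Note that a read followed by a write is not atomic, so two concurrent writers for the same job could overwrite each other. Couchbase offers check-and-set (CAS) values for this kind of optimistic update, though this sketch does not model them.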
Now when TheLadders mobile service gets a request for Scout information for a job, all it does is a lookup in Couchbase with that job ID and returns the applicants associated with that key.
Dramatically faster, even at the 95th percentile.
SOA is no panacea. There are many instances where querying a number of backend servers to assemble and aggregate data returned from a database simply doesn’t make sense. In those cases, you may do well to denormalize that data and put it in a store that’s more efficient for retrieval.
If you find this post interesting, join the discussion over on Hacker News.