We know it takes time and energy to build amazing mobile apps and keep them crash- and bug-free. That’s why we built Crittercism.
Our product team faces three challenges in keeping the mobile app developers who use Crittercism happy:
- Creating error-free, lightweight client code
- Building and maintaining the website our customers use to gather insight from the data we collect
- Developing and operating “Crittercism Data,” our Internet-scale data platform
Crittercism Data must remain operable even as we process millions of requests each hour, with 50% daily peak-to-trough fluctuations in traffic levels; detailed monitoring is how we ensure the platform is properly provisioned and error-free.
When I joined the company (May 2012), the team had configured Pingdom to poll the site’s status every five minutes. They knew when things weren’t working, but wanted more going forward: visibility into real-time system behavior rather than a simple “are we up” test, and a way to gauge capacity for infrastructure planning.
The goal: to deliver the highest level of customer satisfaction possible, at the lowest ongoing cost to the company.
Measuring what matters
An Internet-scale data platform is a layer cake. From bottom to top:
- Infrastructure: physical computers, on-premise or at a datacenter
- Hypervisor: allows multiple virtual machines to share a single physical host
- Operating system: runs inside a virtual machine, or directly “on the metal”
- Data stores: MongoDB, Redis, memcached
- Webserver: responds to incoming HTTP requests and writes responses (gunicorn)
- Application: code written in a high-level language such as Python (our choice), PHP, or Ruby, which runs inside the webserver and services incoming requests
- Load balancers: divide incoming traffic among a group of application instances (nginx)
Measurements closest to the end-user tell the most about the end-user experience, but are difficult to obtain and can mask the true problem. If the webserver is slow, is it because the application code is written poorly, or because a database query is running too slowly? And is the database running too slowly because it can’t keep up with the incoming request stream, or has the machine run out of disk space?
Our first metrics: the load balancers
We began by measuring HTTP request rate and latency at the load balancer: both directly affect the end-user experience, and a sudden change in either means trouble. We considered Ganglia, Nagios, and rolling our own time-series database, but were sold on Graphite’s simplicity, performance, and visualization tools. A Chef-installed cronjob, logster, and some clever one-liners got it done.
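For a flavor of what that cronjob runs, here is a minimal logster parser in the spirit of ours; the log-line regex and metric names are illustrative rather than our exact setup:

```python
import re
from logster.logster_helper import LogsterParser, MetricObject

class LBLatencyParser(LogsterParser):
    def __init__(self, option_string=None):
        self.requests = 0
        self.total_latency = 0.0
        # Assumes the load balancer appends the upstream response time
        # (in seconds) as the final field of each access-log line.
        self.reg = re.compile(r'.* (?P<latency>\d+\.\d+)$')

    def parse_line(self, line):
        match = self.reg.match(line)
        if match:
            self.requests += 1
            self.total_latency += float(match.group('latency'))

    def get_state(self, duration):
        # logster calls this once per run with the elapsed seconds.
        rate = self.requests / float(duration)
        avg = self.total_latency / self.requests if self.requests else 0.0
        return [
            MetricObject('requests_per_sec', rate, 'req/s'),
            MetricObject('avg_latency', avg, 'seconds'),
        ]
```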
We display this graph on a 40″ monitor in the center of our office, which lets us know right away when something is wrong.
We also keep an eye on our error rates by aggregating the proportions of HTTP 2xx, 3xx, 4xx, and 5xx responses across all load balancers.
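If you want to reproduce that kind of graph, Graphite’s asPercent and sumSeries functions do the aggregation server-side. The snippet below builds a render URL; the Graphite host and metric paths are placeholders, not our real naming scheme:

```python
# Illustrative only: build a Graphite render URL plotting each HTTP
# status class as a percentage of all load-balancer responses.
from urllib.parse import urlencode

GRAPHITE = 'http://graphite.internal/render'  # placeholder host
targets = [
    "alias(asPercent(sumSeries(lb.*.status.{0}xx),"
    "sumSeries(lb.*.status.*)),'{0}xx')".format(c)
    for c in (2, 3, 4, 5)
]
params = [('target', t) for t in targets] + [('from', '-24h')]
print('{0}?{1}'.format(GRAPHITE, urlencode(params)))
```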
We’ve been impressed with the scalability of carbon-cache; even with hundreds of machines, we’ve been able to monitor our entire process formation with a single instance of the daemon.
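For context, everything reaching carbon-cache is just a line of text on its plaintext listener (port 2003 by default). A minimal hand-rolled sender looks something like this, with the carbon host and metric name as placeholders:

```python
import socket
import time

def send_metric(name, value, host='carbon.internal', port=2003):
    # carbon-cache's plaintext protocol: "metric.path value timestamp\n"
    line = '%s %f %d\n' % (name, value, int(time.time()))
    sock = socket.create_connection((host, port))
    sock.sendall(line.encode('ascii'))
    sock.close()

send_metric('lb.www1.requests_per_sec', 1234.5)
```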
After that: symbolication
When iOS crashes come in, we receive a numerical core dump of the stack and registers. Our webservers place this numerical crash into an AMQP “symbolication queue,” where it waits for a symbolication worker to dequeue it and match it with symbols from an uploaded dSYM file. Graphing the depth of that queue tells us right away whether the workers are keeping up with the incoming crash stream.
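In rough outline, the worker side looks like the sketch below. The queue name, message fields, and symbolicate() helper are hypothetical stand-ins for our real code, and pika is used here purely for illustration:

```python
import json
import pika

def symbolicate(addresses, dsym_uuid):
    # Hypothetical: map each raw address to a symbol via the uploaded dSYM.
    raise NotImplementedError

def on_message(channel, method, properties, body):
    crash = json.loads(body)
    symbols = symbolicate(crash['addresses'], crash['dsym_uuid'])
    # ... persist the symbolicated stack trace ...
    # Ack only after success, so the broker redelivers on failure.
    channel.basic_ack(delivery_tag=method.delivery_tag)

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='symbolication', durable=True)
channel.basic_consume(queue='symbolication', on_message_callback=on_message)
channel.start_consuming()
```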
We also monitor cache hit/miss rates, database size, and CPU use, among many others.
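As an example of the first of those, python-memcached exposes the raw hit/miss counters directly; the server name below is a placeholder:

```python
import memcache

mc = memcache.Client(['cache1.internal:11211'])  # placeholder host
for server, stats in mc.get_stats():
    hits = int(stats['get_hits'])
    misses = int(stats['get_misses'])
    total = hits + misses
    ratio = 100.0 * hits / total if total else 0.0
    print('%s hit ratio: %.1f%%' % (server, ratio))
```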
Some of our machines use New Relic. It’s a little too expensive to run everywhere, and doesn’t do exactly what we’d want, so we’ve enabled it on a few representative nodes via a Chef node attribute set by hand.
Two minor points worth mentioning that didn’t fit anywhere else:
- Logster supports dimensioned quantities, but as far as we know, Graphite’s Whisper database can’t store units alongside incoming values. As a workaround, we store all quantities in base units (seconds, bytes, bytes/second) and let the visualization tools handle scaling the axes; most latency measurements, for example, display in milliseconds
- Almost everything we measure is some kind of rate, not a fixed quantity (queue depth being one notable exception). Even when the underlying data accumulates, e.g. ifconfig’s byte counts, we convert it to rates to make cross-period analysis easier; a sketch of that conversion follows this list
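Here is that counter-to-rate conversion in miniature; the wrap-around constant assumes a 32-bit counter:

```python
def counter_to_rate(prev_value, curr_value, interval_seconds, max_value=2**32):
    """Return a per-second rate from two counter samples, handling wrap."""
    delta = curr_value - prev_value
    if delta < 0:  # the counter rolled over between samples
        delta += max_value
    return delta / float(interval_seconds)

# e.g. two samples of an interface's RX byte counter, taken 60s apart
print(counter_to_rate(4294000000, 1000000, 60))  # ~32788 bytes/second
```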
We need help! Crittercism is actively searching for a head of operations to come in and take the reins. If you think you’re qualified, please get in touch; my email address is email@example.com.
Thanks for reading!
David Albrecht – Senior Engineer & Chief Pot Stirrer