The secret weapons behind the chartbeat beta
The new version of chartbeat is a big change: in addition to a whole new look, it lets users pivot around any data element, including seeing a history of a referrer or page. To pull this off, we introduced two new technologies to our existing C/Django/MySQL/JS stack, and I wanted to give them a shoutout here.
The first is Google Closure. After working at Google, I came to trust its libraries to be correct and efficient, and code written in the Closure style comes out readable and maintainable and works well with the Closure Compiler. We wanted the UI elements to respond independently, but in common ways, when the data refreshes or the selection state changes. That was very easy to pull off with a Feed base class that inherits from EventTarget and a Widget base class for every widget that listens for Feed changes. If you’re interested, I encourage you to buy access to my friend’s in-progress O’Reilly book.
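To make the Feed/Widget arrangement concrete, here is a minimal sketch of the pattern. It is written in Python rather than Closure-style JavaScript, so it is not our actual code: the EventTarget class is a small stand-in, and the method names (listen, dispatch, on_refresh, on_select) and the TopPagesWidget example are made up for illustration.

```python
from collections import defaultdict


class EventTarget:
    """Tiny stand-in for an event target: register listeners, dispatch events."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def listen(self, event_type, handler):
        self._listeners[event_type].append(handler)

    def dispatch(self, event_type, payload):
        # Copy the list so a handler can register listeners while we iterate.
        for handler in list(self._listeners[event_type]):
            handler(payload)


class Feed(EventTarget):
    """Holds the current data and selection; announces every change."""

    REFRESH = "refresh"
    SELECT = "select"

    def __init__(self):
        super().__init__()
        self.data = {}
        self.selection = None

    def refresh(self, data):
        self.data = data
        self.dispatch(Feed.REFRESH, data)

    def select(self, key):
        self.selection = key
        self.dispatch(Feed.SELECT, key)


class Widget:
    """Base class: subscribes to the feed once, subclasses override the hooks."""

    def __init__(self, feed):
        self.feed = feed
        feed.listen(Feed.REFRESH, self.on_refresh)
        feed.listen(Feed.SELECT, self.on_select)

    def on_refresh(self, data):
        pass

    def on_select(self, key):
        pass


class TopPagesWidget(Widget):
    """Hypothetical widget: lists pages by visitor count, busiest first."""

    def __init__(self, feed):
        super().__init__(feed)
        self.rendered = []
        self.highlighted = None

    def on_refresh(self, data):
        pages = data.get("pages", {})
        self.rendered = sorted(pages, key=pages.get, reverse=True)

    def on_select(self, key):
        self.highlighted = key
```

Each widget only overrides the hooks it cares about, so adding a new widget never touches the feed or the other widgets.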
The second is MongoDB. We save snapshots of a domain’s state every 5 minutes to produce the history of the domain. I spent a long time trying various data stores, including Tokyo Cabinet and a custom Java solution, before realizing that most of our latency in using historical data came from parsing JSON with cjson (up to a minute for a month of snapshots!) rather than from the data transfer. Oops. Profiling hindsight is 20/20. In any case, the fact that Mongo stores data in BSON not only obviates the need for slow parsing in our server, but also lets us transfer back only the subset of each snapshot we’re interested in. We still see some performance issues when data is being paged into memory, but it’s usable where it was unusable before, and once the data is hot, performance is great. The flexibility of using a document store is amazing, and I’m dying to move our user account data to Mongo at some point for flexible addition and removal of properties.
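As a rough illustration of fetching only part of each snapshot, here is a small sketch that builds the filter and projection for a time range. It assumes the snapshot timestamp is stored as integer epoch seconds in _id, and the field names ("pages", "referrers") are hypothetical. The function itself is plain Python and needs no driver.

```python
from datetime import datetime, timezone


def snapshot_query(start, end, fields):
    """Build (filter, projection) for snapshots with start <= timestamp < end.

    Assumes _id holds the snapshot time as integer epoch seconds.
    """
    query = {"_id": {"$gte": int(start.timestamp()), "$lt": int(end.timestamp())}}
    # Listing fields with 1 asks the server to return only those fields
    # (plus _id), so the rest of each snapshot never crosses the wire.
    projection = {field: 1 for field in fields}
    return query, projection


if __name__ == "__main__":
    start = datetime(2010, 1, 1, tzinfo=timezone.utc)
    end = datetime(2010, 1, 2, tzinfo=timezone.utc)
    query, projection = snapshot_query(start, end, ["pages"])
    print(query, projection)
    # With pymongo, these would be passed along the lines of:
    #   collection.find(query, projection)
```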
(For those who might be interested, the schema we’re currently using involves one database per month, with one collection per domain. The timestamp of the snapshot is the key, saving us an extra indexed field. The data lives on striped EBS volumes, with master-slave replication as well as EBS backups taken by pausing the slave. The month divisions require a bit more complexity in the client, but they let us really easily roll off and archive old data by copying the relevant data files and then deleting the database. We were hoping that the locality of collections on disk would make paging in a domain more efficient, but we recently learned that although blocks of rows in a given collection may be on the same extent, the extents comprising a collection may not be contiguous. We’re currently exploring using one document per day with append and upsert, doing daily rollups, or trying to condense our verbose data, but, honestly, what we have is good enough for now. We did see some gains from gzipping data we didn’t need to search over and storing it as binary.)
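Here is a sketch of the extra client-side work the month divisions imply, plus the gzip trick. The database naming scheme ("history_YYYY_MM"), the "blob" field name, and the helper names are all assumptions made for illustration, not our real layout.

```python
import gzip
import json
from datetime import datetime, timezone


def month_db_names(start, end, prefix="history"):
    """List one database name per calendar month touched by start..end."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}_{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def pack_snapshot(ts, searchable, bulky):
    """Build a snapshot document keyed by timestamp.

    Fields we query on stay as normal document fields; everything else is
    serialized to JSON, gzipped, and stored as a single binary value.
    """
    blob = gzip.compress(json.dumps(bulky).encode("utf-8"))
    return {"_id": int(ts.timestamp()), **searchable, "blob": blob}


def unpack_blob(doc):
    """Reverse of pack_snapshot for the compressed part."""
    return json.loads(gzip.decompress(doc["blob"]).decode("utf-8"))


if __name__ == "__main__":
    start = datetime(2009, 11, 15, tzinfo=timezone.utc)
    end = datetime(2010, 1, 3, tzinfo=timezone.utc)
    # A query over this range has to visit each month's database in turn,
    # reading the collection named after the domain in each one.
    print(month_db_names(start, end))
```

Archiving a month then amounts to dealing with a single database, which is the part that makes the extra lookup logic worth it for us.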
Thanks again to the teams behind both of these technologies, both of whom are amazingly responsive on their respective mailing lists.