The naive basics of scaling backend data
I recently had to talk about scaling backend data, so I figured I would write a post about it. There’s a well-defined pattern to scaling backends, which involves moving the bottleneck from IO to CPU to network and back again. I’m only going to talk about well-established patterns for scaling backend data, because optimizing the frontend and throughput is an incredibly nuanced topic.
So let’s say you’re running out of space. Here’s a little flowchart.
Regarding the flowchart, here are some details:
- Buying a bigger box: one of my favorites. Hardware is cheap, well, unless you’re on EC2. You can get 384 GB of RAM for about $15k from Dell. How much developer and maintenance time is that worth? Of course, if you’re on EC2, you might have to pick one of the other solutions.
- Partitioning horizontally: If your data can be partitioned horizontally and you don’t need cross-shard joins, you just need a function that points to where the data resides (see the sharding sketch just after this list). Easy. You could also grab the data from each shard and do the join in the app server if necessary, though that’s a lot of work unless you start with this model.
- Partitioning with a central index: Say you have a site with users and profiles, plus a lot of extraneous data about each user. You can create a set of tables for the user (authentication, list of friends, followers) and put that on one box. Then create an index which points to the datastore each user’s data is on (see the lookup sketch below) and spread the remaining data across multiple shards.
- Moving to a distributed store: Distributed data stores are very tricky to manage. The two on my list are Riak and FoundationDB. Riak has to be run WITHOUT last-write-wins, WITH CRDTs, AND with allow_mult so that concurrent writes get merged on the client (a toy sibling-merge sketch follows after this list). See this Jepsen article for more details. FoundationDB just seems like magic; I’m still testing it.
- Zookeeper: Using an external coordination service like ZooKeeper requires a lot of setup, but ZooKeeper is the go-to tool for cluster coordination (a small client sketch follows below). Here’s a link about SolrCloud and ZooKeeper.
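To make “a function that points to where the data resides” concrete, here’s a minimal routing sketch in Python. The shard list and the `user_id` key are hypothetical; the property that matters is that the routing is deterministic, so every app server computes the same answer for the same key.

```python
import zlib

# Hypothetical shard map; in practice these would be connection strings
# for real database hosts.
SHARDS = [
    "postgres://db0.internal/app",
    "postgres://db1.internal/app",
    "postgres://db2.internal/app",
    "postgres://db3.internal/app",
]

def shard_for(user_id):
    """Deterministically route a key to a shard.

    crc32 is stable across processes and Python versions (unlike the
    built-in hash()), so every app server agrees where a user lives.
    """
    bucket = zlib.crc32(user_id.encode("utf-8")) % len(SHARDS)
    return SHARDS[bucket]

print(shard_for("alice"))  # always routes "alice" to the same shard
```

Note that plain modulo routing makes adding a shard painful, since most keys move; consistent hashing is the usual fix if you expect the shard count to change.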
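For the central-index approach, the index box just holds a small mapping from user to shard. Here’s a sketch using SQLite as a stand-in for the index database; the table and column names are made up:

```python
import sqlite3

# SQLite stands in for the central index box; in production this would
# be its own small, well-replicated database.
index = sqlite3.connect(":memory:")
index.execute("CREATE TABLE user_shard (user_id TEXT PRIMARY KEY, shard TEXT)")

def assign_shard(user_id, shard):
    # Done once at signup, e.g. by picking the least-loaded shard.
    index.execute("INSERT INTO user_shard VALUES (?, ?)", (user_id, shard))

def lookup_shard(user_id):
    row = index.execute(
        "SELECT shard FROM user_shard WHERE user_id = ?", (user_id,)
    ).fetchone()
    if row is None:
        raise KeyError("unknown user: %s" % user_id)
    return row[0]

assign_shard("alice", "db2.internal")
print(lookup_shard("alice"))  # -> db2.internal
```

Unlike hash routing, this lets you migrate a single user by updating one row, but the index itself becomes something you have to keep fast and available.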
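To show what “merge any concurrent writes on the client” means: with allow_mult, a read can hand back several sibling values, and the client has to resolve them. Here’s a toy, library-free illustration (not tied to any Riak client) using a grow-only set, the simplest CRDT, where merging is just set union:

```python
def merge_siblings(siblings):
    """Merge concurrent versions of a grow-only set (a G-Set CRDT).

    Elements are only ever added, so taking the union of all siblings
    loses no writes, regardless of how the updates interleaved.
    """
    merged = set()
    for sibling in siblings:
        merged |= sibling
    return merged

# Two clients wrote concurrently during a partition, producing siblings:
sibling_a = {"alice", "bob"}
sibling_b = {"alice", "carol"}
print(merge_siblings([sibling_a, sibling_b]))  # {'alice', 'bob', 'carol'}
```

Handling deletes takes a richer CRDT (an OR-Set, say), and last-write-wins would instead silently throw away one of the concurrent updates, which is exactly the failure the Jepsen analysis demonstrates.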
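And for a taste of what the coordination looks like from code, here’s a sketch using the kazoo Python client; the ensemble address and paths are placeholders. Each node registers an ephemeral znode that ZooKeeper removes automatically when the node’s session dies, so watchers always see a live membership list:

```python
from kazoo.client import KazooClient

# Placeholder ensemble address and paths.
zk = KazooClient(hosts="zk1.internal:2181")
zk.start()

zk.ensure_path("/nodes")
# Ephemeral + sequential: the znode disappears automatically if this
# process's session dies, so /nodes always reflects live members.
zk.create("/nodes/worker-", b"host=10.0.0.5", ephemeral=True, sequence=True)

# Re-fires the callback every time the membership changes.
@zk.ChildrenWatch("/nodes")
def on_members_change(children):
    print("live nodes:", children)
```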
The flowchart is oversimplified because there are many failure modes when it comes to scaling data. Dealing with an unreliable network is very difficult, and split-brain and master-election problems are a big headache. Even single-master setups are not a no-brainer: if the master gets separated from the slaves, what happens when all the slaves connect back to it at once? The same goes for multi-datacenter applications. Splitting up data and adding more boxes is one thing; dealing with failure modes is an entirely different beast. Although I’m referring mainly to databases, the same applies to file systems. Whenever I’ve waded out into the waters of distributed databases or filesystems, I have been burnt by some failure mode or other. Scaling out might be easy; making sure it’s reliable, not so much.
tl;dr - Flowcharts are awesome, but failure modes are not.