
The naive basics of scaling backend data

I recently had to talk about scaling backend data, so I figured I would write a post about it. There’s a well-defined pattern to scaling backends, which involves moving the bottleneck from IO to CPU to network and back. I’m only going to cover well-established patterns for scaling backend data, because optimizing frontends and throughput is an incredibly nuanced topic.

So let’s say you’re running out of space. Here’s a little flowchart.

graph TD;
A(My data has outgrown my box)-->B{Can you buy a bigger box?};
B-->|Yes|C(Awesome. Buy a new box);
C-->D(Mission Accomplished.);
B-->|No, it's too expensive|E(Ouch. Can you move out logs and other archival data?);
E-->|Yes, we can clean up|D(Mission Accomplished.);
E-->|No, already did that|F(Can you partition horizontally?);
F-->|Yes, we can|D(Mission Accomplished.);
F-->|No, too hard|G(Can you partition by creating an index to point to shards?);
G-->|Yes, we can|D(Mission Accomplished.);
G-->|No, too hard|H(What about Zookeeper for sharding?);
H-->|Yes, we can|D(Mission Accomplished.);
H-->|No, too hard|I(Can you use a distributed data store?);
I-->|Yes, we can|D(Mission Accomplished.);
I-->|No, we're too scared.|K(OK, you're in trouble.);
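To make the "index that points to shards" step a little more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not a recommendation for any particular database: the `ShardDirectory` class, its method names, and shard names like `db-0` are all hypothetical, and the index lives in an in-memory dict where a real setup would keep it in a durable store.

```python
import hashlib


class ShardDirectory:
    """Hypothetical lookup index that maps each key to the shard storing it."""

    def __init__(self, shards):
        self.shards = list(shards)  # shard names are made up for illustration
        self.index = {}             # key -> shard name (in memory only here)

    def _default_shard(self, key):
        # A stable hash, so the same key gets the same default in every process.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def locate(self, key):
        # The first lookup pins the key to a shard; later lookups read the index.
        if key not in self.index:
            self.index[key] = self._default_shard(key)
        return self.index[key]

    def add_shard(self, shard):
        # Existing keys stay where the index says they are.
        self.shards.append(shard)

    def move(self, key, shard):
        # Repoint a key after its data has been copied to another shard.
        if shard not in self.shards:
            raise ValueError(f"unknown shard: {shard}")
        self.index[key] = shard


directory = ShardDirectory(["db-0", "db-1"])
home = directory.locate("user:42")
directory.add_shard("db-2")        # adding a box does not reshuffle old keys
directory.move("user:42", "db-2")  # rebalancing is an explicit index update
```

The point of the extra hop is that adding a box does not silently move existing keys, which plain `hash(key) % n` partitioning would do. The price is that the index itself becomes one more thing that has to stay available and consistent.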

Regarding the flowchart, here are some details:

The flowchart is oversimplified because there are many failure modes when it comes to scaling data. Dealing with an unreliable network is very difficult. Split-brain and master election problems are a big headache. Even single-master setups are not a no-brainer. If the master is separated from the slaves, what happens when all the slaves connect back to the master? The same goes for multi-datacenter applications. Splitting up data and adding more boxes is one thing, but dealing with failure modes is an entirely different beast. Although I’m referring mainly to databases, the same applies to file systems. Whenever I’ve waded out into the waters of distributed databases or filesystems, I have been burnt by some failure mode or other. Scaling out might be easy; making sure it’s reliable, not so much.
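As one illustration of the reconnect question above: if every slave retries on the same schedule, the master gets hit by all of them at the same instants. A common mitigation (my example, not something any particular database is claimed to do) is exponential backoff with random jitter. The sketch below is hypothetical throughout: `reconnect_delays` and its `base` and `cap` parameters are names made up for this post.

```python
import random


def reconnect_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per reconnect attempt.

    Each wait is drawn uniformly between 0 and an exponentially growing
    ceiling, so replicas that lost the master together spread out their
    retries. `base` and `cap` are illustrative values, not tuned ones.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 0.5, 1, 2, 4, ... up to cap
        delays.append(rng.uniform(0, ceiling))
    return delays


# Two replicas with independent random sources get different schedules.
replica_a = reconnect_delays(5, rng=random.Random(1))
replica_b = reconnect_delays(5, rng=random.Random(2))
```

This only spreads out the load of reconnecting. It does nothing for the harder problems in this section, such as split-brain or deciding who the master is.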

tl;dr - Flowcharts are awesome, but failure modes are not.
