The personal website and blog of Ryan Marcus, a graduate student at Brandeis University. Previously worked at HPE Vertica and Los Alamos National Laboratory.
      
    ____                       __  ___                          
   / __ \__  ______ _____     /  |/  /___ _____________  _______
  / /_/ / / / / __ `/ __ \   / /|_/ / __ `/ ___/ ___/ / / / ___/
 / _, _/ /_/ / /_/ / / / /  / /  / / /_/ / /  / /__/ /_/ (__  ) 
/_/ |_|\__, /\__,_/_/ /_/  /_/  /_/\__,_/_/   \___/\__,_/____/  
      /____/                                                    
        
   ___                   __  ___                    
  / _ \__ _____ ____    /  |/  /__ ___________ _____
 / , _/ // / _ `/ _ \  / /|_/ / _ `/ __/ __/ // (_-<
/_/|_|\_, /\_,_/_//_/ /_/  /_/\_,_/_/  \__/\_,_/___/
     /___/                                          
        
   ___  __  ___                    
  / _ \/  |/  /__ ___________ _____
 / , _/ /|_/ / _ `/ __/ __/ // (_-<
/_/|_/_/  /_/\_,_/_/  \__/\_,_/___/                                   
        

Overflow in consistent hashing

Consistent hashing was first proposed in 1997 by David Karger et al., and is used today in many large-scale data management systems, including (for example) Apache Cassandra. Consistent hashing helps reduce the number of items that need to be moved from one machine to another when the number of machines in a cluster changes.

The basic idea is to use two hash functions1 -- one, , which maps each data item to a point on a circle (generally represented as an angle between and ), and another, , which maps each machine to a point on the circle. Each machine is responsible for serving all the data items with hash values that fall between and the machine with the next angle along the circle (clockwise).

ServerLoadItems

In the example above, each server is hashed by to three coordinates on the circle (server 1 is hashed to the east-most point on the circle). Items that fall between server A's hashed position and the next server (moving clockwise) are assigned to server A. Click and drag the data items to change their assigned segment.2

As shown above, this technique generally produces a uniform distribution of data items onto servers. Of course, bad luck might end up putting far more data items onto one server than another. But that's pretty unlikely... right?

As far as I can tell, this problem was first given an efficient answer in 1987 by M.V. Ramakrishna. Ramakrishna phrases the problem as an "urn" problem (paraphrased):

There are balls tossed uniformly at random into one of urns, each urn having a capacity of at most balls. What is the probability that no bin overflows?

It turns out this probability is difficult enough to compute that an instance of it appears in a high-numbered (307) Project Euler problem. This blog post won't go into the details behind efficiently computing the exact probability, but it will: * explore some unintuitive results that arise from these probabilities, * discuss some (very good) approximations that can be computed near-instantly, * provide a handy calculator for computing the probability, * and provide some general pointers for cluster sizing and tuning.

It can't overflow that often, can it?

Since an example is worth a thousand equations, here's an animated figure:

Each time a segment turns red, that segment is overflowing. The table below controls the parameters of the experiment:
ParameterValue
Number of items
Number of bins
Bin capacity
Overflow probability

The probability listed in the table above is computed precisely using Ramakrishna's technique. With 5 servers, and a server capacity of 4, you have a theoretical maximum capacity of 40 items. However, with just 10 items, there's a 16.37% chance of one server overflowing!

Perhaps even more counter-intuitively, with 10 bins of size 3 and 15 items, there's a nearly 50% chance of an overflow! That's even with the theoretical capacity of the cluster being twice as large as the number of items we're trying to store (a load factor of 0.5).

When we talk about hash tables, distributed or otherwise, we often talk about the load factor, which is the number of items divided by the total number of spots an item could occupy. In our case, the load factor is . However, playing with extreme cases makes it clear that there's more to this problem than the load factor. For example, consider two scenarios: 5 bins with a capacity of 2 and 2 bins with a capacity of 5. With 5 items, the overflow probability for the second scenario is zero: no arrangement of items will cause any bin to overflow. The first scenario, however, overflows about 25% of the time.

Next, we'll look at a way to approximate the probability of a bin overflowing. This process will also teach us quite a bit about what is really going on.

Approximations

Ramakrishna works out a few nice approximate closed-form expression for the overflow probability. To build this approximation, we first note that the probability of an arbitrary but fixed bin overflowing is binomial (e.g., the probability of at least successful trials occurring within total trials with success probability ). Letting be the load factor (the number of items divided by the maximum capacity), we can approximate this probability using a Poisson distribution:

When our bin capacity is large, and our load factor is not too close to one, the tail of the distribution will collapse rapidly, so we can approximate this sum by taking the first few terms (and applying some algebra):

Ramakrishna next makes the assumption that each bin overflows independently of other bins (which is definitely false, but remember, we're building an approximation), which allows us to write the probability that no bin overflows as:

Ramakrishna shows that this approximation is very good in the case of and , and larger values of allow for lower values of . In addition to being a useful approximation for calculation, we can also differentiate this approximation to gain a better understanding of how the variables work.

Imagine we had a load factor of 0.5 and . This equation tells us that a unit increase in will result in a 20-fold increase to ! So, for example, if we were to increase our load factor from 0.5 to 0.6, we could expect . Counter-intuitively, this means that larger bin sizes are more sensitive to changes in load factor.

Let's take a closer look at how the load factor, number of items, bin capacity, and number of bins interact. We'll begin by fixing the capacity of each bin at 20. The sliders below allow you to vary the number of bins represented by each curve.

Fixed bin capacity (20)

CurveBins
Red
Green
Blue

The graph can be a little perplexing at first glance. To interpret it, first set the values to 5, 25, and 200. Notice that the red curve, representing 5 bins, at a load factor of 0.8, has an overflow probability of around 50%. This means that in a cluster configuration with 5 nodes, a node capacity of 20, and 80 items, there is (approximately) a 50% chance of a node overflowing. Compare this to the blue curve, representing 200 bins -- if I have the same node capacity (20) and the same load factor of 0.8 (implying that I have 3200 items), my overflow probability is practically one. This is very counter-intuitive if you are used to thinking of the load factor as something that should be kept constant as a system grows. However, as we've seen here, keeping the load factor constant as you increase the number of nodes and data items will cause your overflow probability to skyrocket.

Imagine you have 5 cups, each currently filled with 8 pieces of candy. You instruct your friend, a computer scientist, to take the remaining 10 pieces of candy and evenly distribute them into each cup, such that each cup will have 10 pieces of candy. The computer scientist, having recently taken a data structures course, begins dropping each remaining piece of candy into a cup at random. Confused and unsatisfied, you are forced to redistribute the candies yourself (the computer scientist's explanation of an "amortized cup" does not comfort you).

Clearly, the computer scientist's strategy would've worked if your friend had gotten a little lucky. But exactly how lucky? Well, getting it exactly right would've been like rolling 10 dice (each representing a piece of candy), each with 5 sides (each representing a cup), and getting exactly two "ones", two "twos", two "threes", two "fours", and two "fives." We know that there are possible ways the dice could land, and we know that the total number of ways the dice could land in the desired configuration is:

Taking the quotient, we end up with a success rate of around 1%. Not very good odds.3

Now imagine that instead of 5 cups filled with 8 pieces of candy with 10 pieces of candy leftover, you instead have 50 cups, each with 8 pieces of candy, with 100 pieces of candy still to be distributed. Each cup is still 80% full, but do you think your friend's odds of success have changed? Now, your friend must roll exactly 20 "ones", 20 "twos", 20 "threes", 20 "fours", and 20 "fives." Again, there are different ways for the dice to land. This time, the number of valid configurations is:

Since we're starting to get some 60+ digit numbers, we'll let Wolfram Alpha do the math for us. The result? A smashing 0.013% success rate. Just like keeping the cup only 80% full didn't keep the success probability the same, keeping the load factor the same for consistent hashing schemes of different scales doesn't give you the same overflow probability.4

Next, we will fix the number of bins, and instead alter the bin capacity.



Fixed number of bins (50)

CurveBin Capacity
Red
Green
Blue

Interpreting the graph is much the same. First, reset the graph to its default values. Notice that, with a bin capacity of 5 (red line), the overflow probability goes above 10% at about a load factor of 0.25. This corresponds to around 63 items. However, with a bin size of 25 (green line), the overflow probability does not go over 10% until a load factor of 0.55! This corresponds to around 688 items. So, by increasing each bin size by a factor of 5 (from 5 to 25), the total number of items I could store (with a fixed overflow probability) went up by a factor of 10!

We can also learn quite a bit from the slopes of the curves. As we predicted before, based on the derivative of the approximation, the blue curve (representing a bin size of 200) has a much steeper slope than the red curve (representing a bin size of 5). So, on the one hand, increasing bin capacity gives a very nice improvement to the maximum load factor that can be tolerated, but it also makes the system more sensitive to fluctuations in that load factor.

Handy calculator

ParameterValue
Number of items
Number of bins
Bin capacity
Result type exact
Overflow probability 0.2095

The calculator will use an exact method when the inputs are small enough, and fall back to using an estimate when the values become too large.

Takeaways & suggestions

When you are using a consistent hashing system, it's good to keep overflows in mind. Unless any node can hold all your data, there's a possibility that you'll have overflow.

Since there's (almost) always going to be some chance of overflow, you should probably do something about it. Google has put out an interesting paper showing that a simple strategy of "pushing" extra items from one bin to the next doesn't require too many "pushes." Vimeo put this idea into practice to great success.

Regardless of how you handle them, overflowing nodes are never a good thing. First, decide what your system needs: tolerance to changes in load factors, or maximal storage with low overflow probability. If you need the former, build many, smaller nodes. If you need the later, build larger nodes, but keep in mind that a small change in load factor may cause your overflow rate to skyrocket!

This problem gets more interesting when you consider what might happen in a heterogeneous cluster... it certainly makes the math a lot more complicated!


  1. I'll assume that both hash functions are perfectly uniform, which, of course, is never the case in practice. 

  2. When you click and drag an item, the figure will neatly re-arrange each segment such that each data item is evenly spaced. In practice, the hash values of each data item is unlikely to spread across the segment so nicely. 

  3. If you'd like, you can confirm the calculation using this simulation

  4. More skeptical readers may claim that the load factor in this example is really 100%, not 80% -- the argument holds equally well in both cases.