Ryan Marcus, assistant professor at the University of Pennsylvania. Using machine learning to build the next generation of data systems.
    ____                       __  ___
   / __ \__  ______ _____     /  |/  /___ _____________  _______
  / /_/ / / / / __ `/ __ \   / /|_/ / __ `/ ___/ ___/ / / / ___/
 / _, _/ /_/ / /_/ / / / /  / /  / / /_/ / /  / /__/ /_/ (__  )
/_/ |_|\__, /\__,_/_/ /_/  /_/  /_/\__,_/_/   \___/\__,_/____/
      /____/
        
   ___                   __  ___
  / _ \__ _____ ____    /  |/  /__ ___________ _____
 / , _/ // / _ `/ _ \  / /|_/ / _ `/ __/ __/ // (_-<
/_/|_|\_, /\_,_/_//_/ /_/  /_/\_,_/_/  \__/\_,_/___/
     /___/
        
   ___  __  ___
  / _ \/  |/  /__ ___________ _____
 / , _/ /|_/ / _ `/ __/ __/ // (_-<
/_/|_/_/  /_/\_,_/_/  \__/\_,_/___/
        

Compiled and vectorized evaluation in columnar engines

How do modern columnar database management systems process a query like this?

SELECT SUM(t.c) FROM t WHERE t.a > 5 AND t.b < 7;

Columnar systems today are generally either use compiling query engines or vectorized query engines.

First, we’ll look at the strategies used by both engine types, and measure their performance over random data (both a > 5 and b < 7 will have 50% selectivity – more on that later). Then, we’ll analyze some of the surprises we find along the way. This blog post is a microcosm of the excellent 2018 VLDB paper, “Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask.” by Kersten et al.

Compiling engine strategy

Most compiling engines, like Amazon Redshift or Tableau Hyper, follow the code generation technique described in Thomas Neumann’s 2011 paper, “Efficiently compiling efficient query plans for modern hardware.” The code generated for a query like ours would be pretty close to what you might write by hand. Ignoring all the scaffolding required to read data, handle multiple chunks, etc., a compiling execution engine might generate code like this:

pub fn compiled_engine(a: &[i32], b: &[i32], c: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .zip(c)
        .filter_map(|((a_el, b_el), c_el)| {
            (*a_el > 5 && *b_el < 7).then_some(c_el)
        }).sum()
}

If you check the generated assembly (make sure to use optimizations and set the target CPU to something modern!), you’ll see that Rust produces a compact, short-circuiting loop with SIMD instructions for the summation. At first glance, it also seems like this is just about the optimal code we could write to answer this query (aside from system-level changes like parallelism or index structures).

Doesn't Rust do bounds checking?

You might wonder if Rust’s iterators or bounds checking make this code slower than the “C style” version using a for-loop. In this case, the compiler is smart enough to optimize away the bounds check. You can check yourself if you’d like, using this unsafe code:

pub fn compiled_engine(a: &[i32], b: &[i32], c: &[f32]) -> f32 {
    unsafe {
        let mut accum = 0.0;
        for i in 0..a.len() {
            if *a.get_unchecked(i) > 5 && *b.get_unchecked(i) < 7 {
                accum += *c.get_unchecked(i);
            }
        }
        accum
    }
}

Vectorized engine strategy

Vectorized engines, like Actian Vector, MonetDB, and DuckDB, don’t generate or compile code, but instead use hand-optimized “kernel functions” that operate on large chunks of data at once (primarily following Peter Boncz et al.’s X100 paper). There are two ways a vectorized engine might handle our query: full evaluation and partial evaluation.

Full evaluation is the most straightforward, and may even seem naive at first. At a high level, the idea is to build one bitmap for a > 5, another bitmap for b < 7, then bitwise-AND the two bitmaps together. Finally, we can iterate over the set bits in the resulting bitmap and sum up the corresponding entries of c.

pub fn vec_engine_full(a: &[i32], b: &[i32], c: &[f32]) -> f32 {
    // kernel 1: compare column with constant, store result in bitmap
    let a_bm = BooleanBuffer::collect_bool(a.len(), |idx| a[idx] > 5);

    // kernel 2: compare column with constant, store result in bitmap
    let b_bm = BooleanBuffer::collect_bool(b.len(), |idx| b[idx] < 7);

    // kernel 3: bitwise AND of two bitmaps
    let res = &a_bm & &b_bm;

    // kernel 4: sum of a vector at positions given by bitmap
    BitIndexIterator::new(res.values(), res.offset(), res.len())
        .map(|idx| c[idx])
        .sum()
}

Here, we use Apache Arrow’s BooleanBuffer to construct and iterate over the bitmaps. At first glance, one might think this code is a significant regression from the compiled code. Not only are we doing “extra work” by testing entries of b that the compiled version gets to skip via short-circuiting, but we’re also allocating two ~125KB data structures. Well, vectorized database engine designers had the same thought, so they also built kernels for partial evaluation.

Partial evaluation avoids the wasted work by building vectors (sometimes called selections or selection vectors) of indexes that track which positions are not filtered out.1 Once the vector of indexes that pass the a > 5 filter is computed, we can check just those indexes for b < 7, and then compute the sum over c.

pub fn vec_engine_partial(a: &[i32], b: &[i32], c: &[f32]) -> f32 {
    // kernel 1: find indexes of a where the predicate is true
    let sel: Vec<usize> = a
        .iter()
        .enumerate()
        .filter_map(|(idx, a_el)| (*a_el > 5).then_some(idx))
        .collect();

    // kernel 2: narrow input indexes by predicate
    let sel: Vec<usize> = sel
        .into_iter()
        .filter_map(|idx| (b[idx] < 7).then_some(idx))
        .collect();

    // kernel 3: sum using indexes
    sel.into_iter().map(|idx| c[idx]).sum()
}

Unfortunately, I could not for the life of me get the compiler to properly vectorize this code2, so I had to write the SIMD manually. This is totally fine for a vectorized database, since every kernel is expected to be hand-tuned. You can find the SIMD implementation here. But the SIMD version simply does what the above code does, just more efficiently.

Bakeoff #1: 50% Selectivity

OK, how do our times compare? Let’s evaluate over a million randomly generated rows, generating the data so that a > 5 and b < 7 have 50% selectivity. This benchmark was ran on my laptop, equipped with a i7-1370P.

predicates/compiled
    time:   [4.2265 ms 4.2431 ms 4.2578 ms]
    thrpt:  [234.86 Melem/s 235.68 Melem/s 236.60 Melem/s]

predicates/vec_full
    time:   [852.99 µs 858.36 µs 863.24 µs]
    thrpt:  [1.1584 Gelem/s 1.1650 Gelem/s 1.1723 Gelem/s]

predicates/vec_partial
    time:   [1.4747 ms 1.4793 ms 1.4830 ms]
    thrpt:  [674.32 Melem/s 675.98 Melem/s 678.09 Melem/s]

Bar graph of throughput values

Wow! It might be quite surprising that the vectorized “full” strategy is not just the fastest, but is the fastest but a wide margin. What’s going on here? Since some of the fastest databases in the world use a compiled engine, we should be quite surprised by this seemingly-poor showing.

Let’s try and figure out why our compiled engine is doing so poorly compared with the vectorized strategies. After all, the compiled code doesn’t create any extra data structures, and the compiled code avoids all wasted work via short-circuiting. We can use perf and top-down analysis to figure out what is going on.

Performance counter stats for 'cargo bench compiled':
 12K  page-faults:u           #  830.127 /sec
 74G  cycles/u                #    5.111 GHz
 15G  instructions/u          #    0.46  insn per cycle ❶
  6G  branches/u              #  462.162 M/sec
  2G  branch-misses/u         #   26.75% of all branches ❹
  TopdownL1 (cpu_core)        #    9.9 %  tma_backend_bound
                              #   37.3 %  tma_bad_speculation ❷
                              #   47.7 %  tma_frontend_bound ❸
                              #    5.0 %  tma_retiring

The first thing we notice is that ❶ instructions per cycle is quite low. This means that our CPU is not doing useful work most of the time; instead, the CPU might be recovering from branch mispredictions, or might be stuck waiting for memory accesses (either data or instruction). Second, when looking at the topdown analysis, we see very high ❸ frontend_bound and ❷ bad_speculation. Since frontend_bound includes instruction misses from branch resteers, and our bad_speculation is also pretty high, we can conclude that the primary bottleneck here is likely branch misprediction – ❹ about 26% of branches are missing mispredicted. Instruction count could also be a problem, but the presence of so much bad speculation is a “red flag” that should catch our eye first.

Why is high bad speculation a red flag?

When bad speculation occurs, it could be counted as either a branch resteer (loading a different set of instructions), or a branch misprediction (undoing the effects of an incorrect speculation). Thus, when bad speculation is high, we don’t really know if the true problem is in the front end or the back end. As Ahmad Yasin puts it in “A Top-Down Method for Performance Analysis and Counters Architecture”:

Having Bad Speculation category at the Top-Level is a key principle in our Top-Down Analysis. It determines the fraction of the workload under analysis that is affected by incorrect execution paths, which in turn dictates the accuracy of observations listed in other categories. Furthermore, this permits nodes at lower levels to make use of some of the many traditional counters, given that most counters in out-of-order CPUs count speculatively. Hence, a high value in Bad Speculation would be interpreted by the user as a “red flag” that need to be investigated first, before looking at other categories. In other words, assuring Bad Speculation is minor not only improves utilization of the available resources, but also increases confidence in metrics reported throughout the hierarchy.

Let’s compare this with a top-down analysis of the vectorized full code (using bitmaps):

Performance counter stats for 'cargo bench bitmap':
 12,260  page-faults:u   #  542.632 /sec
 101G    cycles/u        #    4.485 GHz
 381G    instructions/u  #    8.35  insn per cycle ❶
 63G     branches/u      #    2.829 G/sec
 443M    branch-misses/u #    2.41% of all branches
 TopdownL1 (cpu_core)    #   22.3 %  tma_backend_bound ❸
                         #    6.0 %  tma_bad_speculation
                         #   12.4 %  tma_frontend_bound
                         #   59.2 %  tma_retiring ❷

We can immediately see that our bitmap code is making better use of the CPU. First, instead of executing ½ an instruction per cycle, ❶ we’re now executing over 8 instructions per cycle! Our topdown analysis shows that most of our time is ❷ going to “retiring” instructions, meaning that the instructions are leaving the CPU pipeline normally! Our next biggest bottleneck is ❸ “backend,” means that instructions are waiting on resources (like memory reads) in order to proceed.

It’s important to note that higher instructions per cycle (IPC) is not always better. Higher IPC more instructions are making it through the CPU, but two different techniques may require a different number of instructions! In our case, the bitmap approach requires some extra “work” to be done (bitmaps must be allocated, constructed, compared, and destroyed), whereas the compiled approach does very little extra work. The reason that the bitmap approach performs better is not because it has an higher IPC, but because it does work in a way that is better aligned with what the CPU is expecting.

Bakeoff #2: 0.04% Selectivity

If we want to make the compiled version of the code look as good as possible, we note two things:

  1. The compiled version only gets to avoid work if the a > 5 predicate is false – that is, if the conditional short-circuits.
  2. The compiled version had tons of branch misprediction, so making the branch behavior more consistent should improve performance.

We can maximize the number of short-circuits and make the branch behavior more consistent by lowering the selectivity of a > 5. Keeping everything else the same, we’ll lower the selectivity of a > 5 to 0.04:

predicates/compiled
    time:   [152.77 µs 153.70 µs 154.50 µs]
    thrpt:  [6.4723 Gelem/s 6.5063 Gelem/s 6.5456 Gelem/s]

predicates/vec_full
    time:   [401.64 µs 403.45 µs 405.38 µs]
    thrpt:  [2.4668 Gelem/s 2.4786 Gelem/s 2.4898 Gelem/s]

predicates/vec_partial
    time:   [195.99 µs 197.01 µs 197.92 µs]
    thrpt:  [5.0524 Gelem/s 5.0759 Gelem/s 5.1023 Gelem/s]

Bar graph of throughput values

And, under some circumstances, this code is indeed essentially optimal!

To understand why this code might sometimes perform well or poorly, it is important to make a few observations about this query and the compiled_engine approach:

  1. Memory bound. We’re doing very little with each value that we read from memory. Essentially, we are performing two inequalities and one addition (both cheap operations) per three memory reads. We can thus hypothesize that our code will be bound by the speed of memory, either by the “front end” (loading new instructions) or the “back end” (reading data).
  2. Branching. The short-circuiting evaluation of *a_el > 5 && *b_el < 7 means that we may have a branch. I say “may” because the optimizer could choose to rewrite this code in a branchless fashion, but branching is something we should think about. This is also our first hint that the performance of this code might have something to do with how often a > 5 is true or false.
  3. Inconsistent memory access. We always read the entries of a in order, but we might access b sparsely (since we only access b if a > 5), and c even more sparsely (since we only access c is both conditions are true).

We can confirm or deny our hypotheses using perf stat. If we run with a highly selective a > 5 (that is, most rows are filtered out):

predicates/compiled time:   [144.20 µs 144.76 µs 145.35 µs]
Performance counter stats for 'cargo bench compiled':
 19,254.19 msec task-clock:u       #  1.306 CPUs utilized
      81K          page-faults:u   #  4.239 K/sec
      80G          cycles/u        #  4.158 GHz              (93.14%)
      292G         instructions/u  #  7.34  insn per cycle   (93.14%)
      126G         branches/u      #  6.591 G/sec            (93.14%)
      204,605,913  branch-misses/u #  2.13% of all branches  (93.14%)
      TopdownL1                    #    3.9 %  tma_backend_bound
                                   #    5.5 %  tma_bad_speculation
                                   #   14.6 %  tma_frontend_bound
                                   #   76.0 %  tma_retiring

Compared to if we use a non-selective a > 5 (zero rows are filtered):

predicates/compiled time:   [3.6357 ms 3.6439 ms 3.6532 ms]
Performance counter stats for 'cargo bench compiled':
 16,432.68 msec task-clock:u       #   0.999 CPUs utilized
  12,161      page-faults:u        # 740.050 /sec
     79G      cycles/u             #   4.851 GHz              (99.85%)
     30G      instructions/u       #   0.89  insn per cycle   (99.85%)
     13G      branches/u           # 815.276 M/sec            (99.85%)
      2G      branch-misses/u      #  24.00% of all branches  (99.85%)
         TopdownL1 (cpu_core)      #     9.6 %  tma_backend_bound
                                   #    49.9 %  tma_bad_speculation
                                   #    31.2 %  tma_frontend_bound
                                   #     9.3 %  tma_retiring

When the predicate over a is selective, we get 7.34 instructions per cycle, but when the predicate is not selective, we drop to 0.89 instructions per clock cycle! This reflects in the total time taken to process 1M rows going from 144 microseconds to 3.6 milliseconds! Luckily, perf also points us to the culprit: the topdown analysis shows that, in the slow case, we’re spending 50% of our time recovering from bad speculations (i.e., branch mispredictions). In the fast case, only 5% of total time is spent on mispredictions.

Bakeoff #3: End-to-end optimizations

Compiling execution engines really shine when the compiler (e.g., LLVM) can do something smarter than just lowering our code to machine instructions. In the above examples, LLVM was able to use SIMD instructions for some of our code, unroll loops, etc., but it was never able to fundamentally change the shape of the compilation for this particular query. But what if our query was instead:

SELECT COUNT(*) FROM t WHERE t.a > 5 AND t.b < 7;

… now we only need to count how often both predicates are true. Our vectorized code might look like this:

pub fn vec_count(a: &[i32], b: &[i32], _c: &[f32]) -> usize {
    let a_bm = BooleanBuffer::collect_bool(a.len(), |idx| a[idx] > 5);
    let b_bm = BooleanBuffer::collect_bool(b.len(), |idx| b[idx] < 7);
    let res = &a_bm & &b_bm;
    res.count_set_bits()
}

… and the compiler-generated code might look like this:

pub fn compiled_count(a: &[i32], b: &[i32], _c: &[f32]) -> usize {
    a.iter()
        .zip(b)
        .filter(|(a_el, b_el)| **a_el > 5 && **b_el < 7)
        .count()
}

In this case, LLVM can lower this code into beautiful SIMD code that loads 16 elements from a, 16 elements from b, compares them, and then counts the number of set bits in the resulting mask and adds that count to an accumulator.

This pays off – at any selectivity, the compiled code operates at around rows/sec, compared to the vectorized code which operates at around rows/sec. But such optimizations are not always possible.

The best of both worlds?

Can we get the best of both vectorized and compiled execution engines? Surely, a compiling engine could generate the same code that a vectorized engine uses, so why not? A paper by Menon et al. called “Relaxed Operator Fusion” explores this idea. The short answer is that you can, but it’s not as simple as either strategy alone.

Database, and especially their execution engines, are filled with surprising performance “gotchas” and highly-optimized tricks that have been passed down from system to system. If you’d like, you can read some of my work on engineering a hash table for GROUP BY aggregation, or check out the excellent everything you’ve always wanted to know about vectorized and compiled execution engines, but were afraid to ask.

Notes

  1. You can technically use a partial strategy with bitmaps instead of selection vectors, and you can use selection vectors instead of bitmaps in a full strategy. But intersecting bitmaps is generally cheaper than intersecting vectors, and iterating over (sparse) vectors is generally cheaper than iterating over bitmaps. Of course, there are exceptions. 

  2. Vectorizing the selection vector code gets significantly easier on AVX-512 hardware due to the “compress store” primitives, but unfortunately, I’m trapped in AVX2 land. Vectorizing this code with AVX2 requires precomputing a permutation lookup table, which is probably why the compiler doesn’t do it automatically.