Fallthrough Sort: Quickly Sorting Small Sets

In many applications, such as a median filter, we want to sort a small (n < 30) set of numbers. In the case of the median filter, we are only concerned with sorting sets of one exact size -- if this is the case, one can generate an optimal sorting network using a tool like this one to create a provably-unbeatable solution.
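
For example, the optimal network for exactly three elements uses just three compare-exchange operations. A minimal sketch in C (the cswap helper is illustrative, not the output of such a tool):

/* sort exactly 3 elements with the optimal 3-comparator network */
static void cswap(int* a, int* b) {
	if (*a > *b) { int t = *a; *a = *b; *b = t; }
}

void sort3(int* v) {
	cswap(&v[0], &v[1]); /* larger of the first two moves right */
	cswap(&v[1], &v[2]); /* the maximum settles at v[2] */
	cswap(&v[0], &v[1]); /* order the remaining two */
}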

However, we often want to sort sets of varying size that are still small: perhaps to implement a 3x3, 5x5, and 7x7 median filter using a single sorting function, or to sort an arbitrary list of files, of which there could be very many or very few.

In this case, we can utilize a special implementation of bubble sort that takes advantage of switch statement fallthrough. To quickly sort sets of size n \leq 9, we could use this C code:

#include <stdlib.h>
#define min(a, b) ((a) < (b) ? (a) : (b))
#define max(a, b) ((a) > (b) ? (a) : (b))
#define exch(a, b) temp = (a); (a) = min(temp, (b)); (b) = max(temp, (b));
#define exch3(a, b, c) exch(a, b); exch(b, c);
#define exch4(a,b,c,d) exch3(a,b,c); exch(c,d);
#define exch5(a,b,c,d,e) exch4(a,b,c,d); exch(d,e);
#define exch6(a,b,c,d,e,f) exch5(a,b,c,d,e); exch(e,f);
#define exch7(a,b,c,d,e,f,g) exch6(a,b,c,d,e,f); exch(f,g);
#define exch8(a,b,c,d,e,f,g,h) exch7(a,b,c,d,e,f,g); exch(g,h);
#define exch9(a,b,c,d,e,f,g,h,i) exch8(a,b,c,d,e,f,g,h); exch(h,i);

int cmpfunc (const void * a, const void * b) {
	return ( *(int*)a - *(int*)b );
}
// quickly sort an array if size is less than or equal to 9, otherwise use
// stdlib's qsort to sort the array.
void fallthroughSort(int* array, int length) {
	int temp;
	switch (length) {
	case 9:
		exch9( array[0],array[1],array[2],array[3],array[4],array[5],array[6],array[7],array[8] );
	case 8:
		exch8( array[0],array[1],array[2],array[3],array[4],array[5],array[6],array[7] );
	case 7:
		exch7( array[0],array[1],array[2],array[3],array[4],array[5],array[6] );
	case 6:
		exch6( array[0],array[1],array[2],array[3],array[4],array[5] );
	case 5:
		exch5( array[0],array[1],array[2],array[3],array[4] );
	case 4:
		exch4( array[0],array[1],array[2],array[3] );
	case 3:
		exch3( array[0],array[1],array[2] );
	case 2:
		exch(array[0], array[1]);
		break;
	default:
		qsort(array, length, sizeof(int), cmpfunc);
	}
}

Each call to exchN performs one bubble sort pass over the first N elements (indices 0 through N-1), bubbling the largest of those elements into position N-1. Any array of size n \leq 9 will jump to the proper pass of bubble sort, and execute all the remaining passes by falling through the switch statement.
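
For example, a five-element array enters the switch at case 5 and falls through cases 4, 3, and 2, executing four bubble sort passes in total:

int a[] = { 4, 1, 3, 2, 0 };
fallthroughSort(a, 5); /* runs exch5, exch4, exch3, exch */
/* a is now { 0, 1, 2, 3, 4 } */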

You might be skeptical as to how an O(n^2) algorithm is outperforming an O(n \log n) algorithm, but remember that big-O notation only describes asymptotic behavior. On small inputs, the actual performance of an algorithm is often dominated by the constants hidden by big-O notation, as has been exhaustively discussed elsewhere.

Of course, code similar to that above can be generated for any size, and a Python script to do just that is available in this BitBucket repository. We'll take a look at the performance of this algorithm.

The following graph shows the performance of fallthrough sort (for n \leq 9) compared to the standard library's qsort function. Both functions sorted 10^7 randomly generated lists of size n = 9.

[Graph: total time to sort 10^7 random lists of size 9, fallthrough sort vs. qsort]

As you can see, fallthrough sort provides a substantial speed boost. Obviously, the difference is negligible if you're only sorting one list.

This next graph shows the performance of fallthrough sort (for n \leq 25) compared to the standard library's qsort function when sorting 10^7 lists of varying sizes.

[Graph: total time to sort 10^7 lists of sizes 1 through 25, fallthrough sort vs. qsort]

As the number of elements increases, fallthrough sort seems to slow down and approach the speed of qsort (eventually, qsort will become faster). This is expected, given the asymptotic behavior of each algorithm.

This last graph shows how many times faster fallthrough sort is compared to qsort.

[Graph: speedup of fallthrough sort relative to qsort, by list size]

Here, the asymptotic behavior of both algorithms is extremely clear: qsort scales much better than fallthrough sort.

One might consider this comparison unfair because qsort evaluates a user-supplied comparison function. However, looking at the output of GCC reveals that when a very standard function (like the one above) is used, the comparison function is inlined.

One might also consider creating a branch table to jump right to the required pass of bubble sort. Once again, looking at optimized GCC output will show that the above switch statement is optimized into a branch table.

You can view a sample implementation and benchmark code over at BitBucket.

Reddit user sltkr seems to be getting better results from a simple insertion sort. This appears to be due to CPU differences, but the discussion is ongoing.

TopNTree — A merge-sort inspired data structure

Recently, I came across a problem in which I had to store contacts in an instant message application. I needed to be able to display the list of contacts to the user in alphabetical order, and I needed to be able to retrieve a contact object quickly. One data structure that accomplishes all this is a simple binary tree -- an in-order traversal of the tree will produce the list of contacts in alphabetical order, and a balanced binary tree provides O(\log n) search time.

However, I also needed to be able to query the k most frequently contacted contacts. Since contacts are sorted by their names, finding the k most frequently contacted contacts would require visiting every node and sorting by frequency, an O(n \log n) operation.

To solve this problem, I created an augmented binary tree which provides an insertion time of O(k \log n), a search time of O(\log n), and can find the top k contacts in O(1). You can view a Java implementation of the code here. The code is license-free, and ought to be considered public domain.

In order to accomplish this, each node is augmented with an array of size k that stores, in sorted order, the k largest keys in its left and right subtrees, including itself. Leaf nodes (nodes with no children) thus contain an array with only their own value.
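
Concretely, a node might look something like this (a minimal sketch with illustrative names, not the exact class from the linked implementation; Entry is the contact/frequency pair used in the pseudo-code further down):

// an augmented tree node: topK caches the k largest entries in this
// node's subtree (including the node itself), in sorted order
class Node {
   Entry entry;      // the contact stored here (name is the sort key)
   Node parent, left, right;
   Entry[] topK;     // the k largest entries in this subtree
}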

When a node \alpha is inserted and becomes the left or right child of \beta, \beta's array is merged with \alpha's array, with the result stored into \beta's array. This merging can be done in linear time, since both arrays are already sorted. After this, \beta's parent is updated, and then the parent's parent, all the way up to the root of the tree.
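
In code, the upward propagation might look like this (a sketch that assumes parent pointers; an implementation without them could perform the merges while walking back down from the root). Here merge folds the child's array into the parent's, as in the pseudo-code below:

// after an insertion, fold each node's top-k array into its parent's,
// all the way to the root; each merge is O(k), and a balanced tree has
// O(\log n) nodes on the path, giving the O(k \log n) insertion bound
void propagateUp(Node inserted) {
   for (Node cur = inserted; cur.parent != null; cur = cur.parent) {
        merge(cur.parent, cur); // cur.parent's topK absorbs cur's
   }
}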

Pictures are great. Consider a tree with one node, for a contact Dave who has been contacted 10 times.

[Figure: one node]

Then, we add another node for Betsy, who has been contacted 4 times.

[Figure: two nodes]

The array stored in the node representing Betsy gets merged with the parent node, Dave. Inserting another node:

[Figure: three nodes]

Once again, new nodes start out with only themselves in their array, and arrays are merged upwards from the node inserted.

[Figure: four nodes]

Another node:
[Figure: five nodes]

From here, fetching the k most contacted contacts can be done in constant time, because it simply requires the retrieval of the array from the root node.

Note that in order to maintain O(k \log n) time, the tree must be kept balanced. In my implementation, I use an AVL tree. For others seeking to use an AVL tree, make sure that, after any rotation, you properly update each node rotated. The simplest solution is to clear the arrays of merged nodes, and recalculate them by merging their immediate children.
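
A sketch of that recalculation, using the same illustrative names as above (merge is assumed to cap the parent's array at k entries):

// rebuild a rotated node's array from scratch: start from the node's
// own entry, then fold in each child's (already correct) array;
// call this on the lower rotated node first, then the upper one
void recalculate(Node n) {
   n.topK = new Entry[] { n.entry };
   if (n.left != null) merge(n, n.left);
   if (n.right != null) merge(n, n.right);
}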

In case the similarity to merge sort is not clear, here is some pseudo-code for the merging:

function merge(ParentArray a, ChildArray b) {
   int myPos = a.size - 1;
   int theirPos = b.size - 1;

   // fill the new array from the back, taking the larger remaining
   // candidate from either array on each step
   Entry[] newObjs = new Entry[a.size];
   for (int i = a.size - 1; i >= 0; i--) {
        if (myPos < 0 && theirPos < 0) break;  // both arrays exhausted
        if (theirPos < 0 || (myPos >= 0 && a[myPos] > b[theirPos])) {
             newObjs[i] = a[myPos];
             myPos--;
        } else if (myPos >= 0 && a[myPos] == b[theirPos]) {
             // the same entry arrived via both arrays; keep one copy
             newObjs[i] = a[myPos];
             myPos--;
             theirPos--;
        } else {
             newObjs[i] = b[theirPos];
             theirPos--;
        }
   }
   a.contents = newObjs;  // the merged top-k becomes the parent's array
}

This process, run on arrays of size 5, is depicted below. Initially, the new array is empty.
[Figure: merge, step 1]

On each iteration, the larger of the two candidate values (one from the parent's old array, one from the child's array) is selected, ensuring that the new array is properly ordered.
[Figure: merge, step 2]

Note that the pointer variable (represented by the diamond) moves on every iteration for the top array, but only one of the two bottom pointers moves, except in the case where both bottom arrays hold an equal value, in which case both move.

[Figure: merge, step 3]

In each iteration, the value to be placed in the top array at the diamond is determined in constant time by simply picking the larger of the two values pointed to in the lower arrays. This means that the merge operation runs in O(k) time.

[Figure: merge, step 4]

Unlike in a traditional merge sort, the merge process can be halted when the top array is full, because we only care about the largest k entries.

[Figure: merge, step 5]

While the TopNTree provides, in my opinion, a clean solution to this particular problem, there are other possible solutions. The first thing I thought of was to simply maintain a tree of contacts sorted by name alongside a heap of contacts ordered by how frequently they are contacted. If the most frequently contacted contacts only need to be calculated once, this might save some time; but if a contact's frequency must be updated (a common event in a communication application), one would have to iterate over the entire heap to find the entry. The following table shows the differences in upper bounds for the different techniques. Note that in all cases k \leq n, in most cases k < n, and in my case k \ll n.

Method          Insert        Search      Getting most contacted   Update
TopNTree        O(k \log n)   O(\log n)   O(1)                     O(k \log n)
Tree and heap   O(\log n)     O(\log n)   O(k)                     O(n)
Tree            O(\log n)     O(\log n)   O(n \log n)              O(n)
Heap            O(\log n)     O(n)        O(k)                     O(\log n)

For me, the search time of the heap and the "getting most contacted" time of the tree made those solutions impractical, so I compared the TopNTree to a tree and a heap. Since the size of k was both constant and very low (5, for my listing of most popular contacts), I decided to use the TopNTree.

Because big-O analysis, in general, is not to be trusted (it hides the constant factors that often dominate in practice), I created a synthetic benchmark that emphasized the operations I was concerned with. The results are graphed below.

[Graph: synthetic benchmark results for the TopNTree, tree-and-heap, tree, and heap approaches]

Obviously, no synthetic benchmark beats real-world testing. I hope the TopNTree performs well!

Reddit user JustSomeBadAdvice points out that using a heap and a tree where the nodes of each are the same (or contain pointers to each other) would produce similar, and possibly faster, times at the cost of more space.

Objective C v C Speed / Benchmarks

I've become increasingly interested in Objective C recently (as an alternative to C++). It is a much cleaner object-oriented extension to C that dodges a lot of the pitfalls present in C++.

One of the "complaints" about Objective C is that message passing (Objective C's method of communicating with objects) is slower than C++ and C. Because of the nature of Objective C, this is likely true. I set out to discover just how much Objective C's dynamic messages would cost me.

A blog post by Jonathan Hohle showed some interesting numbers for a recursive Fibonacci calculator. Hohle's post showed that Objective C's message passing costs pushed the runtime of the Objective C version of the code up to nearly triple that of the C version. Using a few hacks (caching the function pointer that a message lookup resolves to) reduced the difference to double.
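
That hack is commonly known as IMP caching. A minimal sketch of the idea (SomeObject and its length method are hypothetical; the cast assumes the class inherits from NSObject):

#import <Foundation/Foundation.h>

// look the method up once, then call through the plain C function
// pointer, skipping the objc_msgSend dispatch on every iteration
long sumLengths(SomeObject* obj, int reps) {
	SEL sel = @selector(length);
	int (*lengthFn)(id, SEL) = (int (*)(id, SEL))[obj methodForSelector:sel];
	long total = 0;
	int i;
	for (i = 0; i < reps; i++) {
		total += lengthFn(obj, sel);
	}
	return total;
}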

Hohle's results left me skeptical and wanting a bit more, so I created my own little benchmark with an Objective C and a C implementation of a linked list. I tried to keep the two implementations as close to each other as possible, only introducing differences where an Objective C construct should be used (objects, most notably).

The code for the linked list implementations is exactly what you'd expect it to be, but it is available for download (as well as scripts to run the benchmark yourself) at the bottom of this post.
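
For reference, the C list's interface amounts to the following (reconstructed from the calls in main below; the actual header in the download may differ slightly):

/* linked_list_c.h, roughly: an opaque list of void pointers */
struct linked_list;

struct linked_list* ll_create();                  /* new empty list */
void ll_add(void* data, struct linked_list* ll);  /* add an item */
void* ll_pop(struct linked_list* ll);             /* remove and return an item */
int ll_length(struct linked_list* ll);            /* number of items */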

Before anyone jumps to any long-shot conclusions, obviously these benchmarks aren't perfect: not even close. I don't have access to the wide range of machines that would be needed to sufficiently gauge what is going on. I'm also aware that only testing a linked list implementation is not exhaustive. I was interested in the speed of various data structures like a linked list, so that's what I tested. Take everything below with a handful of salt.

The actual benchmark code (that utilized the linked list) looks like this:

C:

#include "linked_list_c.h"
#include <stdlib.h>
#include <stdio.h>


int main(int argc, char** argv) {
	int SIZE = atoi(argv[1]);

	// make a new linked list
	struct linked_list* ll = ll_create();
	
	int* d = calloc(SIZE, sizeof(int));
	int i;
	for (i=0; i < SIZE; i++) {
		d[i] = i;
		ll_add(&(d[i]), ll);
	}
	
	printf("Size of list: %d\n", ll_length(ll));
	int sum = 0;
	while (ll_length(ll) != 0) {
		sum +=  *(int*)ll_pop(ll);
		struct linked_list* lli = ll_create();
		for (i=0; i < sum; i++) {
			ll_add(&(d[1]), lli);
		}

		while (ll_length(lli) != 0) {
			sum -= *(int*)ll_pop(lli);
		}
		free(lli);
	}
	printf("Sum: %d\n", sum);
	
	
	free(d);
	free(ll);
	return 0;
}

And its Objective C counterpart:

#include "linked_list.h"
#include <stdlib.h>
#include <stdio.h>



int main(int argc, char** argv) {
	int SIZE = atoi(argv[1]);
	
	// make a new linked list
	LinkedList* ll = [[LinkedList alloc] init];
	
	int* d = calloc(SIZE, sizeof(int));
	int i;
	for (i=0; i < SIZE; i++) {
		d[i] = i;
		[ll add: &(d[i])];
	}
	
	printf("Size of list: %d\n", [ll length]);
	int sum = 0;
	while ([ll length] != 0) {
		sum +=  *(int*)[ll pop];
		LinkedList* lli = [[LinkedList alloc] init];
		for (i=0; i < sum; i++) {
			[lli add: &(d[1])];
		}

		while ([lli length] != 0) {
			sum -= *(int*)[lli pop];
		} 

		[lli release];
	}
	printf("Sum: %d\n", sum);
	
	
	free(d);
	[ll release];
	return 0;
}

If you're anything like me, you prefer the look and feel of the Objective C code, but only if the performance costs aren't too high.

I tested my code on two different systems: an i7 920 Linux box (Debian stable Linux 2.6.32-5-amd64) and a MacBook Air 13" (OS X 10.7.3, Darwin 11.3.0). Since the majority of my development targets Linux, I wanted to test both GCC and Clang on the Linux box. Apple's default compiler was used on the MacBook Air.

There is quite a bit of test data, and I've made it available in a publicly-accessible Google spreadsheet here: https://docs.google.com/spreadsheet/ccc?key=0AmQWGSF3o-iidEtCR2Yxa1lPR2FhUDZRc0wyRXVGckE

The first system/compiler tested is Clang, an LLVM compiler. The compiler options for both the GCC and Clang tests were provided by the gnustep-config utility.
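
For reference, the compile step on the Linux box looked something like this (a sketch; the file names are illustrative, and the exact flags are whatever gnustep-config emits on your system):

gcc `gnustep-config --objc-flags` main.m linked_list.m -o bench `gnustep-config --base-libs`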

Clearly, the Objective C version is slower. It might be more helpful to look at a graph of the percent differences between the two implementations:

Despite some strangeness with low data sizes, Clang's compilation of my Objective-C implementation runs around 30% slower than Clang's compilation of my C implementation. The strange behavior at lower ranges is likely due to the initial overhead of the first message passed to an object.

[Graph: Clang's Objective C is about 30% slower than Clang's C]

Similar tests for GCC:

Once again, Objective C is obviously slower, but the gap looks quite a bit smaller. The percent-difference graph is more helpful.

GCC's Objective C binary seems to approach 25% slower than C as opposed to Clang's 30%, possibly due to the maturity of GCC.

[Graph: GCC's Objective C is about 25% slower than GCC's C]

Since GCC's C binaries tend to execute faster than Clang's, it is no surprise that GCC's Objective C binaries also execute faster than Clang's Objective C binaries. The difference, however, is not that large.

Generally, GCC's Objective C binary seems to be 4-5% faster than Clang's Objective C binary, although the shape of the graph seems to indicate that the difference may shrink even more as the problem size increases.

[Graph: GCC's Objective C is 4-5% faster than Clang's Objective C]

One would imagine that Apple's compiler would be especially good at compiling Objective C, given Apple's large commitment to the language. Benchmarks of the same code run on a MacBook Air are below.

It appears as though Apple's Objective C compiler produces binaries that are around 18% slower than Apple's C binaries. The shape of the graph suggests that the percentage may decrease further with an increased problem size.

[Graph: Apple's compiler, Objective C vs. C]

In conclusion, Objective C sucks a bit of the performance out of your application if you depend on objects/messages for high-use data structures. While these numbers certainly have their issues (there's no real comparison of the Apple compiler to any other compiler), they do suggest that performance-critical chunks of code ought to remain in C (or, *shudder*, Fortran).

In the HPC world, Greenspun's tenth rule seems to hold true: "any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp." Meaning, of course, that certain higher-level constructs (like objects) inevitably become needed. So while Objective C shouldn't be replacing our high-performance data structures yet, it could be used to implement some of the more complex parts of our applications.

Perhaps in a scenario where I could take advantage of distributed objects, I would consider Objective C.

There's no doubt that Objective C is a well thought-out and elegant language, and it is possible that there are various constructs that, when implemented in Objective C, would actually be faster than their written-by-me C counterpart. Until I run into one, I'll just continue hammering away in C.

While it is out of the scope of this post (and many others have done a much better job), GCC did seem to produce faster C binaries than Clang.

You can grab the code here.

Android Debate v0.6 Released

I (finally) put out a new version of Android Debate today. Functionality-wise, it isn't too much different from the previous version, but it is a lot cleaner, and you can search through judges and view their paradigms!

Changelog:

- Choose timer sounds
- Progress bar!
- Fully configurable times
- Judge paradigms (currently they just open as links, but that will be changed in a future version)
- New look and feel!

I imagine I’ll have a lot more time this summer to work on the app, and I plan on adding a lot more features, including many that integrate with DebateResults.

Check it out on your phones or on Google Play here: https://play.google.com/store/apps/details?id=info.rmarcus.debate.

Lyapunov Fractals in Python: Performance and Pretty Pictures

I spent some time earlier this week playing around with Lyapunov fractals in Python. While normally I wouldn’t consider Python to be a performance-packing language, the ease-of-use (programmer cycles vs. machine cycles) inherent in Python has always been attractive. However, I was able to speed up my Python code substantially using Psyco, the Python multiprocessing module, and a fast implementation of Python called PyPy.

Lyapunov fractals are generated from a specific binary sequence, generally denoted with A’s and B’s. For example, the sequence “ABA” produces a different fractal than the sequence “BBAA.” An algorithm for generating these fractals is described by the Wolfram folk here.

Lyapunov fractals are formed by computing, for each point on a 4.0x4.0 region of the plane, a Lyapunov exponent (which, from what I can determine, generally falls between -2.0 and 2.0). A visual representation of the fractal is then formed by associating each exponent with a color.
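
The core of the exponent calculation is short. A minimal sketch (the function name and parameters are mine; my actual implementation differs in details like warm-up counts and file output):

from math import log

# Lyapunov exponent of the forced logistic map x -> r*x*(1-x), where r
# alternates between a and b according to seq (e.g. "AB")
def lyapunov_exponent(a, b, seq="AB", depth=500):
   x = 0.5
   # warm up so the orbit leaves the critical point x = 0.5
   for n in range(depth):
      r = a if seq[n % len(seq)] == "A" else b
      x = r * x * (1.0 - x)
   total = 0.0
   for n in range(depth):
      r = a if seq[n % len(seq)] == "A" else b
      total += log(abs(r * (1.0 - 2.0 * x)))  # log |f'(x)|, f'(x) = r*(1-2x)
      x = r * x * (1.0 - x)
   return total / depth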

Because of the computationally intense aspects of calculating an exponent, I decided to create one program to calculate all the exponent data, saving it to a file*, and another program to create an image from it. My source is available at the end of the post.

After I'd written my code, I felt the wrath of Python in the form of slowness. After some trivial optimizations, I found Psyco and the Python multiprocessing module.

Psyco is pretty straightforward: it does a bunch of JIT-style compilation (read: magic) on vanilla Python to speed it up. It is pretty easy to use:

import psyco
psyco.full()

Or, if you are slightly more error-cautious:


try:
   import psyco
   psyco.full()
   print "Psyco'd!"
except ImportError:
   print "No psyco"
   pass

The Python multiprocessing library is a real gem. It provides a parallel version of the “map” function. This is pretty awesome — since each exponent calculation depends only upon a point (embarrassingly parallel), it is easy to create a function that takes in a point and spits out an exponent.

Of course, all the concerns of parallel programming still apply. You can’t have any race conditions, and your code needs to be conceptually parallel. Python takes care of all the messy bits (semaphores, join/fork management, etc.) for you.

If you were using the map() function like this:

a = map(function, list)

… you can simply replace that line with:


try:
   from multiprocessing import Pool
   pool = Pool(8)
   print "Parallel!"
   a = pool.map(function, list)
except ImportError:
   print "Not parallel"
   a = map(function, list)

This will perform the map function in parallel (if the multiprocessing module is available) using 8 workers (which is pretty optimal for an i7 920).

I also discovered an alternative flavor of Python called PyPy, which is a performance-driven Python that uses technology similar to Psyco. Sadly, PyPy doesn’t have an implementation of the multiprocessing module, so I was forced to choose between PyPy and traditional Python with multiprocessing and Psyco. I was incredibly doubtful that PyPy’s speed would be able to make up for parallelism and Psyco. I’ll let the numbers talk for me.

The test system contains an un-overclocked i7 920 and 3GB of DDR3 RAM. These numbers are from generating a 1000×1000 fractal worth of data with the sequence “AB” to a depth of 500 (iterations). I run Ubuntu 10.10 32-bit.

Time in seconds. Smaller is better.

Trial     Python with Psyco and MP   Python with Psyco   Python with MP   Python    PyPy
Trial 1   36.4                       273                 253              2041      71
Trial 2   36.3                       268                 258              2042      66
Trial 3   35.2                       269                 255              2058      71
Average   35.97                      270.00              255.33           2047.00   69.33

Clearly, Python on its own is pretty darn slow. Psyco provides nearly a 10x speedup, and if you can write your code to take advantage of the multiprocessing library, you can go really fast.

The PyPy results, compared to the Psyco results, are quite impressive. If PyPy had an implementation of the multiprocessing library (which I'm fairly certain it doesn't — feel free to correct me), it would probably dominate.

The multiprocessing results for this test aren’t completely fair — generating fractals is an embarrassingly parallel problem. But, of course, if you can break your problem up into parallel pieces, you can expect a relatively big speed bump.

Conclusion: Python doesn't have to be as slow as Python normally is. Drop-in replacements like Psyco can greatly accelerate code. If your code is vanilla enough not to require many external libraries (like PIL), then PyPy can probably give you quite the boost as well.

You can download my source here.

Here are some pretty pictures:

* A note about pickle: I realize that many Python developers would’ve decided to use pickle (or the faster cPickle) to save their exponent data to a file. I found cPickle to be far, far slower than just writing out comma separated values.
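
The CSV route is about as simple as it gets. A sketch (illustrative, not my exact output code):

# write a grid of exponents as comma-separated values,
# one image row per line
def save_exponents(grid, path):
   f = open(path, "w")
   for row in grid:
      f.write(",".join(str(x) for x in row) + "\n")
   f.close()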