Ryan Marcus, assistant professor at the University of Pennsylvania. Using machine learning to build the next generation of data systems.
Code
I think that all researchers, but especially systems researchers, should strive to open source their research artifacts. I believe this has several benefits, including, but not limited to:
- Open sourcing allows other researchers (and the world) to “look inside” a precise implementation of the ideas in a paper, often improving understanding and making mistakes easier to spot and correct.
- Open sourcing lets other people reproduce your results. When someone questions whether an idea works as well as a paper claims, the ability to download the code and try it for themselves is invaluable.
- Open sourcing enables faster and fairer comparisons. When systems are not open sourced, future researchers are forced to re-implement them in order to conduct experimental comparisons, and such re-implementations are often undertuned or incorrect. Releasing a binary is not a suitable alternative: important details are often excluded from papers due to space constraints, and while those details would be easy to recover from source code, they cannot be recovered from a binary.
For these reasons, in 2020 I committed to open sourcing all of my own research as quickly as possible, and to the greatest degree I can. I will try to keep the list below up to date with my contributions and those of my co-authors. You can find my “recreational” code on my GitHub.
- The Kepler parameterized query optimizer, which pairs an evolutionary algorithm with a robust predictor
- The Bao learned query optimizer (PostgreSQL prototype)
- The cardinality estimation benchmark, including query generation
- A tree convolution implementation, as used in Neo and Bao (a simplified sketch of the operation appears after this list)
- IMDb PostgreSQL Benchmarks & Vagrant box
- A benchmark of learned indexes
- The official RMI implementation, along with its automatic optimizer, CDFShop (a simplified lookup sketch also appears after this list)
- The RadixSpline single-pass learned index
- A reinforcement-learning-powered garbage collector for CPython
- The Park project, a benchmark for RL-powered systems
- NashDB, an economics-driven approach to cloud database fragmentation, replication, and provisioning
- WiSeDB, an ML-powered resource provisioning, query placement, and query scheduling technique
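To give a flavor of the tree convolution operation used in Neo and Bao: a shared filter slides over every (parent, left child, right child) triple in a binary query-plan tree, producing a new feature vector per node. The sketch below is a deliberately simplified, hypothetical NumPy version (the `PlanNode` class and all weight names are illustrative); the real implementation linked above is a vectorized PyTorch module.

```python
import numpy as np

class PlanNode:
    """A binary query-plan tree node carrying a feature vector (illustrative)."""
    def __init__(self, feat, left=None, right=None):
        self.feat = np.asarray(feat, dtype=float)
        self.left = left
        self.right = right

def tree_conv_layer(node, w_parent, w_left, w_right, bias):
    """Apply one tree-convolution filter: each node's output features
    combine its own features with those of its (possibly absent)
    children, using weights shared across the whole tree."""
    if node is None:
        return None
    zero = np.zeros(w_left.shape[0])  # stand-in features for a missing child
    lf = node.left.feat if node.left else zero
    rf = node.right.feat if node.right else zero
    new_feat = np.maximum(  # ReLU nonlinearity
        node.feat @ w_parent + lf @ w_left + rf @ w_right + bias, 0.0)
    return PlanNode(new_feat,
                    tree_conv_layer(node.left, w_parent, w_left, w_right, bias),
                    tree_conv_layer(node.right, w_parent, w_left, w_right, bias))

# Hypothetical usage: 4-dimensional node features, 8-dimensional outputs.
rng = np.random.default_rng(0)
w_p, w_l, w_r = (rng.normal(size=(4, 8)) for _ in range(3))
plan = PlanNode([1, 0, 0, 1], PlanNode([0, 1, 0, 0]), PlanNode([0, 0, 1, 0]))
out = tree_conv_layer(plan, w_p, w_l, w_r, np.zeros(8))
print(out.feat.shape)  # (8,)
```

Because the filter weights are shared across all triples, the layer handles plans of any shape and size, much like an image convolution handles images of any resolution.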
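Similarly, the core RMI idea can be sketched in a few dozen lines: a root model predicts which second-stage model to use, that leaf model predicts the key's position in a sorted array, and a bounded binary search corrects the prediction. This is a minimal illustration only, not the official Rust implementation linked above; the `TwoStageRMI` class, the use of simple linear fits, and the fixed fanout are all simplifying assumptions.

```python
import bisect
import numpy as np

class TwoStageRMI:
    """Illustrative two-stage recursive model index: a root linear model
    routes each key to one of `fanout` leaf linear models, each of which
    predicts the key's position in a sorted array."""

    def __init__(self, keys, fanout=64):
        self.keys = np.asarray(keys)  # assumed sorted
        self.fanout = fanout
        positions = np.arange(len(self.keys))

        # Stage 1: fit position ~ key over the whole array; the root's
        # scaled prediction selects a leaf model.
        self.root = np.polyfit(self.keys, positions, 1)

        # Stage 2: fit one linear model per leaf, over just the keys
        # the root routes to that leaf.
        leaf_ids = self._route(self.keys)
        self.leaves = []
        for i in range(fanout):
            mask = leaf_ids == i
            if mask.sum() >= 2:
                self.leaves.append(np.polyfit(self.keys[mask], positions[mask], 1))
            else:
                self.leaves.append(self.root)  # sparse leaf: reuse the root model

        # The worst-case prediction error bounds the final search window.
        preds = np.array([np.polyval(self.leaves[l], k)
                          for l, k in zip(leaf_ids, self.keys)])
        self.err = int(np.ceil(np.abs(preds - positions).max())) + 1

    def _route(self, keys):
        pred = np.polyval(self.root, keys)
        return np.clip((pred / len(self.keys) * self.fanout).astype(int),
                       0, self.fanout - 1)

    def lookup(self, key):
        leaf = self.leaves[self._route(np.asarray([key]))[0]]
        guess = int(np.polyval(leaf, key))
        lo = max(0, guess - self.err)
        hi = min(len(self.keys), guess + self.err + 1)
        # Last-mile correction: binary search inside the error window.
        i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
        return i if i < len(self.keys) and self.keys[i] == key else None

# Hypothetical usage:
keys = np.sort(np.random.default_rng(0).uniform(0, 1e6, size=10_000))
rmi = TwoStageRMI(keys)
assert rmi.lookup(keys[1234]) == 1234
```

The key property on display is that the index stores only a handful of model parameters plus one error bound, trading the pointer-chasing of a B-tree for two model evaluations and a small bounded search.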