Introduction

This is the documentatation for Bao for PostgreSQL.

GitHub: https://learned.systems/bao

Bao is designed for PostgreSQL 12. We've tested that the Bao code compiles for version 12.17, and suggest starting with that version.¹ You'll also need Python 3.8 or later.

Bao is a learned query optimizer for PostgreSQL. Bao works by providing automatic coarse-grained query hints (e.g., SET enable_nestloop TO off) on a per-query basis. Bao uses reinforcement learning, so Bao learns from it mistakes.

Bao has two components: the Bao server, which is a standalone Python application, and the PostgreSQL extension, which integrates directly with PostgreSQL and communicates with the Bao server. The best way to try out Bao is to follow the tutorial.

This implementation has a number of features:

In the default configuration, Bao works as a learned query optimizer, providing coarse-grained hints to the PostgreSQL query planner and incorporating feedback from query execution to improve its recommendations.
Bao provides a continually-updated query performance prediction model that is custom-tailored to your DB and workload. Even if you do not use Bao for query optimization, you can still use it to predict the runtime of your queries. Runtime predictions are made available via EXPLAIN.
Bao can be used as an advisor, simply providing the coarse-grained hints that Bao would use if Bao were running as a full optimizer. This allows you to manually apply Bao's recommendations to only a few queries.
Since Bao uses reinforcement learning, Bao must balance exploration and exploitation, and will occasionally try out a query plan that may be slower than the one chosen by PostgreSQL; you have to make mistakes in order to learn! However, when regressions on certain queries are unacceptable, these special queries can be pre-explored using Bao's exploratory mode. Queries added to Bao's exploratory mode are tested at user-defined times, and future Bao models are checked to properly handle these queries. Bao will never pick a regressed query plan for a query processed in exploratory mode.
Separate server process that can run on a different machine from your database. You can offload model training, potentially to a machine with a GPU. You can also co-locate multiple Bao servers together, if you have multiple DBs, so they share training resources.

Bao is provided under a GPLv3 license. Specifically, please note:

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

Thanks to Yao Tian for determining applicable PG versions!

Bao Tutorial

In this tutorial, we'll set up Bao with PostgreSQL, load the IMDB dataset, and then execute a small sample workload with and without Bao to see the difference. Then, we'll look at how Bao's exploratory mode can be used to prevent query regressions. Finally, we will look at how Bao can be used as a query advisor.

PostgreSQL Setup

In this tutorial, we'll use the IMDB dataset from the "How Good are Query Optimizers, Really?"¹ paper. The fastest way to get started is to use the pre-built Docker image which contains a PostgreSQL database and the compiled Bao extension. Yu can launch a container like this:

$ docker run -p 127.0.0.1:5432:5432/tcp --add-host host.docker.internal:host-gateway --shm-size=8g ryanmarcus/imdb_bao:v1

If you don't want to use the Docker container, follow the rest of the instructions on this page. Otherwise, continue to the next step.

To load the IMDB data into Postgres, there are two options. You can setup a PostgreSQL database with this data via a virtual machine, or you can download a a PostgreSQL dump from the Harvard dataverse and manually load the data yourself.

Don't forget to configure PostgreSQL with sane defaults. The VM setup will do this automatically, but at a minimum you should set the shared_buffers variable to something larger than the default (around 25% to 40% of your total RAM is recommended).

Assuming you've used the virtual machine, we can test that the DB was setup correctly:

# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb 
psql (12.3)
Type "help" for help.

imdb=# select count(*) from title;
  count  
---------
 2528312
(1 row)

Install the Bao extension

With our PostgreSQL database setup, it is time to install Bao. If you are using a virtual machine, the following steps should be performed from within the VM (i.e., type vagrant ssh to login).

$ git clone https://github.com/learnedsystems/BaoForPostgreSQL
$ cd BaoForPostgreSQL
$ ls
bao_server  branding  COPYING  LICENSE  pg_extension  
README.md  run_queries.py  sample_queries

The directory pg_extension contains the code for the PostgreSQL extension. We'll install that next, using the PGXS system. Make sure your machine has the these installed.

For Ubuntu, you need these packages:

sudo apt-get install postgresql-server-dev-all postgresql-common

For Arch Linux:

sudo pacman -S postgresql-libs

With the correct packages installed, we can proceed to install the Bao extension. First, turn off PostgreSQL:

systemctl stop postgresql

Next, build and install the Bao PostgreSQL extension.

# depending on your setup, this may require sudo
cd pg_extension
make USE_PGXS=1 install

If everything goes correctly, you should see output like below:

$ make USE_PGXS=1 install
/usr/bin/clang -Wno-ignored-attributes -fno-strict-aliasing -fwrapv -O2  -I. -I./ -I/usr/include/postgresql/server -I/usr/include/postgresql/internal  -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2  -flto=thin -emit-llvm -c -o main.bc main.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fPIC -I. -I./ -I/usr/include/postgresql/server -I/usr/include/postgresql/internal  -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I/usr/include/libxml2   -c -o main.o main.c
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fPIC -shared -o pg_bao.so main.o  -L/usr/lib  -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -L/usr/lib  -Wl,--as-needed  
/usr/bin/mkdir -p '/usr/lib/postgresql'
/usr/bin/mkdir -p '/usr/share/postgresql/extension'
/usr/bin/mkdir -p '/usr/share/postgresql/extension'
/usr/bin/install -c -m 755  pg_bao.so '/usr/lib/postgresql/pg_bao.so'
/usr/bin/install -c -m 644 .//pg_bao.control '/usr/share/postgresql/extension/'
/usr/bin/install -c -m 644 .//pg_bao--0.0.1.sql  '/usr/share/postgresql/extension/'
/usr/bin/mkdir -p '/usr/lib/postgresql/bitcode/pg_bao'
/usr/bin/mkdir -p '/usr/lib/postgresql/bitcode'/pg_bao/
/usr/bin/install -c -m 644 main.bc '/usr/lib/postgresql/bitcode'/pg_bao/./
cd '/usr/lib/postgresql/bitcode' && /usr/bin/llvm-lto -thinlto -thinlto-action=thinlink -o pg_bao.index.bc pg_bao/main.bc

Now that the Bao extension is installed, we have to tell PostgreSQL to load it. To do this, modify your postgresql.conf file to load the pg_bao shared library. If you are using the VM, you can do this like so (as a superuser):

echo "shared_preload_libraries = 'pg_bao'" >> /media/data/pg_data/data/postgresql.conf

Next, we restart PostgreSQL:

systemctl restart postgresql

We can now reconnect to the database and test to make sure Bao is installed:

# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb 
psql (12.3)
Type "help" for help.

imdb=# SHOW enable_bao;
 enable_bao 
------------
 off
(1 row)

If the enable_bao is present (it defaults to off), then the Bao extension is installed. If instead you see an error like below, then the Bao extension has not been installed:

imdb=# show enable_bao;
ERROR:  unrecognized configuration parameter "enable_bao"

Notes

Leis, Viktor, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. “How Good Are Query Optimizers, Really?” PVLDB, VLDB ’15, 9, no. 3 (2015): 204–215. https://doi.org/10.14778/2850583.2850594.

Bao Server Setup

After configuring the PostgreSQL extension, we'll next setup the Bao server. The Bao server is responsible for loading, maintaining, and serving the learned query optimization model. For now, we'll assume that the Bao server will run on the same machine as PostgreSQL, although this is not required.

The Bao server is configured with the BaoForPostgreSQL/bao_server/bao.cfg configuration file, but the defaults should work fine for this tutorial. Before starting the server, several Python dependencies are required. If you are using the VM, these are already installed. On Arch Linux, these can be installed by:

pacman -S python-scikit-learn python-numpy python-joblib python-pytorch-opt

If you are using a different Linux distribution, the package names may be slightly different. If you'd prefer, you can also install these dependencies with pip:

pip3 install scikit-learn numpy joblib
pip3 install torch==1.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Once these dependencies are installed, we can launch the Bao server.

$ cd bao_server
$ python3 main.py
Listening on localhost port 9381
Spawning server process...

With the Bao server running, we can now test to see if PostgreSQL can connect to it. If you are not running the Bao server on the same node as PostgreSQL, you'll need to change ListenAddress in bao.cfg and set the PostgreSQL variables bao_host and bao_port.

# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb 
psql (12.3)
Type "help" for help.

imdb=# SET enable_bao TO on;
SET
imdb=# EXPLAIN SELECT count(*) FROM title;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bao prediction: NaN
 Bao recommended hint: (no hint)
 Finalize Aggregate  (cost=50166.51..50166.52 rows=1 width=8)
   ->  Gather  (cost=50166.29..50166.50 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=49166.29..49166.30 rows=1 width=8)
               ->  Parallel Seq Scan on title  (cost=0.00..46532.63 rows=1053463 width=0)
(7 rows)

If everything is setup correctly, you should see two lines in the EXPLAIN output related to Bao -- the prediction and the hint. Since Bao currently has no experience, it has no model, so there's no prediction or hint provided.

Next, we'll test to make sure Bao can correctly record feedback from PostgreSQL query executions. In the same session (with enable_bao set to ON), execute a query:

imdb=# SELECT count(*) FROM title;
  count  
---------
 2528312
(1 row)

If you look at the stdout of the Bao server, you should see a line like:

Logged reward of 2103.556027

This indicates that Bao has recorded a runtime of 2 seconds for this simple query plan. Note that Bao does not record your SQL queries, only a scrubbed version of your query plans. Bao keeps the type of each node, the estimated cost and cardinality, and the names of the involved relations. Bao never stores, for example, the predicate values from an executed query. Here's what Bao stores for the above query:

{
  "Plan": {
    "Node Type": "Other",
    "Node Type ID": "42",
    "Total Cost": 50166.515833,
    "Plan Rows": 1,
    "Plans": [
      {
        "Node Type": "Other",
        "Node Type ID": "45",
        "Total Cost": 50166.500833,
        "Plan Rows": 2,
        "Plans": [
          {
            "Node Type": "Other",
            "Node Type ID": "42",
            "Total Cost": 49166.300833,
            "Plan Rows": 1,
            "Plans": [
              {
                "Node Type": "Seq Scan",
                "Node Type ID": "19",
                "Relation Name": "title",
                "Total Cost": 46532.633333,
                "Plan Rows": 1053463
              }
            ]
          }
        ]
      }
    ]
  },
  "Buffers": {
    "title_pkey": 1,
    "kind_id_title": 1
  }
}

Run an example workload

Next, we'll use Bao to execute a small sample workload. Then, we'll execute that same workload without Bao, and analyze the results.

The sample_queries folder in the repository root contains 40 sample queries drawn from the original join order benchmark (JOB)¹ and from extended JOB²:

$ cd sample_queries
$ ls
q10_2a265.sql  q17_7a164.sql  q23_19d.sql    q29_6e.sql      q36_7a136.sql   q5_8a423.sql
q11_17e.sql    q18_7a103.sql  q24_32a.sql    q30_18c.sql     q37_2a1291.sql  q6_16b.sql
q12_17a.sql    q1_8a463.sql   q25_13d.sql    q31_2a39.sql    q3_7a99.sql     q7_7a48.sql
q13_7a121.sql  q19_2a471.sql  q26_2a274.sql  q32_2a493.sql   q38_2a1870.sql  q8_6a505.sql
q14_6a349.sql  q20_24b.sql    q27_3c.sql     q33_2a156.sql   q39_2a2781.sql  q9_5a48.sql
q15_18a.sql    q21_2a396.sql  q28_13a.sql    q34_1a275.sql   q40_2a8120.sql
q16_26c.sql    q22_8a27.sql   q2_8a82.sql    q35_1a1508.sql  q4_8a122.sql

The run_queries.py script will execute a random workload with 500 queries drawn from these samples. First, 25 queries will be executed to provide some basic training data for Bao. Then, and for every 25 queries processed afterwards, the script will pause query execution to retrain Bao's model.

The run_queries.py script assumes your DB is reachable on localhost, with the username imdb in the database imdb. If this is not the case, modify the PG_CONNECTION_STR variable at the top of the file.

Start the run:

$ python3 run_queries.py sample_queries/*.sql | tee ~/bao_run.txt

We use the tee command to both show us the output and redirect the output to a file, which we analyze later. Grab a coffee, this run will take a while to finish (around 75 minutes on my hardware).

Next, once this run is finished, change the line in run_queries.py:

USE_BAO = True

To:

USE_BAO = False

This will cause the run_queries.py script to execute the exact same workload, but without using Bao to select query plans and without retraining a model every 25 queries. Start the run:

$ python3 run_queries.py sample_queries/*.sql | tee ~/pg_run.txt

... and grab another coffee. This took just under 3 hours on my hardware. The fact that the workload finishes faster with Bao enabled is already telling, but next we will analyze these two runs in detail.

Notes

Marcus, Ryan, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. “Neo: A Learned Query Optimizer.” PVLDB, VLDB ’19, 12, no. 11 (2019): 1705–18.

Analyzing the executions

With logs of our saved runs in bao_run.txt and pg_run.txt, we can next analyze them. If you are using the VM, you can move these files to /vagrant/ to access them from outside the VM.

To analyze the logs, run the analyze_bao.ipynb Python notebook. Make sure bao_run.txt and pg_run.txt are in the same folder. You'll need Jupyter to run the notebook if you don't already have it.

The notebook will first generate a queries complete vs. time graph. Depending on your hardware, the results will vary. Below is what I saw on my hardware:

Queries completed over time

The graph shows the number of queries that have been completed since the start of the experiment, including the training time of Bao. Each red circle indicates a time when query execution was paused and the Bao model was trained. Obviously, when deployed, one does not have to pause query execution to retrain the Bao model, and this retraining can either be offloaded to another machine or done at a fixed time during the day, concurrent with query processing. On my hardware, Bao executed the example workload about twice as fast as the default PostgreSQL optimizer.

Where do Bao's gains come from? Finishing the workload faster could come from making every query a little faster, or making a few queries much faster. To examine this, we look at the CDF of query times, which is the next pair of graphs generated by the notebook.

CDF of query completion time

The left figure shows the CDF with a linear y-axis, whereas the right figure shows the same CDF with a log scale. We can see that the majority of Bao's gains come from decreasing latency "at the tail" of the distribution: the 90%, 95%, etc. Bao incurs a slightly longer longest-running query (best visible on the left plot), and increases the query time of the fastest queries slightly (best visible on the right plot).

Bao increases the latency of the fastest query by a small margin because of increased optimization time. This is because query planning with Bao requires executing the PostgreSQL planner several times, and running inference through a neural network. If you care about a particular query taking 0.01 seconds instead of 0.05 seconds, you can always disable Bao on a per-query basis.

What about query regressions? The next table shows the latency of each query when using the PostgreSQL optimizer, and the worst / best time achieved by Bao.

	PG	Bao worst	Bao best
Q
q1	275.415884	12.206382	6.005776
q2	71.049927	198.310226	9.242487
q3	10.982070	290.048801	10.805816
q4	26.890862	26.966064	1.527303
q5	9.692364	9.354480	1.350012
q6	21.741243	19.851484	7.341236
q7	51.935738	51.321676	7.288905
q8	28.725613	15.981973	5.995388
q9	15.645138	16.394102	7.327004
q10	11.720967	9.688347	7.373339
q11	15.163100	7.686548	5.853226
q12	12.934380	9.379889	4.565600
q13	18.687008	11.803825	3.417922
q14	11.100027	14.864732	7.060695
q15	9.641760	8.258874	4.153027
q16	5.312640	7.992982	1.221813
q17	6.404161	17.702658	5.868285
q18	11.912653	20.336241	6.772051
q19	9.943220	33.939818	10.330661
q20	0.143906	0.679753	0.344254
q21	1.022706	1.292618	0.921263
q22	16.113360	51.231996	8.196555
q23	12.050350	13.501636	7.194857
q24	0.025990	0.153196	0.100763
q25	3.906976	5.511800	2.178743
q26	10.439918	16.880665	7.772304
q27	0.759958	1.491062	0.461650
q28	1.784515	2.679448	1.671500
q29	0.279165	0.263327	0.113964
q30	6.967469	7.600260	5.197018
q31	1.877799	3.540210	1.459715
q32	0.981562	7.478595	0.731711
q33	2.215288	4.805330	1.749660
q34	5.175736	8.535833	2.692968
q35	6.402113	14.327323	5.302665
q36	11.992452	15.665057	8.864666
q37	12.208148	22.470210	9.716367
q38	13.334725	34.440121	10.314824
q39	8.051381	16.096334	8.621814
q40	14.709921	19.853354	11.598819

We can see immediately that Bao has huge gains on Q1, which takes almost 5 minutes for PostgreSQL to process. However, on Q2 and Q3, while Bao normally found a pretty good plan (the Bao best column), sometimes Bao picked a regressing plan that took significantly longer to execute than the PostgreSQL plan.

For some applications, the raw workload speedup achieved by Bao may be more important than any of these query regressions. For other applications, these query regressions may be a huge problem. Next, we discuss how to use Bao's exploration mode to avoid these regressions.

Use Exploration Mode

Bao is powered by reinforcement learning, and must therefore strike a balance between the exploration of new plans and the exploitation of plans that are known to be good. If you do no exploration, you'll never do any better than PostgreSQL -- if you do too much exploration, you'll suffer many query regressions.

Bao uses an algorithm called Thompson sampling, which has a theoretical guarantee about regret. Regret is the difference in performance between the plan chosen by Bao and the (unknown) best possible choice. Thompson sampling ensures that, in the long run, regret approaches zero.¹ This is considered to be an optimal balance of exploration and exploitation.

However, sometimes "in the long run" isn't good enough for an application. For example, you may need to ensure that a particular query never regresses. In order to do this, Bao supports exploration mode, a special facility that lets Bao explore plans for specific queries offline, and then ensure the best plan is always chosen at runtime.

At a high level, exploration mode works as follows. You tell Bao about a particular SQL query that you never want to regress. Then, you give Bao a fixed period of time -- like 20 minutes -- to run experiments against your database. Bao uses this time (and no more) to run as many experiments as it can, saving the results. When a new model is trained, that model is checked to make sure it would make the right decision for each executed experiment. If the model would not, the model is retrained, with increased emphasis on those experiments.

We'll start from scratch, removing our previous Bao model and the experience we observed. Stop the Bao server (i.e., Control + C), then delete the model, Bao DB, and restart PostgreSQL:

$ rm -rf bao_server/bao_default_model
$ rm bao_server/bao.db
$ systemctl restart postgresql

Configure Bao to talk to PostgreSQL

Until now, the PostgreSQL extension has communicated with the Bao server, but the Bao server has never directly connected to the database. For exploration mode, we'll need such a connection. Edit the bao_server/bao.cfg file to tell Bao how to connect to your PostgreSQL instance:

# ==============================================================
# EXPLORATION MODE SETTINGS
# ==============================================================

# maximum time a query should reasonably take (used in
# exploration mode).
MaxQueryTimeSeconds = 120

# psycopg2 / JDBC connection string to access PostgreSQL
# (used by the experiment runner to prevent regressions)
PostgreSQLConnectString = user=imdb

MaxQueryTimeSeconds is an upper-bound on how long any non-regressed query plan you add to exploration mode should take. For this sample workload, 120 seconds was a reasonable value for this workload. Bao uses this value as a cutoff for its experiments, assuming any plan that takes longer than this amount of time must be a regression. Don't set this value too tightly, however, because Bao can still gain knowledge from observing how much a query plan regressed.
PostgreSQLConnectString is the JDBC-like string used by the Bao server to connect to the postgreSQL database. You can find documentation for it from the psycopg docs.

Testing the connection

To test the connection, we can use the baoctl.py script. You can find this script in the bao_server directory. It should be executed on the same machine as the Bao server is running.

Run baoctl.py --test-connection to see if Bao can connect to your PostgreSQL instance:

$ python3 baoctl.py --test-connection
Connection successful!

Adding exploration queries

Once Bao can connect to our PostgreSQL instance, we can add exploration queries. Let's add the first three workload queries:

$ python3 baoctl.py --add-test-query ../sample_queries/q1_8a463.sql 
Added new test query.
$ python3 baoctl.py --add-test-query ../sample_queries/q2_8a82.sql 
Added new test query.
$ python3 baoctl.py --add-test-query ../sample_queries/q3_7a99.sql 
Added new test query.

Start exploration

You can give Bao as much or as little time to execute experiments as you'd like. It is a good idea to provide at least a little longer than MaxQueryTimeSeconds, to ensure that at least one experiment is fully executed.

Each query added creates 5 experiments, so we currently have 15 un-ran experiments. We can see this by running baoctl.py --status:

$ python3 baoctl.py --status
Unexecuted experiments : 15
Completed experiments  : 0
Exploration queries    : 3

We can start executing these 15 experiments by running baoctl.py --experiment. We'll give Bao 30 minutes to run these experiments so that they can all be finished:

$ python3 baoctl.py --experiment 1800
We have 15 unexecuted experiment(s).
Running on backend PID 57718
...
Finished all experiments

Note: sometimes, a particular configuration will cause the PostgreSQL session to crash. When this is the case, PostgreSQL automatically restarts / recovers. However, since this can be an expensive operation, Bao ends any experimentation when this occurs. Bao notes when a configuration causes such a crash, and will not execute it again. If this occurs during the tutorial, restart the exploration process until every experiment finishes. When this occurs, the output looks like this:

Time remaining: 1539696 ms
Server down after experiment with arm 1
Treating this as a timeout and ceasing further experiments.
Logged reward of 240000
Finished all experiments

Once these experiments finish, we can re-run the entire workload and analyze the performance:

$ python3 run_queries.py sample_queries/*.sql | tee ~/bao_with_regblock.txt

Grab yet another coffee. This one will be faster, but will still take some time.

Notes

Precisely, the gaurantee is that the limit as time tends to infinity is that regret will tend towards zero. This gaurantee makes several assumptions about the underlying reward function, which do not strictly apply to query optimization. Nevertheless, Thompson sampling appears to be an effective algorithm in this domain.

Analyze the workload with exploration mode

We'll use the same notebook as before to analyze our new bao_with_regblock.txt results. You'll need to change the SHOW_RG = False line in the 2nd cell to SHOW_RG = True to plot both our previous run (without exploration mode) and our new run.

First, we'll look at queries completed vs. time.

Queries vs. time

The new green line shows the performance of Bao with our three test queries entered into experimental mode. In terms of overall workload performance, exploration mode doesn't help all tha tmuch: the workload finishes only a little bit faster.

Next, we'll look at the query latency CDFs.

Query latency CDF

The green line shows that tail latency has been significantly reduced, which is accounted for almost entirely by avoiding a few regressing query plans. We can verify this with the same table we looked at before:

	PG	Bao worst	Bao best	Bao + E worst	Bao + E best
Q
q1	275.415884	12.206382	6.005776	12.455495	6.148398
q2	71.049927	198.310226	9.242487	72.339166	10.068161
q3	10.982070	290.048801	10.805816	14.475681	5.646478
q4	26.890862	26.966064	1.527303	26.190447	1.468367
q5	9.692364	9.354480	1.350012	6.007022	1.319892
q6	21.741243	19.851484	7.341236	24.368830	8.481136
q7	51.935738	51.321676	7.288905	53.320758	8.071399
q8	28.725613	15.981973	5.995388	25.003634	5.759860
q9	15.645138	16.394102	7.327004	15.451688	7.699232
q10	11.720967	9.688347	7.373339	23.513964	7.451328
q11	15.163100	7.686548	5.853226	14.888963	10.835752
q12	12.934380	9.379889	4.565600	16.264112	4.621375
q13	18.687008	11.803825	3.417922	6.843567	3.754824
q14	11.100027	14.864732	7.060695	15.598768	7.015126
q15	9.641760	8.258874	4.153027	14.213644	3.983182
q16	5.312640	7.992982	1.221813	6.572023	1.271791
q17	6.404161	17.702658	5.868285	18.423400	5.759903
q18	11.912653	20.336241	6.772051	21.853793	6.889149
q19	9.943220	33.939818	10.330661	21.690789	10.453242
q20	0.143906	0.679753	0.344254	0.460239	0.357460
q21	1.022706	1.292618	0.921263	1.653735	0.884557
q22	16.113360	51.231996	8.196555	51.013479	7.209244
q23	12.050350	13.501636	7.194857	12.140767	7.034203
q24	0.025990	0.153196	0.100763	0.181307	0.107129
q25	3.906976	5.511800	2.178743	3.643400	2.291787
q26	10.439918	16.880665	7.772304	12.206994	6.855079
q27	0.759958	1.491062	0.461650	1.744088	0.467688
q28	1.784515	2.679448	1.671500	2.584871	1.753601
q29	0.279165	0.263327	0.113964	0.330460	0.128979
q30	6.967469	7.600260	5.197018	9.345165	5.021910
q31	1.877799	3.540210	1.459715	3.384164	1.439300
q32	0.981562	7.478595	0.731711	1.171264	0.626129
q33	2.215288	4.805330	1.749660	4.287353	1.832341
q34	5.175736	8.535833	2.692968	8.329981	2.556046
q35	6.402113	14.327323	5.302665	10.737619	5.267236
q36	11.992452	15.665057	8.864666	30.359994	10.390017
q37	12.208148	22.470210	9.716367	20.191838	10.248038
q38	13.334725	34.440121	10.314824	17.423221	10.154855
q39	8.051381	16.096334	8.621814	19.279665	8.618636
q40	14.709921	19.853354	11.598819	73.711118	10.908722

The first column shows the latency from the query plans produced by the PostgreSQL optimizer. The next two columns show the latency from the query plans produced Bao optimizer. The final two columns show our new results, the latency from the query plans produced by the Bao optimizer with exploration mode.

The large regressions on query 2 and 3 are eliminated, with both having a much more reasonble worst case time.

Bao as an advisor

For some applications, any amount of exploration -- and thus regression -- on any query is unacceptable. In these scenarios, it is possible to use Bao as an advisor instead of a full-blown optimizer.

To demonstrate this, let's use a simple psql session.

$ psql -U imdb -h localhost
psql (12.2, server 12.3)
Type "help" for help.

imdb=#

Bao can be controlled through three main session-level PostgreSQL configs.

enable_bao is the top level config. When set to off, the default, Bao does not observe or interfere with the query optimizer at all. When set to on, Bao will behave according to the values of enable_bao_rewards and enable_bao_selection.
enable_bao_rewards determines whether or not Bao collects additional experience from queries from this session.
enable_bao_selection determines whether or not Bao will use its value model to select query plans.

To use Bao as a pure advisor, we can set enable_bao but disable enable_bao_rewards and enable_bao_selection.

imdb=# SET enable_bao TO on;
SET
imdb=# SET enable_bao_selection TO off;
SET
imdb=# SET enable_bao_rewards TO off;
SET

Next, we'll execute a simple EXPLAIN statement:

imdb=# EXPLAIN SELECT count(*) FROM title;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bao prediction: 894.349 ms
 Bao recommended hint: (no hint)
 Finalize Aggregate  (cost=50165.40..50165.41 rows=1 width=8)
   ->  Gather  (cost=50165.19..50165.40 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=49165.19..49165.20 rows=1 width=8)
               ->  Parallel Seq Scan on title  (cost=0.00..46531.75 rows=1053375 width=0)
(7 rows)

imdb=#

Since enable_bao_selection is off, this plan is generated using the PostgreSQL optimizer, exactly as it would be if you were not using Bao. Two additional lines are added to the outputof the EXPLAIN plan:

Bao prediction shows the time that Bao thinks this query plan will take to execute. In this case, about a second.
Bao recommended hint shows the query hint that Bao would use if enable_bao_selection was on. In this case, Bao would not use any query hints.

Let's execute the query and run the same EXPLAIN statement again:

imdb=# SELECT count(*) FROM title;
  count  
---------
 2528312
(1 row)

imdb=# EXPLAIN SELECT count(*) FROM title;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bao prediction: 661.193 ms
 Bao recommended hint: (no hint)
 Finalize Aggregate  (cost=50165.40..50165.41 rows=1 width=8)
   ->  Gather  (cost=50165.19..50165.40 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=49165.19..49165.20 rows=1 width=8)
               ->  Parallel Seq Scan on title  (cost=0.00..46531.75 rows=1053375 width=0)
(7 rows)

Bao's prediction for the query changed, even though the query plan, cost estimates, and cardinality estimates are all the same as before! This is because the buffer pool has changed states: after the execution of the SELECT query, more data relevant to this query has been cached, so Bao predicts that it will execute faster.

Of course, predictions are exactly that -- predictions. While Bao's value model should get better over time, these predictions should be used as advice, and have no bounds whatsoever (although Bao will never predict a negative query runtime).

Let's look at q1 from our sample workload. You can copy and paste the below statement to see the EXPLAIN output.

EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));

This should result in something like this:

imdb=# EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));


                               QUERY PLAN                                                                                                                  
----------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------
 Bao prediction: 74053.945 ms
 Bao recommended hint: SET enable_nestloop TO off; 
 Aggregate  (cost=9644.44..9644.45 rows=1 width=8)
   ->  Gather  (cost=1006.53..9644.44 rows=1 width=0)
         Workers Planned: 1
         ->  Nested Loop  (cost=6.53..8644.34 rows=1 width=0)
               ->  Hash Join  (cost=6.10..8643.30 rows=1 width=4)
               ...

Here, Bao thinks the query plan generated by PostgreSQL will take a little over a minute to execute. Bao recommends a hint to disable nested loop joins for this query. Let's take the hint, and re-run the EXPLAIN:

imdb=# SET enable_nestloop TO off;
SET
imdb=# EXPLAIN SELECT COUNT(*) FROM title as t, kind_type as kt, info_type as it1, movie_info as mi1, cast_info as ci, role_type as rt, name as n, movie_keyword as mk, keyword as k, movie_companies as mc, company_type as ct, company_name as cn WHERE t.id = ci.movie_id AND t.id = mc.movie_id AND t.id = mi1.movie_id AND t.id = mk.movie_id AND mc.company_type_id = ct.id AND mc.company_id = cn.id AND k.id = mk.keyword_id AND mi1.info_type_id = it1.id AND t.kind_id = kt.id AND ci.person_id = n.id AND ci.role_id = rt.id AND (it1.id IN ('7')) AND (mi1.info in ('MET:','OFM:35 mm','PCS:Digital Intermediate','PFM:35 mm','PFM:Video','RAT:1.33 : 1','RAT:1.37 : 1')) AND (kt.kind in ('episode','movie','tv movie')) AND (rt.role in ('actor','actress')) AND (n.gender in ('f','m') OR n.gender IS NULL) AND (n.name_pcode_cf in ('A5362','J5252','R1632','R2632','W4525')) AND (t.production_year <= 2015) AND (t.production_year >= 1925) AND (cn.name in ('Fox Network','Independent Television (ITV)','Metro-Goldwyn-Mayer (MGM)','National Broadcasting Company (NBC)','Paramount Pictures','Shout! Factory','Sony Pictures Home Entertainment','Universal Pictures','Universal TV')) AND (ct.kind in ('distributors','production companies'));


                                        QUERY PLAN                                                                                                                    
           
----------------------------------------------------------------------------------
 Bao prediction: 15032.299 ms
 Bao recommended hint: SET enable_nestloop TO off; 
 Aggregate  (cost=10000977592.82..10000977592.83 rows=1 width=8)
   ->  Nested Loop  (cost=10000683246.19..10000977592.82 rows=1 width=0)
         ->  Seq Scan on info_type it1  (cost=0.00..2.41 rows=1 width=4)
               Filter: (id = 7)
               ...

For the new query plan, Bao predicts a time of only 15 seconds. If you'd like, you can execute both query plans to measure the quality of Bao's predictions.

Keep in mind that using Bao as we've configured it here won't let the value model get any better. In order to improve its model, Bao needs experience. For any given session, you can allow Bao to collect experience and thus improve its model, but still prevent Bao from modifying any query plans. To do this, just set enable_bao and enable_bao_rewards to ON, but set enable_bao_selection to OFF.

Final Notes

This tutorial was intended to give you an idea for what Bao could do and how the various components of Bao work. However, please note that the gains we saw in this tutorial might not be reflected in your real workload. Here's a few things to keep in mind:

This workload was artifically constructed to represent a mixture of queries where the PostgreSQL optimizer found the best plan, and where PostgreSQL found a terrible plan. Most queries fall somewhere in between these two extremes. Check out the Bao paper for a more detailed analysis on more realistic workloads.
Bao's query optimizer has overhead. For queries that run very quickly (under 500ms), Bao is unlikely to be helpful, and may just slow down these queries with additional optimization time.
Bao's value model needs to be retrained and verified, beyond what exploration mode could do. Most of the time, the verification done by Bao currently is sufficient. If training goes awry, Bao always saves the previous model. Training could easily be configured with a cron job. We're currently working on more advanced mechanisms for this, but currenlty it has to be done manually.

PostgreSQL Configuration Variables

Bao can be configured through a number of session level PostgreSQL configuration variables. These variables are set to their default values every time you open a session (e.g., psql session or a connection from an application).

Variable	Default	Description
`pg_bao.enable_bao`	OFF	Top-level switch for all Bao functions. When disabled, Bao operates purely in advisor mode.
`pg_bao.enable_bao_rewards`	ON	If this and `enable_bao` are on, Bao will report and record rewards from queries executed in this session.
`pg_bao.enable_bao_selection`	ON	If this and `enable_bao` are on, Bao will use its value model to select query plans at optimization time.
`pg_bao.bao_host`	localhost	Host where the Bao server is running. Can be changed to put the Bao server on a different machine than PostgreSQL.
`pg_bao.bao_port`	9381	Port to access the Bao server. If you change this, you may also need to change the port the Bao server listens on.
`pg_bao.bao_num_arms`	5	The number of arms Bao should examine during query optimization. Lower values decrease optimization time, higher values may lead to better query plans.
`pg_bao.bao_include_json_in_explain`	OFF	Includes the JSON of the Bao plan in the output of EXPLAIN. Used internally.

More information about each variable can be found in the pg_settings internal table:

SELECT name, extra_desc FROM pg_settings WHERE name LIKE '%bao%';

Bao Server Configuration Variables

The Bao sever is configured through the bao.cfg file in the bao_server directory. The default configuration file, reproduced below, contains a description of each variable.

[bao]
# ==============================================================
# BAO SERVER SETTINGS
# ==============================================================

# port to listen on. Note that the corresponding PostgreSQL
# variable, bao_port, must be set to match.
Port = 9381

# network address to listen on. If not localhost, don't forget
# to set the PostgreSQL bao_host variable.
ListenOn = localhost

# ==============================================================
# EXPLORATION MODE SETTINGS
# ==============================================================

# maximum time a query should reasonably take (used in
# exploration mode).
MaxQueryTimeSeconds = 120

# psycopg2 / JDBC connection string to access PostgreSQL
# (used by the experiment runner to prevent regressions)
PostgreSQLConnectString = user=imdb