Bao Server Setup

After configuring the PostgreSQL extension, we'll next setup the Bao server. The Bao server is responsible for loading, maintaining, and serving the learned query optimization model. For now, we'll assume that the Bao server will run on the same machine as PostgreSQL, although this is not required.

The Bao server is configured with the BaoForPostgreSQL/bao_server/bao.cfg configuration file, but the defaults should work fine for this tutorial. Before starting the server, several Python dependencies are required. If you are using the VM, these are already installed. On Arch Linux, these can be installed by:

pacman -S python-scikit-learn python-numpy python-joblib python-pytorch-opt

If you are using a different Linux distribution, the package names may be slightly different. If you'd prefer, you can also install these dependencies with pip:

pip3 install scikit-learn numpy joblib
pip3 install torch==1.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Once these dependencies are installed, we can launch the Bao server.

$ cd bao_server
$ python3 main.py
Listening on localhost port 9381
Spawning server process...

With the Bao server running, we can now test to see if PostgreSQL can connect to it. If you are not running the Bao server on the same node as PostgreSQL, you'll need to change ListenAddress in bao.cfg and set the PostgreSQL variables bao_host and bao_port.

# add -h localhost if you are connecting to the VM from your host machine
$ psql -U imdb 
psql (12.3)
Type "help" for help.

imdb=# SET enable_bao TO on;
SET
imdb=# EXPLAIN SELECT count(*) FROM title;
                                        QUERY PLAN                                        
------------------------------------------------------------------------------------------
 Bao prediction: NaN
 Bao recommended hint: (no hint)
 Finalize Aggregate  (cost=50166.51..50166.52 rows=1 width=8)
   ->  Gather  (cost=50166.29..50166.50 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=49166.29..49166.30 rows=1 width=8)
               ->  Parallel Seq Scan on title  (cost=0.00..46532.63 rows=1053463 width=0)
(7 rows)

If everything is setup correctly, you should see two lines in the EXPLAIN output related to Bao -- the prediction and the hint. Since Bao currently has no experience, it has no model, so there's no prediction or hint provided.

Next, we'll test to make sure Bao can correctly record feedback from PostgreSQL query executions. In the same session (with enable_bao set to ON), execute a query:

imdb=# SELECT count(*) FROM title;
  count  
---------
 2528312
(1 row)

If you look at the stdout of the Bao server, you should see a line like:

Logged reward of 2103.556027

This indicates that Bao has recorded a runtime of 2 seconds for this simple query plan. Note that Bao does not record your SQL queries, only a scrubbed version of your query plans. Bao keeps the type of each node, the estimated cost and cardinality, and the names of the involved relations. Bao never stores, for example, the predicate values from an executed query. Here's what Bao stores for the above query:

{
  "Plan": {
    "Node Type": "Other",
    "Node Type ID": "42",
    "Total Cost": 50166.515833,
    "Plan Rows": 1,
    "Plans": [
      {
        "Node Type": "Other",
        "Node Type ID": "45",
        "Total Cost": 50166.500833,
        "Plan Rows": 2,
        "Plans": [
          {
            "Node Type": "Other",
            "Node Type ID": "42",
            "Total Cost": 49166.300833,
            "Plan Rows": 1,
            "Plans": [
              {
                "Node Type": "Seq Scan",
                "Node Type ID": "19",
                "Relation Name": "title",
                "Total Cost": 46532.633333,
                "Plan Rows": 1053463
              }
            ]
          }
        ]
      }
    ]
  },
  "Buffers": {
    "title_pkey": 1,
    "kind_id_title": 1
  }
}