
JSON Optimization

Summary

This RFC describes the design of JSON performance optimization.

Motivation

Currently, Databend stores Semi-Structured data as raw JSON text and uses serde_json for serialization and deserialization. This has the following performance problems:

  1. JSON must be parsed every time it is used, and text parsing is very slow, especially for large, complex objects.
  2. Querying a single key path requires reading and parsing the whole JSON value, which wastes parsing time and memory.

In order to make queries on Semi-Structured data as fast as on other data types, this RFC introduces the following two proposals:

  • Automatically detect frequently queried key paths of JSON columns and extract them as virtual columns, achieving query performance similar to other columns.
  • Use JSONB instead of JSON as the underlying binary encoding format for Semi-Structured data types, which speeds up parsing and benefits key path access.

Guide-level explanation

None.

Reference-level explanation

Virtual Column

The schema of JSON can change arbitrarily. However, in practice, JSON data is often machine-generated and has a fairly rigid and predictable structure. Taking advantage of this, we can detect frequently queried JSON key paths and extract them as virtual columns to speed up query processing.

Collect frequently accessed JSON key paths

Extracting and storing virtual columns requires extra parsing and storage resources, so it is not appropriate to generate virtual columns for all key paths. We should only generate virtual columns for frequently queried JSON key paths.

In order to know which key paths are frequently queried, we use the FPGrowth algorithm to count the access frequency of JSON key paths. "FP" stands for frequent pattern; the algorithm is commonly used to calculate item frequencies and identify frequent items. Every time a user queries a key path of the JSON data, the access is recorded by the FPGrowth algorithm, which ultimately produces tree-like statistics. We can use these statistics to determine which key paths need virtual columns.
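
As a rough illustration of the bookkeeping involved, the sketch below replaces the full FPGrowth tree with a plain access counter. The PathStats type and its methods are invented for this example and are not part of Databend.

use std::collections::HashMap;

// Illustrative stand-in for the FPGrowth statistics: it only counts how often
// each JSON key path (e.g. "user:id") is accessed by queries.
struct PathStats {
    counts: HashMap<String, u64>,
}

impl PathStats {
    fn new() -> Self {
        Self { counts: HashMap::new() }
    }

    // Called every time a query touches a key path of a JSON column.
    fn record_access(&mut self, path: &str) {
        *self.counts.entry(path.to_string()).or_insert(0) += 1;
    }

    // Key paths whose access count exceeds the threshold become candidates
    // for virtual column extraction.
    fn frequent_paths(&self, threshold: u64) -> Vec<&str> {
        self.counts
            .iter()
            .filter(|&(_, &count)| count > threshold)
            .map(|(path, _)| path.as_str())
            .collect()
    }
}

fn main() {
    let mut stats = PathStats::new();
    for _ in 0..10 {
        stats.record_access("user:id");
    }
    stats.record_access("tags[0]");
    assert_eq!(stats.frequent_paths(5), vec!["user:id"]);
}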

Extract virtual columns

Extracting virtual columns is performed asynchronously, so it does not affect the performance of data insertion. Since Databend compacts table blocks periodically, we can extract virtual columns as an additional step of compaction when the column data type is JSON. In this way, the data already read for compaction can be reused, which reduces unnecessary read amplification.

The process of extracting virtual columns is as follows:

  1. Collect the key paths whose access counts, as reported by the FPGrowth algorithm, exceed the threshold.
  2. Parse the JSON data and infer a schema for all key paths, including the data type of each value and the number of rows it appears in.
  3. Check all key paths; if a key path has the same value data type in every row and is present in every row, extract its data as a virtual column and store it in a separate Parquet file.

Take the following JSON as an example.

{"id":1,"title":"a","user":{"id":1,"name":"a"},"tags":["t1","t2"]}
{"id":2,"title":"b","user":{"id":2,"name":"b"},"tags":["t3","t4"]}
{"id":3,"title":"c","user":{"id":3,"name":3}}
{"id":4,"title":"d","user":{"id":4,"name":4}}

The key paths id, title and user:id have a consistent data type in every row, so we can extract them as virtual columns. For user:name, the value data type is not the same in every row; for tags, the key is not present in every row. We don't generate virtual columns for them.
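
The following sketch illustrates steps 2 and 3 on the example above, using serde_json for parsing. The helper names (collect_paths, eligible_paths) and the ':' path separator are chosen only for this illustration.

use serde_json::Value;
use std::collections::HashMap;

// Coarse value type used for schema inference; illustrative only.
#[derive(PartialEq, Clone, Copy, Debug)]
enum JsonType {
    Null,
    Bool,
    Number,
    String,
    Array,
    Object,
}

fn json_type(v: &Value) -> JsonType {
    match v {
        Value::Null => JsonType::Null,
        Value::Bool(_) => JsonType::Bool,
        Value::Number(_) => JsonType::Number,
        Value::String(_) => JsonType::String,
        Value::Array(_) => JsonType::Array,
        Value::Object(_) => JsonType::Object,
    }
}

// Walk one JSON row and record the type of every leaf key path,
// using ':' as the path separator (e.g. "user:id").
fn collect_paths(prefix: &str, v: &Value, out: &mut HashMap<String, Vec<JsonType>>) {
    match v {
        Value::Object(map) => {
            for (k, child) in map {
                let path = if prefix.is_empty() { k.clone() } else { format!("{prefix}:{k}") };
                collect_paths(&path, child, out);
            }
        }
        _ => out.entry(prefix.to_string()).or_default().push(json_type(v)),
    }
}

// A path is eligible for a virtual column if it appears in every row
// and all of its values share one type.
fn eligible_paths(rows: &[Value]) -> Vec<String> {
    let mut stats: HashMap<String, Vec<JsonType>> = HashMap::new();
    for row in rows {
        collect_paths("", row, &mut stats);
    }
    stats
        .into_iter()
        .filter(|(_, types)| types.len() == rows.len() && types.iter().all(|t| *t == types[0]))
        .map(|(path, _)| path)
        .collect()
}

fn main() {
    let rows: Vec<Value> = [
        r#"{"id":1,"title":"a","user":{"id":1,"name":"a"},"tags":["t1","t2"]}"#,
        r#"{"id":2,"title":"b","user":{"id":2,"name":"b"},"tags":["t3","t4"]}"#,
        r#"{"id":3,"title":"c","user":{"id":3,"name":3}}"#,
        r#"{"id":4,"title":"d","user":{"id":4,"name":4}}"#,
    ]
    .iter()
    .map(|s| serde_json::from_str(s).unwrap())
    .collect();

    let mut paths = eligible_paths(&rows);
    paths.sort();
    // "user:name" is excluded (mixed types), "tags" is excluded (missing rows).
    assert_eq!(paths, vec!["id", "title", "user:id"]);
}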

Push down key path access to storage

There are two ways to access a JSON key path: dot notation, such as col:level1:level2, and bracket notation, such as col['level1'][0].

For example:

create table test (id int8, v json);
insert into test values(1, parse_json('{"k1":{"k2":"v"},"a":[0,1,2]}'));
select v:k1:k2, v['a'][0] from test;
+---------+-----------+
| v:k1:k2 | v['a'][0] |
+---------+-----------+
| "v" | 0 |
+---------+-----------+

Currently, a JSON key path access is converted to a get scalar function, which traverses the JSON by keys to fetch the value. In order to use virtual columns to improve performance, we need to modify the query plan and push the key path access down to the storage layer. Since we use JSONB instead of JSON, pushing down to the storage layer brings benefits even for key paths that don't have virtual columns.
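
A minimal sketch of what this rewrite could look like is given below. The KeyPathAccess enum, the resolve_path helper, and the "<column>:<path>" naming convention for virtual columns are all assumptions made for illustration, not Databend's actual planner API.

// Illustrative result of resolving a key path access during planning.
enum KeyPathAccess {
    // The path has an extracted virtual column: read it like a normal column.
    VirtualColumn { column_name: String },
    // No virtual column: push the path down so the storage layer can decode
    // only the needed part of the JSONB value.
    JsonbPushdown { source_column: String, path: Vec<String> },
}

fn resolve_path(source_column: &str, path: &[String], virtual_columns: &[String]) -> KeyPathAccess {
    // Hypothetical naming convention: a virtual column is named "<column>:<path>".
    let candidate = format!("{source_column}:{}", path.join(":"));
    if virtual_columns.iter().any(|c| c == &candidate) {
        KeyPathAccess::VirtualColumn { column_name: candidate }
    } else {
        KeyPathAccess::JsonbPushdown {
            source_column: source_column.to_string(),
            path: path.to_vec(),
        }
    }
}

fn main() {
    let virtual_columns = vec!["v:k1:k2".to_string()];
    let path = vec!["k1".to_string(), "k2".to_string()];
    match resolve_path("v", &path, &virtual_columns) {
        // v:k1:k2 has a virtual column, so the scan reads it directly.
        KeyPathAccess::VirtualColumn { column_name } => {
            println!("read virtual column {column_name}");
        }
        // Otherwise the path is decoded from JSONB at the storage layer.
        KeyPathAccess::JsonbPushdown { source_column, path } => {
            println!("decode {} from JSONB column {source_column}", path.join(":"));
        }
    }
}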

JSONB

JSONB stands for JSON Binary or JSON Better. It is an optimized binary format supported by PostgreSQL since v9.4; CockroachDB also implements JSONB based on the PostgreSQL design.

By storing data as JSONB instead of JSON, we can get the following benefits:

  • Parsing and querying JSONB is significantly faster.
  • Accessing a particular key path does not require reading the full value, which greatly speeds up queries.
  • External indexes can be added to further speed up queries.

Binary encoding

JSONB is a tree structure. Each node consists of three parts: a 32-bit header, several 32-bit JEntries, and the variable-length raw content.

  • The header is also known as the container header. It includes a 3-bit type (array, object or scalar) and a 28-bit count of the number of elements in the array or object.
  • Each JEntry has a 3-bit value type (true, false, null, string, number, array or object), a 28-bit field that provides the length or offset of the raw content, and a 1-bit flag that indicates whether a length or an offset is stored (a decoding sketch follows this list).
  • The last part is the raw content, which can be located by the length or offset in the JEntry.
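
The sketch below decodes a header and a JEntry following the bit widths described above. The concrete bit positions and type codes are assumptions for illustration and do not reproduce the exact PostgreSQL or CockroachDB on-disk layout.

// Bit layout following the description above: a 3-bit type, a 1-bit flag
// (JEntry only) and a 28-bit count or length/offset. The exact positions and
// type codes here are assumptions for the sketch.
const TYPE_MASK: u32 = 0b111 << 29;  // 3-bit container/value type
const OFF_FLAG: u32 = 1 << 28;       // JEntry: length vs. offset flag
const LEN_MASK: u32 = (1 << 28) - 1; // 28-bit count, length or offset

// Returns (container type, number of elements in the array/object).
fn decode_header(header: u32) -> (u32, u32) {
    ((header & TYPE_MASK) >> 29, header & LEN_MASK)
}

// Returns (value type, whether an offset is stored, length/offset).
fn decode_jentry(jentry: u32) -> (u32, bool, u32) {
    (
        (jentry & TYPE_MASK) >> 29,
        jentry & OFF_FLAG != 0,
        jentry & LEN_MASK,
    )
}

fn main() {
    // An object container with two key/value pairs, assuming type code 0b010 = object.
    let header: u32 = (0b010 << 29) | 2;
    assert_eq!(decode_header(header), (0b010, 2));

    // A string JEntry with the offset flag set and offset 7, assuming type code 0b011 = string.
    let jentry: u32 = (0b011 << 29) | (1 << 28) | 7;
    assert_eq!(decode_jentry(jentry), (0b011, true, 7));
}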

Take this JSON as an example; its JSONB encoding is shown below.

{"a":1,"b":[true,2,"v"]}
.--------------.    .-----------------------------------.
|    header    | -> |   type(object)   |  item nums(2)  |
.--------------.    .-----------------------------------.
| key1 JEntry  | -> | flag | val type(string) | len/off | -+
.--------------.    .-----------------------------------.  |
| key2 JEntry  | -> | flag | val type(string) | len/off | -+-+
.--------------.    .-----------------------------------.  | |
| val1 JEntry  | -> | flag | val type(number) | len/off | -+-+-+
.--------------.    .-----------------------------------.  | | |
| val2 JEntry  | -> | flag | val type(array)  | len/off | -+-+-+-+
.--------------.    .-----------------------------------.  | | | |
|     "a"      | <-----------------------------------------+ | | |
.--------------.                                             | | |
|     "b"      | <-------------------------------------------+ | |
.--------------.                                               | |
|      1       | <---------------------------------------------+ |
.--------------.                                                 |
| [true,2,"v"] | <-----------------------------------------------+
.--------------.
       |
       v
.--------------.    .-----------------------------------.
|    header    | -> |   type(array)    |  item nums(3)  |
.--------------.    .-----------------------------------.
| item1 JEntry | -> | flag | val type(bool true)        |
.--------------.    .-----------------------------------.
| item2 JEntry | -> | flag | val type(number) | len/off | -+
.--------------.    .-----------------------------------.  |
| item3 JEntry | -> | flag | val type(string) | len/off | -+-+
.--------------.    .-----------------------------------.  | |
|      2       | <-----------------------------------------+ |
.--------------.                                             |
|     "v"      | <-------------------------------------------+
.--------------.

For more detailed information, see the designs of PostgreSQL and CockroachDB.

Benefiting from the tree structure of JSONB, searching for a key path only needs to parse part of the data. In each node, object keys can be accessed in O(log(n)), since keys are sorted and binary search can be used, and array elements can be accessed in O(1), since the JEntry size is fixed.
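
To make the complexity argument concrete, here is a small sketch; the ObjectNode struct is an invented, already-decoded view of a JSONB object node, not the real in-memory representation.

// Simplified, already-decoded view of a JSONB object node. JSONB stores object
// keys sorted, which is what makes binary search possible.
struct ObjectNode {
    keys: Vec<String>,       // sorted key names
    value_offsets: Vec<u32>, // one length/offset per key, taken from the JEntries
}

impl ObjectNode {
    // O(log(n)): binary search over the sorted keys.
    fn find(&self, key: &str) -> Option<u32> {
        self.keys
            .binary_search_by(|k| k.as_str().cmp(key))
            .ok()
            .map(|i| self.value_offsets[i])
    }
}

// O(1): every JEntry is a fixed 4 bytes, so the JEntry of the i-th array
// element sits at a directly computable offset.
fn array_jentry_offset(header_size: usize, index: usize) -> usize {
    header_size + index * 4
}

fn main() {
    let node = ObjectNode {
        keys: vec!["a".to_string(), "b".to_string()],
        value_offsets: vec![0, 1],
    };
    assert_eq!(node.find("b"), Some(1));
    assert_eq!(array_jentry_offset(4, 2), 12);
}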

Drawbacks

  • Extracting virtual columns requires additional asynchronous tasks.
  • JSONB may take more disk space than JSON.

Rationale and alternatives

BSON is a binary JSON format used in MongoDB. Its main design goal is to store documents, so its header does not index key paths.

ION is another binary JSON format, developed by Amazon. It is optimized for reads and uses symbol tables as an index to speed up key path access. Its main problem is that its Rust implementation is still at an early stage and lacks some important trait support, such as serde.

There are also optimizations for plain-text JSON: simd-json uses SIMD instructions to speed up parsing, and Pison speeds up parsing by adding a bitmap index. But these optimize parsing of the full data and don't help speed up access by key path.

Prior art

The main idea of this RFC comes from the JSON Tiles paper. The experimental evaluation in the paper shows that these optimizations can effectively improve the read performance of JSON data.

Both PostgreSQL and CockroachDB use JSONB as an optimized JSON format; we can implement a Rust version of JSONB based on their designs.

Unresolved questions

None.

Future possibilities

  • Extract virtual columns for key paths whose values have different data types
  • Add extra indexes for JSONB