跳转到主要内容

Semi-structured Data Type

Summary

Short comes of current DataType

  • DataType is an enum type, we must use specific type after matching. For example, if we want to create deserializer/serializer by DataType, we should always do matching. It does not mean that match is not necessary. If we want to add more and more functions to DataType, matching may be very annoyment.

  • DataType represented as enum type, we can't use it as generic argument.

  • DataType may involve some nested datatypes, such as DataType::Struct, but we put DataField inside DataType, it's logically unreasonable。

  • Hard to put attributes into enum based DataType, such as nullable attribute #3726 #3769

Too many concepts about column (Series/Column/Array)

  • DataColumn is an enum, including Constant(value) and Array(Series)
pub enum DataColumn {
// Array of values.
Array(Series),
// A Single value.
Constant(DataValue, usize),
}
  • Series is a wrap of SeriesTrait
pub struct Series(pub Arc<dyn SeriesTrait>);

Summary

Semi-structured data types are used to represent schemaless data formats, such as JSON, XML, and so on. In order to be compatible with Snowflake's SQL syntax, we support the following three semi-structured data types:

  • Variant: A tagged universal type, which can store values of any other type, including Object and Array.
  • Object: Used to represent collections of key-value pairs, where the key is a non-empty string, and the value is a value of Variant type.
  • Array: Used to represent dense or sparse arrays of arbitrary size, where the index is a non-negative integer (up to 2^31-1), and values are Variant types.

Since Object and Array can be regarded as a type of Variant, the following introduction mainly takes Variant as an example.

Examples

The following example shows how to create a table with VARIANT, ARRAY and OBJECT data types, insert and query some test data.

CREATE TABLE test_semi_structured (
var variant,
arr array,
obj object
);

INSERT INTO test_semi_structured (var, arr, obj)
SELECT 1, array_construct(1, 2, 3)
, parse_json(' { "key1": "value1", "key2": "value2" } ');

INSERT INTO test_semi_structured (var, arr, obj)
SELECT to_variant('abc')
, array_construct('a', 'b', 'c')
, parse_json(' { "key1": [1, 2, 3], "key2": ["a", "b", "c"] } ');


SELECT * FROM test_semi_structured;

+-------+-------------------+----------------------------------------------------+
| var | arr | obj |
+-------+-------------------+----------------------------------------------------+
| 1 | [ 1, 2, 3 ] | { "key1": "value1", "key2": "value2" } |
| "abc" | [ "a", "b", "c" ] | { "key1": [ 1, 2, 3 ], "key2": [ "a", "b", "c" ] } |
+-------+-------------------+----------------------------------------------------+

Design Details

Data storage format

In order to store the Variant type values in the parquet format file with schema, we need to do some conversion on the original raw value. We have the following two choices:

Store data in one column as JSON or binary JSON-like format

JSON (JavaScript Object Notation) is the most common semi-structured format that can represent arbitrarily complex hierarchical values. It is very suitable for representing this kind of semi-structured data. Data of type Variant can be encoded in JSON format and stored as a raw string value. The main disadvantage of the JSON format is that each access requires expensive parsing of the raw string, so there are several optimized binary JSON-like formats to improve parsing speed and single key access. For example, MongoDB and PostgreSQL use BSON and jsonb respectively to store data in JSON format. UBJSON is also a compatible format specification for binary JSON, it can provide universal compatibility, as easy of use as JSON while being faster and more efficient. All of these binary JSON formats have better performance, the only problem is they lack a good Rust implementation libraries.

Store each unique key of data in sub-columns

Despite JSON format can represent arbitrary data, in practice, JSON data is usually generated by the machine, so we can predict the Shema and structure. Based on this feature, we can extract and flatten each unique key in JSON data into multiple independent virtual sub-columns.

For example, suppose we have a column named tweet and store the following JSON data:

{"id":1, "date": "1/11", type: "story", "score": 3, "desc": 2, "title": "...", "url": "..."}
{"id":2, "date": "1/12", type: "poll", "score": 5, "desc": 2, "title": "..."}
{"id":3, "date": "1/13", type: "pollop", "score": 6, "poll": 2, "title": "..."}
{"id":4, "date": "1/14", type: "story", "score": 1, "desc": 1, "title": "...", "url": "..."}
{"id":5, "date": "1/15", type: "comment", "parent": 4, "text": "..."}
{"id":6, "date": "1/16", type: "comment", "parent": 1, "text": "..."}
{"id":7, "date": "1/17", type: "pollop", "score": 3, "poll": 2, "title": "..."}
{"id":8, "date": "1/18", type: "comment", "parent": 1, "text": "..."}

This column can be split into 10 virtual sub-columns: tweet.id, tweet.date, tweet.type, tweet.score, tweet.desc, tweet.title, tweet.url, tweet.parent, tweet.text, tweet.poll. The data type of each sub-column can also be automatically deducted from the value, then we can automatically create those sub-columns and insert the corresponding value.

The main advantage of this storage format is that it does not need to parse the raw JSON string when querying the data, which can greatly speed up the query processing. The disadvantage is that additional processing is required when inserting data, and the schema of data in each row is not exactly the same. In some scenarios with large differences, many sub-column data will be Null. In order to have good performance and balance in various scenarios, we can refer to the optimization algorithms introduced in the paper JSON Tiles.

From the perspective of performance, a better solution is to store data in binary JSON-like format and extract some frequently queried unique keys as sub-columns. However, in order to simplify development, we use the JSON format in the first version. Binary JSON-like format and separately stored sub-columns will be adopted in a future optimized version.

Data Types

Add three new values Variant, VariantArray, VariantObject to the enumeration TypeID, respectively support these three semi-structured data types. Since we now have a value called Array, we name the semi-structured Array type as VariantArray to distinguish from it. Define the corresponding structures for these types, and implement the trait DataType. The PhysicalTypeID corresponding to these types are String, the JSON value will be converted to a raw string for storage.

pub enum TypeID {
...
Variant
VariantArray
VariantObject
}

pub struct VariantType {}

pub struct VariantArrayType {}

pub struct VariantObjectType {}

Object Column

Currently Column is only implemented for fundamental types, custom structs or enumerations like serde_json::Value don't have a suitable Column implementation to store. Define ObjectColumn and MutableObjectColumn as generic structures to store custom data types, and implement trait Column and MutableColumn respectively. ObjectType can be any custom type of structure or enumerations, we can define JsonColumn by specified parameter as serde_json::Value. All the variant data will be automatically cast to serde_json::Value and generate a JsonColumn. Other custom data types like BitmapColumn can be supported easily in the future.

#[derive(Clone)]
pub struct ObjectColumn<T: ObjectType> {
values: Vec<T>,
}

#[derive(Debug)]
pub struct MutableObjectColumn<T: ObjectType> {
data_type: DataTypeImpl,
pub(crate) values: Vec<T>,
}

type JsonColumn = ObjectColumn<serde_json::Value>;

TODO

  • Use better storage formats to improve query performance