跳转到主要内容

Multiple Catalog

Summary

Allow users to maintain multiple catalogs for the databend.

Motivation

Databend organize data in three layers:

catalog -> database -> table
  • catalog: the biggest layer for databend, contains all databases and tables, provided by Catalog
  • database: the container of tables, provided by Database
  • table: the smallest unit in databend, provided by Table

By default, all databases and tables will be stored in default catalog (powered by metasrv).

Databend supports multiple catalogs now, but only in a static way.

To allow accessing the hive catalog, users need to configure hive inside databend-query.toml in this way:

[catalog]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"

Users can't add/alter/remove the catalogs during runtime.

By allowing users to maintain multiple catalogs for the databend, we can integrate more catalogs like iceberg more quickly.

Guide-level explanation

After this RFC has been implemented, users can create new catalogs like:

CREATE CATALOG my_hive
TYPE=HIVE
CONNECTION = (URL='<hive-meta-store>' THRIFT_PROTOCOL=BINARY);
SELECT * FROM my_hive.DB.table;

Besides, users can alter or drop a catalog:

DROP CATALOG my_hive;

Users can add more catalogs like:

CREATE CATALOG my_iceberg
TYPE=ICEBERG
CONNECTION = (URL='s3://my_bucket/path/to/iceberg');
SELECT * FROM my_iceberg.DB.table;

With this feature, users can join data from different catalogs now:

select
my_iceberg.DB.purchase_records.Client_ID,
my_iceberg.DB.purchase_records.Item,
my_iceberg.DB.purchase_records.QTY
from my_hive.DB.vip_info
inner join my_iceberg.DB.purchase_records
on my_hive.DB.vip_info.Client_ID = my_iceberg.DB.purchase_records.Client_ID;

Reference-level explanation

Databend has a framework for multiple catalogs now. The only change for us is to store catalog-related information in metasrv instead.

To make it possible to start a query without metasrv, we will also support configuring catalogs in config like:

[catalogs.my_hive]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"

[catalogs.my_iceberg]
URL = "s3://bucket"

Static catalogs will always be loaded from configs and can't be altered or dropped.

Drawbacks

无。

Rationale and alternatives

无。

Prior art

Presto

Presto is an open-source SQL query engine that's fast, reliable, and efficient at scale. It doesn't have persisted states, so all its connectors will be configured.

Take iceberg as an example:

connector.name=iceberg
hive.metastore.uri=hostname:port
iceberg.catalog.type=hive

While using:

USE iceberg.tpch;
CREATE TABLE IF NOT EXISTS ctas_nation AS (SELECT * FROM nation);
DESCRIBE ctas_nation;

Unresolved questions

无。

Future possibilities

Iceberg Catalog

Discussed in RFC Iceberg External Table

Delta Sharing Catalog

Discussed in Tracking issues of integration with delta sharing