1 帖子已标记为 "benchmark"

Benchmarking Databend using TPC-H

2022年8月8日 · 5 分钟读完

This post gives you a general idea about the TPC-H benchmark and explains how to run a TPC-H benchmark on Databend.

What's TPC-H?

TPC-H is a decision support benchmark. It consists of a suite of business-oriented ad hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

The TPC-H benchmark simulates a system for online sales of parts and components and defines eight tables in total. The structure, data volume, and mutual relationship of each table are shown in the figure below:

The benchmark workload consists of twenty-two decision support queries that must be executed as part of the TPC-H benchmark. Each TPC-H query asks a business question and includes the corresponding query to answer the question. More information about TPC-H can be found at https://www.tpc.org/tpch/.

Running TPC-H Benchmark on Databend

This section describes the steps to run the TPC-H benchmark on Databend and provides the related scripts.

Step 1: Generate test data with TPC-H Docker

The following code pulls a docker image and allocates the data in the path where you are running the TPC-H benchmark.

docker pull ghcr.io/databloom-ai/tpch-docker:main
docker run -it -v "$(pwd)":/data ghcr.io/databloom-ai/tpch-docker:main dbgen -vf -s 1

TPC-H comes with various data set sizes to test different scale factors. You can use the -s option to set scale factor in the command (for example, the code above sets the scale factor to 1). For more information about the command, see https://github.com/databloom-ai/TPCH-Docker.

SF (Gigabytes)	Size
1	Consists of the base row size (several million elements).
10	Consists of the base row size x 10.
100	Consists of the base row size x 100 (several hundred million elements).
1000	Consists of the base row size x 1000 (several billion elements).

Step 2: Create database and tables

CREATE DATABASE IF NOT EXISTS tpch;

USE tpch;

CREATE TABLE IF NOT EXISTS nation
  (
     n_nationkey INTEGER NOT NULL,
     n_name      VARCHAR NOT NULL,
     n_regionkey INT NOT NULL,
     n_comment   VARCHAR
  );

CREATE TABLE IF NOT EXISTS region
  (
     r_regionkey INT NOT NULL,
     r_name      VARCHAR NOT NULL,
     r_comment   VARCHAR
  );

CREATE TABLE IF NOT EXISTS part
  (
     p_partkey     INT NOT NULL,
     p_name        VARCHAR NOT NULL,
     p_mfgr        VARCHAR NOT NULL,
     p_brand       VARCHAR NOT NULL,
     p_type        VARCHAR NOT NULL,
     p_size        INT NOT NULL,
     p_container   VARCHAR NOT NULL,
     p_retailprice FLOAT NOT NULL,
     p_comment     VARCHAR NOT NULL
  );

CREATE TABLE IF NOT EXISTS supplier
  (
     s_suppkey   INT NOT NULL,
     s_name      VARCHAR NOT NULL,
     s_address   VARCHAR NOT NULL,
     s_nationkey INT NOT NULL,
     s_phone     VARCHAR NOT NULL,
     s_acctbal   FLOAT NOT NULL,
     s_comment   VARCHAR NOT NULL
  );

CREATE TABLE IF NOT EXISTS partsupp
  (
     ps_partkey    INT NOT NULL,
     ps_suppkey    INT NOT NULL,
     ps_availqty   INT NOT NULL,
     ps_supplycost FLOAT NOT NULL,
     ps_comment    VARCHAR NOT NULL
  );

CREATE TABLE IF NOT EXISTS customer
  (
     c_custkey    INT NOT NULL,
     c_name       VARCHAR NOT NULL,
     c_address    VARCHAR NOT NULL,
     c_nationkey  INT NOT NULL,
     c_phone      VARCHAR NOT NULL,
     c_acctbal    FLOAT NOT NULL,
     c_mktsegment VARCHAR NOT NULL,
     c_comment    VARCHAR NOT NULL
  );

CREATE TABLE IF NOT EXISTS orders
  (
     o_orderkey      INT NOT NULL,
     o_custkey       INT NOT NULL,
     o_orderstatus   VARCHAR NOT NULL,
     o_totalprice    FLOAT NOT NULL,
     o_orderdate     DATE NOT NULL,
     o_orderpriority VARCHAR NOT NULL,
     o_clerk         VARCHAR NOT NULL,
     o_shippriority  INT NOT NULL,
     o_comment       VARCHAR NOT NULL
  );

CREATE TABLE IF NOT EXISTS lineitem
  (
     l_orderkey      INT NOT NULL,
     l_partkey       INT NOT NULL,
     l_suppkey       INT NOT NULL,
     l_linenumber    INT NOT NULL,
     l_quantity      FLOAT NOT NULL,
     l_extendedprice FLOAT NOT NULL,
     l_discount      FLOAT NOT NULL,
     l_tax           FLOAT NOT NULL,
     l_returnflag    VARCHAR NOT NULL,
     l_linestatus    VARCHAR NOT NULL,
     l_shipdate      DATE NOT NULL,
     l_commitdate    DATE NOT NULL,
     l_receiptdate   DATE NOT NULL,
     l_shipinstruct  VARCHAR NOT NULL,
     l_shipmode      VARCHAR NOT NULL,
     l_comment       VARCHAR NOT NULL
  ); 

Step 3: Load test data to Databend

This step uses the HTTP API v1/streaming_load to load the test data to Databend. More information about this API can be found at https://databend.rs/doc/load-data/local.

The code below connects to Databend using the Root user. Please note that the root user only works when you access Databend from localhost. 您需要创建新用户并授予适当的权限，才能远程连接到Databend。

#!/bin/bash

for t in customer lineitem nation orders partsupp part region supplier
do
    echo "$t"
    curl -XPUT 'http://root:@127.0.0.1:8000/v1/streaming_load' -H "insert_sql: insert into tpch.'$t' file_format = (type = CSV field_delimiter = '|' record_delimiter = '\n')"d -F 'upload=@"./'$t'.tbl"'
done

Step 4: Run TPC-H queries

All the definitions of the TPC-H queries can be found at https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v3.0.1.pdf. You can simply run them by copying and pasting the scripts to Databend.

The Databend team ran the TPC-H benchmark around two months ago and uploaded their queries and results to GitHub. You can find them at https://github.com/datafuselabs/databend/tree/main/tests/suites/0_stateless/13_tpch. Please note that Databend now uses the new planner by default, so you DO NOT need to enable it any more before running the queries.

What's TPC-H?​

Running TPC-H Benchmark on Databend​

Step 1: Generate test data with TPC-H Docker​

Step 2: Create database and tables​

Step 3: Load test data to Databend​

Step 4: Run TPC-H queries​