Skip to main content

Data Catalog Overview

A Data Catalog is used to describe the attributes of a data source.

In Doris, multiple catalogs can be created to point to different data sources (such as Hive, Iceberg, MySQL). Doris will automatically obtain the databases, tables, columns, partitions, data locations, etc. of the corresponding data sources through the catalogs. Users can access these catalogs for data analysis through standard SQL statements and can conduct join queries on the data from multiple catalogs.

There are two types of catalogs in Doris:

TypeDescription
Internal CatalogThe built-in catalog, named internal, used to store Doris internal table data. It cannot be created, modified, or deleted.
External CatalogExternal catalogs refer to all catalogs other than the Internal Catalog. Users can create, modify, and delete external catalogs.

Catalogs are mainly applicable to the following three scenarios, but different catalogs are suitable for different scenarios. For details, see the documentation for the corresponding catalog.

ScenarioDescription
Query AccelerationDirect query acceleration for data lakes such as Hive, Iceberg, Paimon, etc.
Data IntegrationZeroETL solution, directly accessing different data sources to generate result data, or facilitating data flow between different data sources.
Data Write-backAfter data processing via Doris, write back to external data sources.

This document uses Iceberg Catalog as an example to focus on the basic operations of catalogs. For detailed descriptions of different catalogs, please refer to the documentation of the corresponding catalog.

Creating Catalog​

Create an Iceberg Catalog using the CREATE CATALOG statement.

CREATE CATALOG iceberg_catalog PROPERTIES (
'type' = 'iceberg',
'iceberg.catalog.type' = 'hadoop',
'warehouse' = 's3://bucket/dir/key',
's3.endpoint' = 's3.us-east-1.amazonaws.com',
's3.access_key' = 'ak',
's3.secret_key' = 'sk'
);

Essentially, a catalog created in Doris acts as a "proxy" to access the metadata services (such as Hive Metastore) and storage services (such as HDFS/S3) of the corresponding data source. Doris only stores connection properties and other information of the catalog, not the actual metadata and data of the corresponding data source.

Common Properties​

In addition to the set of properties specific to each catalog, here are the common properties for all catalogs {CommonProperties}.

Property NameDescriptionExample
include_database_listSupports synchronizing only specified Databases, separated by ,. By default, all Databases are synchronized. Database names are case-sensitive. Use this parameter when there are many Databases in the external data source but only a few need to be accessed, to avoid synchronizing a large amount of metadata.'include_database_list' = 'db1,db2'
exclude_database_listSupports specifying multiple Databases that do not need to be synchronized, separated by ,. By default, no filtering is applied, and all Databases are synchronized. Database names are case-sensitive. This is used in the same scenario as above, to exclude databases that do not need to be accessed. If there is a conflict, exclude takes precedence over include.'exclude_database_list' = 'db1,db2'

Column Type Mapping​

After a user creates a catalog, Doris automatically synchronizes the databases, tables, and columns of the catalog. For column type mapping rules of different catalogs, please refer to the documentation of the corresponding catalog.

For external data types that cannot currently be mapped to Doris column types, such as UNION, INTERVAL, etc., Doris will map the column type to UNSUPPORTED. For queries involving UNSUPPORTED types, see the example below:

Assume the synchronized table schema is:

k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT

The query behavior is as follows:

SELECT * FROM table;                -- Error: Unsupported type 'UNSUPPORTED_TYPE' in 'k3'
SELECT * EXCEPT(k3) FROM table; -- Query OK.
SELECT k1, k3 FROM table; -- Error: Unsupported type 'UNSUPPORTED_TYPE' in 'k3'
SELECT k1, k4 FROM table; -- Query OK.

Using Catalog​

Viewing Catalog​

After creation, you can view the catalog using the SHOW CATALOGS command:

mysql> SHOW CATALOGS;
+-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId | CatalogName | Type | IsCurrent | CreateTime | LastUpdateTime | Comment |
+-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+
| 10024 | iceberg_catalog | hms | yes | 2023-12-25 16:11:41.687 | 2023-12-25 20:43:18 | NULL |
| 0 | internal | internal | | NULL | NULL | Doris internal catalog |
+-----------+-----------------+----------+-----------+-------------------------+---------------------+------------------------+

You can view the statement to create a catalog using SHOW CREATE CATALOG.

Switching Catalog​

Doris provides the SWITCH statement to switch the connection session context to the corresponding catalog. This is similar to using the USE statement to switch databases.

After switching to a catalog, you can use the USE statement to continue switching to a specified database, or use SHOW DATABASES to view the databases under the current catalog.

SWITCH iceberg_catalog;

SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| test |
| iceberg_db |
+--------------------+

USE iceberg_db;

You can also use the USE statement with the fully qualified name catalog_name.database_name to switch directly to a specified database within a specified catalog:

USE iceberg_catalog.iceberg_db;

Fully qualified names can also be used in MySQL command line or JDBC connection strings to be compatible with the MySQL connection protocol.

# Command line tool
mysql -h host -P9030 -uroot -Diceberg_catalog.iceberg_db

# JDBC url
jdbc:mysql://host:9030/iceberg_catalog.iceberg_db

The fixed name for the built-in catalog is internal. The switching method is the same as for external catalogs.

Simple Query​

You can query tables in external catalogs using any SQL statement supported by Doris.

SELECT id, SUM(cost) FROM iceberg_db.table1
GROUP BY id ORDER BY id;

Cross-Catalog Query​

Doris supports join queries across different catalogs.

Here, let's create another MySQL Catalog:

CREATE CATALOG mysql_catalog properties(
'type' = 'jdbc',
'user' = 'root',
'password' = '123456',
'jdbc_url' = 'jdbc:mysql://host:3306/mysql_db',
'driver_url' = 'mysql-connector-java-8.0.25.jar',
'driver_class' = 'com.mysql.cj.jdbc.Driver'
);

Then, perform a join query between Iceberg tables and MySQL tables using SQL:

SELECT * FROM FROM
iceberg_catalog.iceberg_db.table1 tbl1 JOIN mysql_catalog.mysql_db.dim_table tbl2
ON tbl1.id = tbl2.id;

Data Import​

You can import data from a data source into Doris using the INSERT command.

INSERT INTO internal.doris_db.tbl1
SELECT * FROM iceberg_catalog.iceberg_db.table1;

You can also use the CTAS (Create Table As Select) statement to create an internal Doris table from an external data source and import the data:

CREATE TABLE internal.doris_db.tbl1
PROPERTIES('replication_num' = '1')
AS
SELECT * FROM iceberg_catalog.iceberg_db.table1;

Data Write-Back​

Doris supports writing data back to external data sources using the INSERT statement. For more details, refer to:

Refreshing Catalog​

Catalogs created in Doris act as "proxy" to access the metadata services of corresponding data sources. Doris caches some metadata to improve access performance and reduce frequent cross-network requests. However, the cache has a validity period, and without refreshing, you cannot access the latest metadata. Therefore, Doris provides several ways to refresh catalogs.

-- Refresh catalog
REFRESH CATALOG catalog_name;

-- Refresh specified database
REFRESH DATABASE catalog_name.db_name;

-- Refresh specified table
REFRESH TABLE catalog_name.db_name.table_name;

Doris also supports disabling metadata caching to access the latest metadata in real-time.

For detailed information and configuration of metadata caching, please refer to: Metadata Cache

Modifying Catalog​

You can modify the properties or name of a catalog using the ALTER CATALOG statement:

-- Rename a catalog
ALTER CATALOG iceberg_catalog RENAME iceberg_catalog2;

-- Modify properties of a catalog
ALTER CATALOG iceberg_catalog SET PROPERTIES ('key1' = 'value1' [, 'key' = 'value2']);

-- Modify the comment of a catalog
ALTER CATALOG iceberg_catalog MODIFY COMMENT 'my iceberg catalog';

Deleting Catalog​

You can delete a specified external catalog using the DROP CATALOG statement.

DROP CATALOG [IF EXISTS] iceberg_catalog;

Deleting an external catalog from Doris does not remove the actual data; it only deletes the mapping relationship stored in Doris.

Permission Management​

The permission management for databases and tables in an external catalog is the same as for internal tables. For details, refer to the Authentication and Authorization documentation.