BigQuery
This document describes how to migrate data from Google BigQuery into Apache Doris. The overall workflow uses object storage as an intermediate medium:
- Use BigQuery's Export statement to export data to GCS (Google Cloud Storage).
- Create the corresponding target table in Doris.
- Use Doris's S3 Load feature to pull the data from GCS into Doris. For details on the mechanism, see S3 Load.
Notes
Before starting the migration, pay attention to the following items to avoid common issues:
- Table schema design: Choose the appropriate Doris data model, as well as partitioning and bucketing strategies based on the BigQuery table schema. For more table creation strategies, see Load best practices.
- JSON type export: BigQuery does not support exporting JSON types in Parquet format. Use JSON format for the export instead.
- Time type export: When BigQuery exports a Time type, cast it to String before exporting.
- Complex type import: For Parquet/ORC files containing complex types (Struct/Array/Map), you must currently use TVF Load.
Data type mapping
Before migrating, create columns in Doris that correspond to the BigQuery fields according to the following rules:
| BigQuery | Doris | Notes |
|---|---|---|
| Array | Array | |
| BOOLEAN | BOOLEAN | |
| DATE | DATE | |
| DATETIME/TIMESTAMP | DATETIME | |
| JSON | JSON | |
| INT64 | BIGINT | |
| NUMERIC | DECIMAL | |
| FLOAT64 | DOUBLE | |
| STRING | VARCHAR/STRING | VARCHAR maximum length is 65535 |
| STRUCT | STRUCT | |
| TIME | STRING | Cast to String before exporting |
| OTHER | UNSUPPORTED | Types not currently supported |
Migration steps
Step 1: Create the target table in Doris
Before migrating a BigQuery table to Doris, first create a Doris table with a matching schema.
Assume the following table and sample data already exist in BigQuery:
CREATE OR REPLACE TABLE test.sales_data (
order_id INT64,
customer_name STRING,
order_date DATE,
amount NUMERIC(10,2),
country STRING
)
PARTITION BY order_date;
INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES
(1, 'Alice', '2025-04-08', 99.99, 'USA'),
(2, 'Bob', '2025-04-08', 149.50, 'Canada'),
(3, 'Charlie', '2025-04-09', 75.00, 'UK'),
(4, 'Diana', '2025-04-10', 200.00, 'Australia');
Based on this schema, you can create a unique key partitioned table in Doris. The partition column matches the BigQuery partition column, and the table is partitioned by day:
CREATE TABLE `sales_data` (
order_id INT,
order_date DATE NOT NULL,
customer_name VARCHAR(128),
amount DECIMAL(10,2),
country VARCHAR(48)
) ENGINE=OLAP
UNIQUE KEY(`order_id`,`order_date`)
PARTITION BY RANGE(`order_date`) (
PARTITION p20250408 VALUES [('2025-04-08'), ('2025-04-09')),
PARTITION p20250409 VALUES [('2025-04-09'), ('2025-04-10')),
PARTITION p20250410 VALUES [('2025-04-10'), ('2025-04-11'))
)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 16
PROPERTIES (
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.end" = "5",
"dynamic_partition.prefix" = "p",
"dynamic_partition.buckets" = "16",
"replication_num" = "1"
);
Step 2: Export BigQuery data to GCS
2.1 Export Parquet files with the Export statement
Use BigQuery's EXPORT DATA to export the target table to GCS:
EXPORT DATA
OPTIONS (
uri = 'gs://mybucket/export/sales_data/*.parquet',
format = 'PARQUET')
AS (
SELECT *
FROM test.sales_data
);
2.2 Inspect the exported files on GCS
The command above exports sales_data to GCS. Each partition produces one or more files with incrementing file names. For details, see exporting-data.

Step 3: Import data into Doris with S3 Load
The import uses S3 Load. S3 Load is an asynchronous data import method. After the job is submitted, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including AWS S3, GCS, and Azure.
This method is suitable for scenarios with large data volumes that need asynchronous background processing. For data imports that need synchronous processing, see TVF Load.
Note: For Parquet/ORC files containing complex types (Struct/Array/Map), you must currently use TVF Load.
3.1 Import data from a single file
Use the following LOAD LABEL statement to import a single Parquet file from GCS:
LOAD LABEL sales_data_2025_04_08
(
DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet")
INTO TABLE sales_data
FORMAT AS "parquet"
(order_id, order_date, customer_name, amount, country)
)
WITH S3
(
"provider" = "GCP",
"s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com",
"s3.region" = "asia-southeast1",
"s3.access_key" = "<ak>",
"s3.secret_key" = "<sk>"
);
3.2 Check job status with Show Load
Because S3 Load is submitted asynchronously, use SHOW LOAD to query the import status for a given label:
mysql> show load where label = "label_sales_data_2025_04_08"\G
*************************** 1. row ***************************
JobId: 17956078
Label: label_sales_data_2025_04_08
State: FINISHED
Progress: 100.00% (1/1)
Type: BROKER
EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2
TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL
ErrorMsg: NULL
CreateTime: 2025-04-10 17:50:53
EtlStartTime: 2025-04-10 17:50:54
EtlFinishTime: 2025-04-10 17:50:54
LoadStartTime: 2025-04-10 17:50:54
LoadFinishTime: 2025-04-10 17:50:54
URL: NULL
JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620}
TransactionId: 766228
ErrorTablets: {}
User: root
Comment:
1 row in set (0.00 sec)
3.3 Handle errors during import
When multiple import jobs exist, use the following statement to query failed imports and their causes:
mysql> show load where state='CANCELLED' and label like "label_test%"\G
*************************** 1. row ***************************
JobId: 18312384
Label: label_test123
State: CANCELLED
Progress: 100.00% (3/3)
Type: BROKER
EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0
TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL
ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel
CreateTime: 2025-04-15 17:32:59
EtlStartTime: 2025-04-15 17:33:02
EtlFinishTime: 2025-04-15 17:33:02
LoadStartTime: 2025-04-15 17:33:02
LoadFinishTime: 2025-04-15 17:33:02
URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342
JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839}
TransactionId: 769213
ErrorTablets: {}
User: root
Comment:
The example above is a data quality error (ETL_QUALITY_UNSATISFIED). To get the specific cause, visit the returned URL. For example, the error below indicates that the data exceeds the actual length of the country column in the table schema:
[root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342"
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line [];
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line [];
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line [];
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line [];
For data quality errors, if skipping some erroneous rows is acceptable, set the error tolerance ratio in the S3 Load PROPERTIES. For details on the parameters, see Load configuration parameters.
3.4 Import data from multiple files
When migrating large volumes of historical data, use a batched import strategy:
- Batch division: Each batch corresponds to one Doris partition or a small number of partitions.
- Per-batch data volume: No more than 100 GB per batch is recommended, to reduce system pressure and lower the cost of retries after a failed import.
- Batch script: See s3_load_file_demo.sh. This script splits the file list under a specified directory in object storage and submits multiple S3 Load jobs to Doris in batches, enabling batch import.
FAQ
Q1: Why can JSON types in BigQuery not be exported directly as Parquet?
BigQuery does not currently support exporting JSON types in Parquet format. Set format to JSON for the export, and use a corresponding JSON column in Doris to receive the data.
Q2: How does the BigQuery Time type map to Doris?
The BigQuery Time type must be cast to String during export, and stored in a STRING column in Doris.
Q3: What should I do when an import fails with ETL_QUALITY_UNSATISFIED?
This error indicates that the data does not meet quality requirements. Handle it as follows:
- Look at the
URLfield in theSHOW LOADresult and usecurlto fetch the detailed error log. - Identify the cause from the log (common causes include column length overflow and type mismatch).
- Adjust the target table schema, or set
max_filter_ratioin the S3 LoadPROPERTIESto skip some erroneous rows.
Q4: Can Parquet/ORC files containing Struct/Array/Map be imported with S3 Load?
No. Files with these complex types must currently be imported with TVF Load.
Q5: How can I avoid the retry cost caused by single-job failures when migrating large volumes of historical data?
Import in batches by partition, with no more than 100 GB per batch, and use the s3_load_file_demo.sh script to split and submit batches.