Cluster Upgrade

Apache Doris supports rolling upgrades by progressively replacing the binaries on FE and BE nodes, which minimizes downtime and keeps the cluster available during the upgrade. This document is for cluster administrators and covers version compatibility rules, the metadata compatibility test procedure, and the detailed upgrade steps.

Applicable Scenarios

Scenario	Applicable	Notes
Patch-version upgrade within the same minor version (for example, 2.1.3 to 2.1.7)	Yes	Rolling upgrade can be applied directly
Minor-version upgrade across minor versions (for example, 3.0 to 3.3)	Yes, step by step	Upgrade in sequence: 3.0 to 3.1, 3.1 to 3.2, 3.2 to 3.3
Major-version upgrade (for example, 2.x to 3.x)	Yes, step by step	Directly upgrading across major versions is not recommended; upgrade minor version by minor version
Upgrade of a single-FE cluster	Yes, after compatibility testing	Scale out to a 3-FE high-availability deployment first, or run a metadata compatibility test before upgrading
Upgrade across versions where metadata may be incompatible	Yes, after compatibility testing	Metadata compatibility must be verified before the upgrade

Prerequisites

Before performing the upgrade, confirm the following:

You have read the Release Notes of the target version and understand the behavioral changes and compatibility between versions.
Data is stored with 3 replicas, so that an upgrade failure does not cause data loss.
Client tasks have retry logic in place (see "Upgrade Considerations" below).
A development machine or BE node is available for the metadata compatibility test.
The installation package of the target version has been downloaded and extracted (in the steps below, ${DORIS_NEW_HOME} refers to the new version's root directory, and ${DORIS_OLD_HOME} refers to the root directory of the old version running in production).

Version Compatibility

A Doris version number has three digits. The first digit is the major milestone version, the second digit is the feature version, and the third digit is the bug-fix version. No new features are released in third-digit versions. Take Doris 2.1.3 as an example:

Position	Example value	Meaning
First digit	2	The 2nd milestone version
Second digit	1	The feature version under that milestone
Third digit	3	The 3rd bug-fix version under that feature version

Follow these rules when upgrading:

Upgrade type	Cross-version supported	Recommended path
Third-digit version (same second-digit version)	Supported	Upgrade directly, for example 2.1.3 to 2.1.7
Second-digit version	Cross-version not recommended	Upgrade second-digit version by second-digit version, for example 3.0 to 3.1 to 3.2 to 3.3
First-digit version	Cross-version not recommended	First upgrade to the latest second-digit version under the same first-digit version, then upgrade across the major version

For detailed version information, see Versioning rules.

Upgrade Considerations

Pay attention to the following three items before upgrading:

Consideration	How to handle
Behavioral changes between versions	Read the Release Notes of the target version before upgrading and confirm whether there are incompatible behavioral changes
Client task retries	Nodes are restarted in sequence during the upgrade, so retry logic must be added for Stream Load and query tasks. Routine Load, the Flink Doris Connector, and the Spark Doris Connector have built-in retries, so no extra handling is required
Replica repair and balance	Replica repair and balance must be disabled before the upgrade. Whether the upgrade succeeds or not, they must be re-enabled after the upgrade is complete

Note

A Doris upgrade only requires replacing the /bin and /lib directories under the FE directory and the /bin and /lib directories under the BE directory.

In versions 2.0.2 and later, a custom_lib/ directory has been added under both the FE and BE deployment paths (create it manually if it does not exist). It is used to store user-defined third-party jar files (such as hadoop-lzo-*.jar and orai18n.jar). This directory does not need to be replaced during the upgrade.

Metadata Compatibility Test

The metadata compatibility test verifies before the upgrade that the new version can load the existing metadata correctly, which prevents data loss from an upgrade failure. It is recommended to run this test before every upgrade.

Note

In production, it is recommended to keep 3 or more FE nodes in a high-availability configuration. If there is only 1 FE node, the metadata compatibility test must be run before the upgrade. Metadata compatibility is critical: if an upgrade fails because of metadata incompatibility, data loss is possible.

Also note the following:

It is recommended to run the metadata compatibility test on a development machine or a BE node, and to avoid running it on an FE node.
If the test can only be run on an FE node, choose a non-Master node and stop its original FE process first.

1. Back Up Metadata

Before starting the upgrade, back up the metadata of the Master FE node.

The Master FE node can be identified by the IsMaster column in show frontends. The FE metadata can be backed up hot, without stopping the FE process. By default, the FE metadata is stored in the fe/doris-meta directory; the metadata directory can also be confirmed through the meta_dir parameter in fe.conf.

2. Modify the FE Configuration File for Testing

Edit fe.conf in the test environment:

vi ${DORIS_NEW_HOME}/conf/fe.conf

Set all ports to values different from the production ones, and also modify the clusterId parameter:

...
## modify port
http_port = 18030
rpc_port = 19020
query_port = 19030
arrow_flight_sql_port = 19040
edit_log_port = 19010

## modify clusterIP
clusterId=<a_new_clusterID, such as 123456>
...

Example ports for the test environment:

Parameter	Example value	Description
`http_port`	18030	FE HTTP service port
`rpc_port`	19020	FE Thrift Server port
`query_port`	19030	FE MySQL protocol query port
`arrow_flight_sql_port`	19040	Arrow Flight SQL port
`edit_log_port`	19010	FE BDBJE communication port
`clusterId`	123456	Test cluster ID, must differ from production

3. Copy the Master FE Metadata

Copy the backed-up Master FE metadata to the new compatibility-test environment:

cp ${DORIS_OLD_HOME}/fe/doris-meta/* ${DORIS_NEW_HOME}/fe/doris-meta

4. Modify the cluster_id in the Metadata VERSION File

Modify the cluster_id in the VERSION file under the copied metadata directory to the new cluster ID, for example 123456 as in the example above:

vi ${DORIS_NEW_HOME}/fe/doris-meta/image/VERSION
clusterId=123456

5. Start the FE Process in the Test Environment

sh ${DORIS_NEW_HOME}/bin/start_fe.sh --daemon --metadata_failure_recovery

In versions before 2.0.2, add metadata_failure_recovery=true to fe.conf first, then start the FE process:

echo "metadata_failure_recovery=true" >> ${DORIS_NEW_HOME}/conf/fe.conf
sh ${DORIS_NEW_HOME}/bin/start_fe.sh --daemon

6. Verify That the FE Started Successfully

Connect to the test FE through a MySQL client, using query_port 19030 from the example above:

mysql -uroot -P19030 -h127.0.0.1

If the connection succeeds, the new version can load the current metadata correctly, and the metadata compatibility test passes.

Upgrade Flow Overview

The full upgrade flow is as follows, and the steps must be executed strictly in order:

Disable replica repair and balance.
Upgrade the BE nodes (a multi-replica cluster can use a canary upgrade).
Upgrade the FE nodes (upgrade Observer/Follower nodes first, and the Master last).
Re-enable replica repair and balance.

The overall principle is upgrade BE first, then upgrade FE. When upgrading FE, upgrade the Observer FE and Follower FE nodes first, and upgrade the Master FE node last.

Upgrade Steps

Step 1: Disable Replica Repair and Balance

Nodes are restarted during the upgrade, which may trigger unnecessary cluster balancing and replica repair. Disable the following three configurations first:

admin set frontend config("disable_balance" = "true");
admin set frontend config("disable_colocate_balance" = "true");
admin set frontend config("disable_tablet_scheduler" = "true");

The configurations involved are:

Configuration	Value before upgrade	Effect
`disable_balance`	`true`	Disables replica balancing, preventing replica migration triggered by node restarts
`disable_colocate_balance`	`true`	Disables replica balancing for Colocate Join tables
`disable_tablet_scheduler`	`true`	Disables tablet scheduling, preventing replica repair

Step 2: Upgrade BE Nodes

Note

To ensure data safety, store data with 3 replicas to avoid data loss caused by upgrade mis-operations or failures.

In a multi-replica cluster, you can pick one BE node for a canary upgrade. After it is verified, upgrade the remaining nodes one by one.

2.1 Stop the BE Node to Be Upgraded

sh ${DORIS_OLD_HOME}/be/bin/stop_be.sh

2.2 Back Up the Existing `/bin` and `/lib` Directories

mv ${DORIS_OLD_HOME}/be/bin ${DORIS_OLD_HOME}/be/bin_back
mv ${DORIS_OLD_HOME}/be/lib ${DORIS_OLD_HOME}/be/lib_back

2.3 Copy the New Version's `/bin` and `/lib` Directories

cp -r ${DORIS_NEW_HOME}/be/bin ${DORIS_OLD_HOME}/be/bin
cp -r ${DORIS_NEW_HOME}/be/lib ${DORIS_OLD_HOME}/be/lib

2.4 Start the BE Node

sh ${DORIS_OLD_HOME}/be/bin/start_be.sh --daemon

2.5 Verify the Upgrade

Connect to the cluster and check the node information:

show backends\G

If the Alive status of the BE node is true and the Version value is the new version, the node has been upgraded successfully. After confirming, upgrade the remaining BE nodes one by one in the same way.

Step 3: Upgrade FE Nodes

When there are multiple FE nodes, upgrade the non-Master nodes (Observer or Follower) first, and upgrade the Master node last after all the others are done.

3.1 Stop the FE Node to Be Upgraded

sh ${DORIS_OLD_HOME}/fe/bin/stop_fe.sh

3.2 Back Up the Existing Directories

Back up the three directories /bin, /lib, and /mysql_ssl_default_certificate:

mv ${DORIS_OLD_HOME}/fe/bin ${DORIS_OLD_HOME}/fe/bin_back
mv ${DORIS_OLD_HOME}/fe/lib ${DORIS_OLD_HOME}/fe/lib_back
mv ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate_back

3.3 Copy the New Version's Directories

cp -r ${DORIS_NEW_HOME}/fe/bin ${DORIS_OLD_HOME}/fe/bin
cp -r ${DORIS_NEW_HOME}/fe/lib ${DORIS_OLD_HOME}/fe/lib
cp -r ${DORIS_NEW_HOME}/fe/mysql_ssl_default_certificate ${DORIS_OLD_HOME}/fe/mysql_ssl_default_certificate

3.4 Start the FE Node

sh ${DORIS_OLD_HOME}/fe/bin/start_fe.sh --daemon

3.5 Verify the Upgrade

Connect to the cluster and check the node information:

show frontends\G

If the Alive status of the FE node is true and the Version value is the new version, the node has been upgraded successfully.

3.6 Upgrade the Remaining FE Nodes One by One

Upgrade the other non-Master FE nodes one by one in the same way, and upgrade the Master FE node last.

Step 4: Re-enable Replica Repair and Balance

After the upgrade is complete and the Alive status of all BE nodes has returned to true, re-enable cluster replica repair and balance:

admin set frontend config("disable_balance" = "false");
admin set frontend config("disable_colocate_balance" = "false");
admin set frontend config("disable_tablet_scheduler" = "false");

FAQ

Q: After the upgrade, a BE / FE node shows `Alive` as `false` and `Version` is still the old version. What should I do?

The process did not start, or the new version failed to start. Check the be.log / fe.log error logs and confirm that /bin and /lib were replaced correctly.

Q: After the upgrade, the FE fails to start with a metadata incompatibility error. What should I do?

The cross-version jump was too large, or the metadata compatibility test was not run. Roll back to the old version, run the test following the "Metadata Compatibility Test" procedure in this document, and upgrade step by step if necessary.

Q: After the upgrade, jar files in `custom_lib/` are missing. What happened?

custom_lib/ was overwritten by mistake. Only replace /bin and /lib; custom_lib/ must not be replaced.

Q: Stream Load tasks fail during the upgrade. Why?

Node restarts interrupt client connections. Add retry logic on the client side. Routine Load and the Flink/Spark Doris Connectors have built-in retries.

Q: After the upgrade, the replica count is abnormal or replica migration happens frequently. What is wrong?

disable_balance / disable_tablet_scheduler were not disabled, or they were not re-enabled after the upgrade. Verify the enable/disable flow of these configurations: disable them before the upgrade and re-enable them after.

Q: A cross-minor-version upgrade failed. Why?

The versions were not upgraded in sequence. Roll back and follow a step-by-step path such as 3.0 to 3.1 to 3.2 to 3.3.

Q: A single-FE cluster upgrade failed and metadata was lost. What should I do?

The metadata compatibility test was not run. Before upgrading, scale out to a 3-FE high-availability deployment, or run the metadata compatibility test on a development machine or BE node.

Applicable Scenarios​

Prerequisites​

Version Compatibility​

Upgrade Considerations​

Metadata Compatibility Test​

1. Back Up Metadata​

2. Modify the FE Configuration File for Testing​

3. Copy the Master FE Metadata​

4. Modify the cluster_id in the Metadata VERSION File​

5. Start the FE Process in the Test Environment​

6. Verify That the FE Started Successfully​

Upgrade Flow Overview​

Upgrade Steps​

Step 1: Disable Replica Repair and Balance​

Step 2: Upgrade BE Nodes​

2.1 Stop the BE Node to Be Upgraded​

2.2 Back Up the Existing /bin and /lib Directories​

2.3 Copy the New Version's /bin and /lib Directories​

2.4 Start the BE Node​

2.5 Verify the Upgrade​

Step 3: Upgrade FE Nodes​

3.1 Stop the FE Node to Be Upgraded​

3.2 Back Up the Existing Directories​

3.3 Copy the New Version's Directories​

3.4 Start the FE Node​

3.5 Verify the Upgrade​

3.6 Upgrade the Remaining FE Nodes One by One​

Step 4: Re-enable Replica Repair and Balance​

FAQ​

Q: After the upgrade, a BE / FE node shows Alive as false and Version is still the old version. What should I do?​

Q: After the upgrade, the FE fails to start with a metadata incompatibility error. What should I do?​

Q: After the upgrade, jar files in custom_lib/ are missing. What happened?​

Q: Stream Load tasks fail during the upgrade. Why?​

Q: After the upgrade, the replica count is abnormal or replica migration happens frequently. What is wrong?​

Q: A cross-minor-version upgrade failed. Why?​

Q: A single-FE cluster upgrade failed and metadata was lost. What should I do?​

Applicable Scenarios

Prerequisites

Version Compatibility

Upgrade Considerations

Metadata Compatibility Test

1. Back Up Metadata

2. Modify the FE Configuration File for Testing

3. Copy the Master FE Metadata

4. Modify the cluster_id in the Metadata VERSION File

5. Start the FE Process in the Test Environment

6. Verify That the FE Started Successfully

Upgrade Flow Overview

Upgrade Steps

Step 1: Disable Replica Repair and Balance

Step 2: Upgrade BE Nodes

2.1 Stop the BE Node to Be Upgraded

2.2 Back Up the Existing `/bin` and `/lib` Directories

2.3 Copy the New Version's `/bin` and `/lib` Directories

2.4 Start the BE Node

2.5 Verify the Upgrade

Step 3: Upgrade FE Nodes

3.1 Stop the FE Node to Be Upgraded

3.2 Back Up the Existing Directories

3.3 Copy the New Version's Directories

3.4 Start the FE Node

3.5 Verify the Upgrade

3.6 Upgrade the Remaining FE Nodes One by One

Step 4: Re-enable Replica Repair and Balance

FAQ

Q: After the upgrade, a BE / FE node shows `Alive` as `false` and `Version` is still the old version. What should I do?

Q: After the upgrade, the FE fails to start with a metadata incompatibility error. What should I do?

Q: After the upgrade, jar files in `custom_lib/` are missing. What happened?

Q: Stream Load tasks fail during the upgrade. Why?

Q: After the upgrade, the replica count is abnormal or replica migration happens frequently. What is wrong?

Q: A cross-minor-version upgrade failed. Why?

Q: A single-FE cluster upgrade failed and metadata was lost. What should I do?