Blog/Best Practice

Best practice of Apache Doris in Xiaomi Group

Apache Doris2022年7月20日

In order to improve the query performance of the Xiaomi growth analysis platform and reduce the operation and maintenance costs, Xiaomi Group introduced Apache Doris in September 2019. In the past two and a half years, Apache Doris has been widely used in Xiaomi Group, such as business growth analytic platform, realtime dashboards for all business groups, finance analysis, user profile analysis, advertising reports, A/B testing platform and so on. This article will share the best practice of Apache Doris in Xiaomi Group.

Business Practice

The typical business practices of Apache Doris in Xiaomi are as follows:

01 User Access

Data Factory is a one-stop data development platform developed by Xiaomi for data developers and data analysts. This platform supports data sources such as Doris, Hive, Kudu, Iceberg, ES, Talso, TiDB, MySQL, etc. It also supports computing engines such as Flink, Spark, Presto,etc.

Inside Xiaomi, users need to access the Doris service through the data factory. Users need to register in the data factory and complete the approval for building the database. The Doris operation and maintenance classmates will connect according to the descriptions of the business scenarios and data usage expectations submitted by users in the data factory. After completing the access approval, users can use the Doris service to perform operations such as visual table creation and data import in the data factory.

02 Data import

In Xiaomi's business, the two most common ways to import data into Doris are Stream Load and Broker Load. User data will be divided into real-time data and offline data, and users' real-time and offline data will generally be written to Talos first (Talos is a distributed, high-throughput message queue developed by Xiaomi). The offline data from Talos will be sink to HDFS, and then imported to Doris through the data factory. Users can directly submit Broker Load tasks in the data factory to import large batches of data on HDFS into Doris, In addition, you can run the SparkSQL command in the data factory to query data from Hive, Import the data found in SparkSQL into Doris through Spark-doris-Connector, and encapsulate Stream Load at the bottom layer of Spark-doris-Connector. Real-time data from Talos is generally imported into Doris in two ways. One is to first perform ETL on the data through Flink, and then import small batches of data to Doris through.Flink- Doris-connector encapsulates the Stream Load at the bottom layer. Another way is to import small batches of data into Doris through Stream Load encapsulated by Spark Streaming at regular intervals.

03 Data Query

Doris users of Xiaomi generally analyze and query Doris and display the results through the ShuJing platform.ShuJing is a general-purpose BI analysis tool developed by Xiaomi. Users can query and visualize Doris through ShuJing platform, and realize user behavior analysis (in order to meet the needs of business event analysis, retention analysis, funnel analysis, path analysis and other behavior analysis needs, We added corresponding UDF and UDAF ) and user profile analysis for Doris.

04 Compaction Tuning

For Doris, each data import will generate a data version under the relevant data shard (Tablet) of the storage layer, and the Compaction mechanism will asynchronously merge the smaller data versions generated by the import (the detailed principle of the Compaction mechanism can be Refer to the previous article "Doris Compaction Mechanism Analysis").

Xiaomi has many high-frequency, high-concurrency, near-real-time import business scenarios, and a large number of small versions will be generated in a short period of time. If Compaction does not merge data versions in time, it will cause version accumulation.On the one hand, too many minor versions will increase the pressure on metadata, and on the other hand, too many versions will affect query performance.In Xiaomi's usage scenarios, many tables use the Unique and Aggregate data models, and the query performance is heavily dependent on whether Compaction can merge data versions in time.In our business scenario, the query performance was reduced by tens of times due to delayed version merging, thus affecting online services.When a Compaction happens, it consumes CPU, memory, and disk I/O resources. Too much compaction will take up too many machine resources, affect query performance, and may cause OOM.

In response to this problem of Compaction, we first start from the business side and guide users through the following aspects:

Set reasonable partitions and buckets for tables to avoid generating too many data fragments.
Standardize the user's data import operation, reduce the frequency of data import, increase the amount of data imported in a single time, and reduce the pressure of Compaction.
Avoid using delete operations too much.The delete operation will generate a delete version under the relevant data shard in the storage layer.The Cumulative Compaction task will be truncated when the delete version is encountered. This task can only merge the data version after the Cumulative Point and before the delete version, move the Cumulative Point to the delete version, and hand over the delete version to the subsequent Base Compaction task. to process. If you use the delete operation too much, too many delete versions will be generated under the Tablet, which will cause the Cumulative Compaction task to slow down the progress of version merging. Using the delete operation does not actually delete the data from the disk, but records the deletion conditions in the delete version. When the data is queried, the deleted data will be filtered out by Merge-On-Read. Only the delete version is merged by the Base Compaction task. After that, the data to be deleted by the delete operation can be cleared from the disk as expired data with the Stale Rowset. If you need to delete the data of an entire partition, you can use the truncated partition operation instead of the delete operation.

Second, we tuned Compaction from the operation and maintenance side:

According to different business scenarios, different Compaction parameters (Compaction strategy, number of threads, etc.) are configured for different clusters.
Appropriately lowers the priority of the Base Compaction task and increases the priority of the Cumulative Compaction task, because the Base Compaction task takes a long time to execute and has serious write amplification problems, while the Cumulative Compaction task executes faster and can quickly merge a large number of small versions.
Version backlog alarm, dynamic adjustment of Compaction parameters.When the Compaction Producer produces Compaction tasks, it will update the corresponding metric.It records the value of the largest Compaction Score on the BE node. You can check the trend of this indicator through Grafana to determine whether there is a version backlog. In addition, we have added a Version backlog alert.In order to facilitate the adjustment of Compaction parameters, we have optimized the code level to support dynamic adjustment of the Compaction strategy and the number of Compaction threads at runtime, avoiding the need to restart the process when adjusting the Compaction parameters.
Supports manual triggering of the Compaction task of the specified Table and data shards under the specified Partition, and improves the Compaction priority of the specified Table and data shards under the specified Partition.

Monitoring and Alarm Management

01 Monitoring System

Prometheus will regularly pull Metrics metrics from Doris's FE and BE and display them in the Grafana monitoring panel.The service metadata based on QingZhou Warehouse will be automatically registered in Zookeeper, and Prometheus will regularly pull the latest cluster metadata information from Zookeeper and display it dynamically in the Grafana monitoring panel.（Qingzhou Data Warehouse is a data warehouse constructed by the Qingzhou platform based on the operation data of Xiaomi's full-scale big data service. It consists of 2 base tables and 30+ dimension tables.Covers the whole process data such as resources, server cmdb, cost, process status and so on when big data components are running）We have also added statistics and display boards for common troubleshooting data such as Doris large query list, real-time write data volume, data import transaction numbers, etc. in Grafana.In Grafana, we also added statistics and display boards for common troubleshooting data such as the Doris big query list, the amount of real-time data written, and the number of data import transactions, so that alarms can be linked. When the cluster is abnormal, Doris' operation and maintenance students can locate the cause of the cluster failure in the shortest time.

02 Falcon

Falcon is a monitoring and alarm system widely used inside Xiaomi.Because Doris provides a relatively complete metrics interface, which can easily provide monitoring functions based on Prometheus and Grafana, we only use Falcon's alarm function in the Doris service.For different levels of faults in Doris, we define alarms as three levels of P0, P1 and P2:

P2 alarm (alarm level is low): single node failure alarm.When a single node indicator or process status is abnormal, an alarm is generally issued as a P2 level.The alarm information is sent to the members of the alarm group in the form of Xiaomi Office messages.(Xiaomi Office is a privatized deployment product of ByteDance Feishu in Xiaomi, and its functions are similar to Feishu.)
P1 alarm (alarm level is higher):In a short period of time (within 3 minutes), the cluster will issue a P1 level alarm if there are short-term exceptions such as increased query delay and abnormal writing,etc.The alarm information is sent to the members of the alarm group in the form of Xiaomi Office messages.P1 level alarms require Oncall engineers to respond and provide feedback.
P0 alarm (alarm level is high):In a long period of time (more than 3 minutes), the cluster will issue a P0 level alarm if there are exceptions such as increased query delay and abnormal writing,etc.Alarm information is sent in the form of Xiaomi office messages and phone alarms.P0 level alarm requires Oncall engineers to respond within 1 minute and coordinate resources for failure recovery and review preparation.

03 Cloud-Doris

cloud-Doris is a data collection component developed by Xiaomi for the internal Doris service. Its main capability is to detect the availability of the Doris service and collect the cluster indicator data of internal concern.For example, Cloud-Doris can periodically simulate users reading and writing to the Doris system to detect the availability of services.If the cluster has abnormal availability, it will be alerted through Falcon.Collect user's read and write data, and then generate user bill.Collect information such as table-level data volume, unhealthy copies, and oversized Tablets, and send alarms to abnormal information through Falcon.

04 QingZhou inspection

For chronic hidden dangers such as capacity, user growth, resource allocation, etc., we use the unified QingZhou big data service inspection platform for inspection and reporting.The inspection generally consists of two parts:Service-specific inspections and basic indicator inspections.Among them, the service-specific inspection refers to the indicators that are unique to each big data service and cannot be used universally.For Doris, it mainly includes: Quota, number of shard copies, number of single table columns, number of table partitions, etc.By increasing the inspection method, the chronic hidden dangers that are difficult to be alarmed in advance can be well avoided, which provides support for the failure-free major festivals.

Failure Recovery

When an online cluster fails, the first principle should be to quickly restore services.If the cause of the failure is clear, handle it according to the specific cause and restore the service.If the cause of the failure is not clear, you should try restarting the process as soon as you keep the snapshot to restore the service.

01 Access Failures Handling

Doris uses Xiaomi LVS as the access layer, which is similar to the LB service of open source or public cloud, and provides layer 4 or layer 7 traffic load scheduling capability.After Doris binds a reasonable port,Generally speaking, if an abnormality occurs in a single FE node, it will be automatically kicked out, and the service can be restored without the user's perception, and an alarm will be issued for the abnormal node.Of course, for FE faults that cannot be processed in a short time, we will first adjust the weight of the faulty node to 0 or delete the abnormal node from LVS first to prevent unpredictable problems caused by process detection exceptions.

02 Node Failure Handling

For FE node failures, if the cause of the failure cannot be quickly located, it is generally necessary to keep thread snapshots and memory snapshots and restart the process.

jstack 进程ID >> 快照文件名.jstack

Save a memory snapshot of FE with the command:

jmap -dump:live,format=b,file=快照文件名.heap 进程ID

In the case of version upgrade or some unexpected scenarios, the image of the FE node may have abnormal metadata, and the abnormal metadata may be synchronized to other FE, resulting in all FE not working.Once a failed image is discovered, the fastest recovery option is to use Recovery mode to stop FE elections and replace the failed image with the backup image.Of course, it is not easy to backup images all the time.Since this failure is common in cluster upgrades, we recommend adding simple local image backup logic to the cluster upgrade procedure.Ensure that a copy of the current and latest image data will be retained before each upgrade starts the FE process.For BE node failure, if the process crashes, a core file will be generated, and minos will automatically pull the process;If the task is stuck, you need to restart the process after retaining the thread snapshot with the following command:

pstack 进程ID >> 快照文件名.pstack

Concluding Remarks

Apache Doris has been widely used by Xiaomi since the first use of open source software Apache Doris by Xiaomi Group in September 2019.At present, it has served dozens of businesses of Xiaomi, with dozens of clusters and hundreds of nodes, and a set of data ecology with Apache Doris as the core has been formed within Xiaomi.In order to improve the efficiency of operation and maintenance, Xiaomi has also developed a complete set of automated management and operation and maintenance systems around Doris.With the increasing number of services, Doris also exposed some problems. For example, there was no better resource isolation mechanism in the past version, and services would affect each other. In addition, system monitoring needs to be further improved.With the rapid development of the community, more and more small partners have participated in the community construction, the vectorized engine has been transformed, the transformation of the query optimizer is in full swing, and Apache Doris is gradually maturing.

Business Practice

01 User Access​

02 Data import​

03 Data Query​

04 Compaction Tuning​

Monitoring and Alarm Management

01 Monitoring System​

02 Falcon​

03 Cloud-Doris​

04 QingZhou inspection​