Skip to main content

Diagnostic Tools

Overview

Efficient and effective performance diagnostic tools are crucial for database system tuning, as they determine whether problematic business SQL queries can be quickly identified, and subsequently, performance bottlenecks can be rapidly pinpointed and resolved, ensuring that the database system meets its Service Level Agreements (SLAs).

Currently, Doris considers SQL queries with execution times exceeding 5 seconds as slow SQL by default. This threshold can be configured via config.qe_slow_log_ms. Doris currently offers the following three diagnostic channels to help quickly identify slow SQL queries with performance issues:

Doris Manager Logs

The log module in Doris Manager provides a slow SQL filtering function. Users can view slow SQL by selecting the fe.audit.log on a specific FE node. By simply entering slow_query in the search box, the historical slow SQL information of the current system will be displayed on the page, as shown in the figure below:

Doris Manager Monitoring and Logging

Audit Log

Currently, Doris FE provides four types of Audit Logs, including slow_query, query, load, and stream_load. Besides accessing the logs through the log page on the cluster where the Manager service is installed and deployed, Audit Logs can also be directly obtained by accessing the fe/log/fe.audit.log file on the node where FE is located.

By directly searching for the slow_query tag in fe.audit.log, you can quickly filter out slow-executing SQL queries, as shown below:

2024-07-18 11:23:13,042 [slow_query] |Client=127.0.0.1:63510|User=root|Ctl=internal|Db=tpch_sf1000|State=EOF|ErrorCode=0|ErrorMessage=|Time(ms)=11603|ScanBytes=236667379712|ScanRows=13649979418|ReturnRows=100|StmtId=1689|QueryId=91ff336304f14182-9ca537eee75b3856|IsQuery=true|isNereids=true|feIp=172.21.0.10|Stmt=select     c_name,     c_custkey,     o_orderkey,     o_orderdate,     o_totalprice,     sum(l_quantity) from     customer,     orders,     lineitem where     o_orderkey  in  (         select             l_orderkey         from             lineitem         group  by             l_orderkey  having                 sum(l_quantity)  >  300     )     and  c_custkey  =  o_custkey     and  o_orderkey  =  l_orderkey group  by     c_name,     c_custkey,     o_orderkey,     o_orderdate,     o_totalprice order  by     o_totalprice  desc,     o_orderdate limit  100|CpuTimeMS=918556|ShuffleSendBytes=3267419|ShuffleSendRows=89668|SqlHash=b4e1de9f251214a30188180f37907f7d|peakMemoryBytes=38720935552|SqlDigest=f23c7a7ecff61da33f537b2699e9b053|cloudClusterName=UNKNOWN|TraceId=|WorkloadGroup=normal|FuzzyVariables=|scanBytesFromLocalStorage=0|scanBytesFromRemoteStorage=0
2024-07-18 11:23:33,043 [slow_query] |Client=127.0.0.1:26672|User=root|Ctl=internal|Db=tpch_sf1000|State=EOF|ErrorCode=0|ErrorMessage=|Time(ms)=8978|ScanBytes=334985555968|ScanRows=10717654374|ReturnRows=100|StmtId=1815|QueryId=6e1fae453cb04d9a-b1e5f94d9cea1885|IsQuery=true|isNereids=true|feIp=172.21.0.10|Stmt=select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA' group by s_name order by numwait desc, s_name limit 100|CpuTimeMS=990127|ShuffleSendBytes=59208164|ShuffleSendRows=3651504|SqlHash=f8a30e4182d72cce3eff6cb385005b1f|peakMemoryBytes=10495660672|SqlDigest=fec5a7136f9375aa968a4de971b994da|cloudClusterName=UNKNOWN|TraceId=|WorkloadGroup=normal|FuzzyVariables=|scanBytesFromLocalStorage=0|scanBytesFromRemoteStorage=0
2024-07-18 11:23:41,044 [slow_query] |Client=127.0.0.1:26684|User=root|Ctl=internal|Db=tpch_sf1000|State=EOF|ErrorCode=0|ErrorMessage=|Time(ms)=8514|ScanBytes=334986551296|ScanRows=10717654374|ReturnRows=100|StmtId=1833|QueryId=4f91483464ce4aa8-beeed7dcb8675bc8|IsQuery=true|isNereids=true|feIp=172.21.0.10|Stmt=select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA' group by s_name order by numwait desc, s_name limit 100|CpuTimeMS=925841|ShuffleSendBytes=59223190|ShuffleSendRows=3651602|SqlHash=f8a30e4182d72cce3eff6cb385005b1f|peakMemoryBytes=10505123104|SqlDigest=fec5a7136f9375aa968a4de971b994da|cloudClusterName=UNKNOWN|TraceId=|WorkloadGroup=normal|FuzzyVariables=|scanBytesFromLocalStorage=0|scanBytesFromRemoteStorage=0
2024-07-18 11:23:49,044 [slow_query] |Client=127.0.0.1:10748|User=root|Ctl=internal|Db=tpch_sf1000|State=EOF|ErrorCode=0|ErrorMessage=|Time(ms)=8660|ScanBytes=334987673600|ScanRows=10717654374|ReturnRows=100|StmtId=1851|QueryId=4599cb1bab204f80-ac430dd78b45e3da|IsQuery=true|isNereids=true|feIp=172.21.0.10|Stmt=select s_name, count(*) as numwait from supplier, lineitem l1, orders, nation where s_suppkey = l1.l_suppkey and o_orderkey = l1.l_orderkey and o_orderstatus = 'F' and l1.l_receiptdate > l1.l_commitdate and exists ( select * from lineitem l2 where l2.l_orderkey = l1.l_orderkey and l2.l_suppkey <> l1.l_suppkey ) and not exists ( select * from lineitem l3 where l3.l_orderkey = l1.l_orderkey and l3.l_suppkey <> l1.l_suppkey and l3.l_receiptdate > l3.l_commitdate ) and s_nationkey = n_nationkey and n_name = 'SAUDI ARABIA' group by s_name order by numwait desc, s_name limit 100|CpuTimeMS=932664|ShuffleSendBytes=59223178|ShuffleSendRows=3651991|SqlHash=f8a30e4182d72cce3eff6cb385005b1f|peakMemoryBytes=10532849344|SqlDigest=fec5a7136f9375aa968a4de971b994da|cloudClusterName=UNKNOWN|TraceId=|WorkloadGroup=normal|FuzzyVariables=|scanBytesFromLocalStorage=0|scanBytesFromRemoteStorage=0

The slow SQL obtained through fe.audit.log allows users to easily access detailed information such as execution time, number of rows scanned, number of rows returned, and the SQL statement itself, laying the foundation for further reproducing and locating performance issues.

Additionally, the Audit Log includes a SqlDigest field (e.g., SqlDigest=... in the example above). This field is a hash value generated from the structure of the SQL statement (with specific parameter values removed). By aggregating and analyzing SqlDigest in slow_query, you can identify "patterns" of slow queries. This means that even if specific SQL statements differ slightly due to parameters, their SqlDigest will be identical as long as the structure is the same.

Using SqlDigest, users can determine which SQL patterns appear most frequently or consume the most time, allowing them to prioritize optimization for these "high-frequency" or "high-latency" patterns. This approach significantly improves the efficiency of slow query optimization by avoiding the inefficiency of analyzing individual SQL statements one by one.

It is important to note that SqlDigest itself is just a hash value and is not directly readable. Once the slow query pattern to be optimized is identified, you need to refer to the Stmt field in the Audit Log to get the specific SQL statement content corresponding to that pattern. Furthermore, the QueryId field can be used to retrieve detailed Profile information for the query (Profile retrieval and analysis will be detailed in subsequent sections) for in-depth performance analysis and optimization.

Slow Query Analysis Example

Taking the 4 slow query logs in the fe.audit.log above as an example, we can observe:

  1. The first log (Time(ms)=11603) has a SqlDigest of f23c7a7ecff61da33f537b2699e9b053.
  2. The subsequent three logs (Time(ms)=8978, 8514, 8660) all have the SqlDigest of fec5a7136f9375aa968a4de971b994da.

This indicates that the latter three slow queries belong to the same SQL pattern (structurally identical), although their specific execution times (Time(ms)) and details differ slightly.

In real-world optimization scenarios, if we find that a certain SqlDigest (such as fec5a7136f9375aa968a4de971b994da) appears repeatedly in the slow query logs, or accounts for a high proportion of the total execution time, we should prioritize optimizing this pattern.

Recommended Optimization Steps:

  1. Identify Business Logic: By examining the Stmt field in any log entry of this pattern:

    select     s_name,     count(*) as numwait from     supplier,     lineitem l1,     orders,     nation where ...

    We can identify the specific SQL logic.

  2. Deep Analysis: Use the QueryId (e.g., 6e1fae453cb04d9a-b1e5f94d9cea1885) from the log to retrieve the corresponding Query Profile in Doris. The Profile helps analyze whether the issue is due to excessive data scanning, prolonged Join operations, or other reasons, enabling targeted optimization strategies (such as adding indexes, optimizing SQL writing, or adjusting table structures).

Once the issue with this pattern is resolved, all slow queries belonging to this pattern will be improved.

audit_log System Table

Starting from Doris version 2.1, the audit_log system table is provided under the __internal_schema database for users to view the execution status of SQL queries. Before using it, the global configuration set global enable_audit_plugin=true; needs to be enabled (this switch is disabled by default).

mysql> use __internal_schema;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+-----------------------------+
| Tables_in___internal_schema |
+-----------------------------+
| audit_log |
| column_statistics |
| histogram_statistics |
| partition_statistics |
+-----------------------------+
4 rows in set (0.00 sec)

mysql> desc audit_log;
+-------------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-------+---------+-------+
| query_id | varchar(48) | Yes | true | NULL | |
| time | datetime | Yes | true | NULL | |
| client_ip | varchar(128) | Yes | true | NULL | |
| user | varchar(128) | Yes | false | NULL | NONE |
| catalog | varchar(128) | Yes | false | NULL | NONE |
| db | varchar(128) | Yes | false | NULL | NONE |
| state | varchar(128) | Yes | false | NULL | NONE |
| error_code | int | Yes | false | NULL | NONE |
| error_message | text | Yes | false | NULL | NONE |
| query_time | bigint | Yes | false | NULL | NONE |
| scan_bytes | bigint | Yes | false | NULL | NONE |
| scan_rows | bigint | Yes | false | NULL | NONE |
| return_rows | bigint | Yes | false | NULL | NONE |
| stmt_id | bigint | Yes | false | NULL | NONE |
| is_query | tinyint | Yes | false | NULL | NONE |
| frontend_ip | varchar(128) | Yes | false | NULL | NONE |
| cpu_time_ms | bigint | Yes | false | NULL | NONE |
| sql_hash | varchar(128) | Yes | false | NULL | NONE |
| sql_digest | varchar(128) | Yes | false | NULL | NONE |
| peak_memory_bytes | bigint | Yes | false | NULL | NONE |
| stmt | text | Yes | false | NULL | NONE |
+-------------------+--------------+------+-------+---------+-------+

Through the audit_log internal table, users can query detailed SQL execution information and perform detailed statistical analysis such as slow query filtering.

Summary

Doris Manager logs, audit logs, and the audit_log system table provide capabilities such as automatic or manual filtering of slow SQL queries, as well as fine-grained statistical analysis of SQL execution information. These tools offer powerful support for systematic performance diagnosis and tuning.