
Monitoring Metrics

Both the FE and BE processes of Doris expose built-in monitoring metrics, by default in a Prometheus-compatible format. This document lists all observable metrics and marks each with a priority, so that you can quickly build monitoring and alerting after integrating with Prometheus and Grafana.

Applicable Scenarios

Scenario | Purpose
Cluster health monitoring | Quickly detect FE/BE node anomalies through P0 metrics
Alert configuration | Configure threshold-based alerts on metrics such as QPS, error rate, and Compaction Score
Performance troubleshooting | Locate bottlenecks using query latency, thread pool queueing, and memory usage metrics
Capacity planning | Evaluate cluster load using disk usage, tablet count, and connection count
Incident review | Analyze the root cause of failures by combining machine metrics (CPU, IO, network) with process metrics

Metric Categories

Doris monitoring metrics are divided into two categories by observation target:

  1. Process monitoring: Shows the runtime metrics of the Doris process itself, such as query count, transaction count, and Compaction status.
  2. Node monitoring: Shows the resource metrics of the machine on which the Doris process runs, such as CPU, memory, IO, and network.

Retrieving Monitoring Metrics

Prometheus-Compatible Format (Default)

Access the HTTP port of an FE or BE node directly:

curl http://fe_host:http_port/metrics
curl http://be_host:webserver_port/metrics

Sample response:

doris_fe_cache_added{type="partition"} 0
doris_fe_cache_added{type="sql"} 0
doris_fe_cache_hit{type="partition"} 0
doris_fe_cache_hit{type="sql"} 0
doris_fe_connection_total 2
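
To work with these values programmatically, the text format can be parsed line by line. The Python sketch below is illustrative only: `parse_metrics` and `fetch_metrics` are hypothetical helper names, not part of Doris, and the parser handles only simple `name{labels} value` lines such as those in the sample above, not the full Prometheus exposition format.

```python
import re
import urllib.request

# One sample per line: metric name, optional {label="value",...} block, then the value.
_SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a dict mapping (metric_name, sorted label tuple) to a float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL_RE.findall(raw_labels or "")))
        try:
            samples[(name, labels)] = float(value)
        except ValueError:
            continue  # ignore values this simple parser cannot interpret
    return samples


def fetch_metrics(host, port, timeout=5.0):
    """Fetch and parse /metrics from an FE or BE HTTP port (host and port are placeholders)."""
    url = "http://%s:%d/metrics" % (host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

With this layout, an unlabeled metric is looked up with an empty label tuple, for example `samples[("doris_fe_connection_total", ())]`.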

JSON Format

Use the type=json parameter to retrieve metrics in JSON format:

curl "http://fe_host:http_port/metrics?type=json"
curl "http://be_host:webserver_port/metrics?type=json"

Monitoring Priority and Best Practices

  • The last column of each metric table is the priority. P0 is the most important; the larger the number, the lower the importance. Integrate P0 metrics first.

  • The vast majority of monitoring metrics are of type Counter (cumulative value). A raw counter value says little by itself: sample it at a fixed interval (for example, every 15 seconds) and calculate the slope, that is, the change per unit time, to obtain useful information.

    For example, by calculating the slope of doris_fe_query_err, you can obtain the query error rate (errors per second).
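
The slope calculation described above can be sketched as follows. This is a minimal illustration, not a Doris or Prometheus API: `counter_rate` is a hypothetical helper, and treating a decrease as a counter reset (for example, after a process restart) is an assumption about how you want to handle restarts.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate of a cumulative counter between two samples.

    prev_ts and curr_ts are sample times in seconds.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: the counter restarted from zero, so everything
        # observed in the current sample accumulated since the reset.
        delta = curr_value
    return delta / elapsed


# Two samples of doris_fe_query_err taken 15 seconds apart:
# 130 - 100 = 30 new errors over 15 s, i.e. 2 errors per second.
rate = counter_rate(100, 0.0, 130, 15.0)
```

In a Prometheus setup the same calculation is normally done by the query layer rather than by hand; the sketch only shows what the slope means.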

You are welcome to contribute to this table to provide more comprehensive and useful monitoring metrics.

FE Monitoring Metrics

Process Monitoring

Name | Label | Unit | Meaning | Description | Priority
doris_fe_cache_added | {type="partition"} | Num | Cumulative number of newly added Partition Caches | - | -
- | {type="sql"} | Num | Cumulative number of newly added SQL Caches | - | -
doris_fe_cache_hit | {type="partition"} | Num | Count of Partition Cache hits | - | -
- | {type="sql"} | Num | Count of SQL Cache hits | - | -
doris_fe_connection_total | - | Num | Current number of MySQL port connections on the FE | Used to monitor query connection count. If the connection count exceeds the limit, new connections cannot be accepted | P0
doris_fe_counter_hit_sql_block_rule | - | Num | Number of queries blocked by SQL BLOCK RULE | - | -
doris_fe_edit_log_clean | {type="failed"} | Num | Number of failures when cleaning historical metadata logs | This should not fail. If it does, manual intervention is required | P0
- | {type="success"} | Num | Number of successful historical metadata log cleanups | - | -
doris_fe_edit_log | {type="accumulated_bytes"} | Bytes | Cumulative value of metadata log write volume | Calculate the slope to obtain the write rate and observe whether metadata writes are delayed | P0
- | {type="current_bytes"} | Bytes | Current value of metadata log | Used to monitor editlog size. If the size exceeds the limit, manual intervention is required | P0
- | {type="read"} | Num | Count of metadata log reads | Observe whether the metadata read frequency is normal via the slope | P0
- | {type="write"} | Num | Count of metadata log writes | Observe whether the metadata write frequency is normal via the slope | P0
- | {type="current"} | Num | Current count of metadata logs | Used to monitor editlog count. If the count exceeds the limit, manual intervention is required | P0
doris_fe_editlog_write_latency_ms | - | Milliseconds | Percentile statistics of metadata log write latency. For example, {quantile="0.75"} indicates the 75th-percentile write latency | - | -
doris_fe_image_clean | {type="failed"} | Num | Number of failures when cleaning historical metadata image files | This should not fail. If it does, manual intervention is required | P0
- | {type="success"} | Num | Number of successful historical metadata image file cleanups | - | -
doris_fe_image_push | {type="failed"} | Num | Number of failures when pushing metadata image files to other FE nodes | - | -
- | {type="success"} | Num | Number of successful pushes of metadata image files to other FE nodes | - | -
doris_fe_image_write | {type="failed"} | Num | Number of failures when generating metadata image files | This should not fail. If it does, manual intervention is required | P0
- | {type="success"} | Num | Number of successful metadata image file generations | - | -
doris_fe_job | - | Num | Count of jobs of different types and states. For example, {job="load", type="INSERT", state="LOADING"} indicates the number of INSERT-type load jobs in the LOADING state | Observe the number of jobs of different types in the cluster as needed | P0
doris_fe_max_journal_id | - | Num | The maximum metadata log ID on the current FE node. For the Master FE, this is the maximum ID currently written; for a non-Master FE, this is the maximum ID of the metadata logs currently replayed | Used to observe whether the IDs across multiple FEs diverge significantly. A large gap indicates a metadata sync problem | P0
doris_fe_max_tablet_compaction_score | - | Num | The maximum Compaction Score value across all BE nodes | Use this value to observe the maximum Compaction Score in the current cluster and determine whether it is too high. A high value may cause query or write latency | P0
doris_fe_qps | - | Num/Sec | Queries per second on the current FE (counts query requests only) | QPS | P0
doris_fe_query_err | - | Num | Cumulative number of failed queries | - | -
doris_fe_query_err_rate | - | Num/Sec | Failed queries per second | Observe whether query errors occur in the cluster | P0
doris_fe_query_latency_ms | - | Milliseconds | Percentile statistics of query request latency. For example, {quantile="0.75"} indicates the 75th-percentile query latency | Inspect query latency at each percentile in detail | P0
doris_fe_query_latency_ms_db | - | Milliseconds | Percentile statistics of query request latency for each DB. For example, {quantile="0.75",db="test"} indicates the 75th-percentile query latency for DB test | Inspect query latency at each percentile per DB in detail | P0
doris_fe_query_olap_table | - | Num | Count of query requests against internal tables (OlapTable) | - | -
doris_fe_query_total | - | Num | Cumulative count of all query requests | - | -
doris_fe_report_queue_size | - | Num | Length of the queue on the FE side for various periodic report tasks from BE | This value reflects the degree of blocking of report tasks on the Master FE node. The larger the value, the more limited the FE's processing capacity | P0
doris_fe_request_total | - | Num | All operation requests (including queries and other statements) received through the MySQL port | - | -
doris_fe_routine_load_error_rows | - | Num | Total number of error rows across all Routine Load jobs in the cluster | - | -
doris_fe_routine_load_receive_bytes | - | Bytes | Total volume of data received by all Routine Load jobs in the cluster | - | -
doris_fe_routine_load_rows | - | Num | Total number of data rows received by all Routine Load jobs in the cluster | - | -
doris_fe_routine_load_get_meta_latency | - | Milliseconds | Latency of metadata retrieval for all Routine Load Jobs in the cluster | - | -
doris_fe_routine_load_get_meta_count | - | Num | Number of metadata retrieval operations for all Routine Load Jobs in the cluster | - | -
doris_fe_routine_load_get_meta_fail_count | - | Num | Number of failed metadata retrievals for all Routine Load Jobs in the cluster | - | -
doris_fe_routine_load_task_execute_time | - | Milliseconds | Execution time of all Routine Load Tasks in the cluster | - | -
doris_fe_routine_load_task_execute_count | - | Num | Number of executions of all Routine Load Tasks in the cluster | - | -
doris_fe_routine_load_lag | - | Milliseconds | Consumption lag of all Routine Load Jobs in the cluster | - | -
doris_fe_routine_load_progress | - | Milliseconds | Consumption progress of all Routine Load Jobs in the cluster | - | -
doris_fe_routine_load_abort_task_num | - | Num | Number of failed Tasks across all Routine Load Jobs in the cluster | - | -
doris_fe_rps | - | Num/Sec | Requests per second on the current FE (includes queries and other statement types) | Use together with QPS to see the volume of requests processed by the cluster | P0
doris_fe_scheduled_tablet_num | - | Num | Number of tablets being scheduled by the Master FE node, including replicas being repaired and replicas being balanced | This value reflects the number of tablets currently being migrated in the cluster. A non-zero value over a long period indicates an unstable cluster | P0
doris_fe_tablet_max_compaction_score | - | Num | Compaction Score reported by each BE node. For example, {backend="172.21.0.1:9556"} indicates the value reported by the BE at "172.21.0.1:9556" | - | -
doris_fe_tablet_num | - | Num | Total number of tablets on each BE node. For example, {backend="172.21.0.1:9556"} indicates the current tablet count on the BE at "172.21.0.1:9556" | Check whether tablet distribution is uniform and whether the absolute value is reasonable | P0
doris_fe_tablet_status_count | - | Num | Cumulative count of tablets scheduled by the Tablet scheduler on the Master FE node | - | -
- | {type="added"} | Num | Cumulative count of tablets scheduled by the Tablet scheduler on the Master FE node. "added" indicates the number of tablets that have been scheduled | - | -
- | {type="in_sched"} | Num | Same as above. Indicates the number of tablets that have been repeatedly scheduled | If this value grows rapidly, some tablets have been in an unhealthy state for a long time, causing the scheduler to schedule them repeatedly | -
- | {type="not_ready"} | Num | Same as above. Indicates the number of tablets that have not yet met the scheduling trigger conditions | If this value grows rapidly, a large number of tablets are unhealthy but cannot be scheduled | -
- | {type="total"} | Num | Same as above. Indicates the cumulative number of tablets that have been checked (but not necessarily scheduled) | - | -
- | {type="unhealthy"} | Num | Same as above. Indicates the cumulative number of unhealthy tablets that have been checked | - | -
doris_fe_thread_pool | - | Num | Statistics on the worker threads and queueing of various thread pools. active_thread_num indicates the number of tasks currently being executed. pool_size indicates the total number of threads in the pool. task_in_queue indicates the number of tasks currently queued | - | -
- | {name="agent-task-pool"} | Num | Thread pool used by the Master FE to send Agent Tasks to BEs | - | -
- | {name="connect-scheduler-check-timer"} | Num | Thread pool used to check whether MySQL idle connections have timed out | - | -
- | {name="connect-scheduler-pool"} | Num | Thread pool used to receive MySQL connection requests | - | -
- | {name="mysql-nio-pool"} | Num | Thread pool used by the NIO MySQL Server to process tasks | - | -
- | {name="export-exporting-job-pool"} | Num | Scheduling thread pool for export jobs in the exporting state | - | -
- | {name="export-pending-job-pool"} | Num | Scheduling thread pool for export jobs in the pending state | - | -
- | {name="heartbeat-mgr-pool"} | Num | Thread pool used by the Master FE to handle heartbeats from each node | - | -
- | {name="loading-load-task-scheduler"} | Num | Scheduling thread pool used by the Master FE to schedule loading tasks in Broker Load jobs | - | -
- | {name="pending-load-task-scheduler"} | Num | Scheduling thread pool used by the Master FE to schedule pending tasks in Broker Load jobs | - | -
- | {name="schema-change-pool"} | Num | Thread pool used by the Master FE to schedule schema change jobs | - | -
- | {name="thrift-server-pool"} | Num | Worker thread pool of the FE-side ThriftServer. Corresponds to rpc_port in fe.conf and is used to interact with BEs | - | -
doris_fe_txn_counter | - | Num | Cumulative count of load transactions in each state | Observe the execution status of load transactions | P0
- | {type="begin"} | Num | Number of transactions begun (submitted) | - | -
- | {type="failed"} | Num | Number of failed transactions | - | -
- | {type="reject"} | Num | Number of rejected transactions (for example, when the current number of running transactions exceeds the threshold, new transactions are rejected) | - | -
- | {type="success"} | Num | Number of successful transactions | - | -
doris_fe_txn_status | - | Num | Number of load transactions currently in each state. For example, {type="committed"} indicates the number of transactions in the committed state | Observe the number of load transactions in each state to determine whether there is a backlog | P0
doris_fe_query_instance_num | - | Num | Number of fragment instances currently being requested by a specific user. For example, {user="test_u"} indicates the number of instances currently being requested by user test_u | Use this value to observe whether a specific user is consuming too many query resources | P0
doris_fe_query_instance_begin | - | Num | Number of fragment instances for which a specific user has started requests. For example, {user="test_u"} indicates the number of instances for which user test_u has started requests | Use this value to observe whether a specific user has submitted too many queries | P0
doris_fe_query_rpc_total | - | Num | Number of RPCs sent to a specific BE. For example, {be="192.168.10.1"} indicates the number of RPCs sent to the BE at IP 192.168.10.1 | Use this value to observe whether too many RPCs have been submitted to a specific BE | -
doris_fe_query_rpc_failed | - | Num | Number of failed RPCs sent to a specific BE. For example, {be="192.168.10.1"} indicates the number of failed RPCs sent to the BE at IP 192.168.10.1 | Use this value to observe whether a specific BE has RPC issues | -
doris_fe_query_rpc_size | - | Num | RPC data size for a specific BE. For example, {be="192.168.10.1"} indicates the byte count of RPC data sent to the BE at IP 192.168.10.1 | Use this value to observe whether oversized RPCs have been submitted to a specific BE | -
doris_fe_txn_exec_latency_ms | - | Milliseconds | Percentile statistics of transaction execution time. For example, {quantile="0.75"} indicates the 75th-percentile transaction execution time | Inspect transaction execution time at each percentile in detail | P0
doris_fe_txn_publish_latency_ms | - | Milliseconds | Percentile statistics of transaction publish time. For example, {quantile="0.75"} indicates the 75th-percentile transaction publish time | Inspect transaction publish time at each percentile in detail | P0
doris_fe_txn_num | - | Num | Number of transactions currently executing in a specific DB. For example, {db="test"} indicates the number of transactions currently executing in DB test | Use this value to observe whether a specific DB has submitted a large number of transactions | P0
doris_fe_publish_txn_num | - | Num | Number of transactions currently being published in a specific DB. For example, {db="test"} indicates the number of transactions currently being published in DB test | Use this value to observe the number of publish transactions in a specific DB | P0
doris_fe_txn_replica_num | - | Num | Number of replicas opened by transactions currently executing in a specific DB. For example, {db="test"} indicates the number of replicas opened by transactions currently executing in DB test | Use this value to observe whether a specific DB has opened too many replicas, which may affect the execution of other transactions | P0
doris_fe_thrift_rpc_total | - | Num | Number of RPC requests received by each method of the FE thrift interface. For example, {method="report"} indicates the number of RPC requests received by the report method | Use this value to observe the load of a specific thrift rpc method | -
doris_fe_thrift_rpc_latency_ms | - | Milliseconds | RPC request processing time for each method of the FE thrift interface. For example, {method="report"} indicates the RPC request processing time of the report method | Use this value to observe the load of a specific thrift rpc method | -
doris_fe_external_schema_cache | {catalog="hive"} | Num | Number of entries in the schema cache for a specific External Catalog | - | -
doris_fe_hive_meta_cache | {catalog="hive"} | Num | - | - | -
- | {type="partition_value"} | Num | Number of entries in the partition value cache for a specific External Hive Metastore Catalog | - | -
- | {type="partition"} | Num | Number of entries in the partition cache for a specific External Hive Metastore Catalog | - | -
- | {type="file"} | Num | Number of entries in the file cache for a specific External Hive Metastore Catalog | - | -
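
As noted for doris_fe_max_journal_id, a large gap between FE nodes points to a metadata sync problem. The Python sketch below shows one way to compute that gap once the value has been collected from each FE. The helper names, the input mapping, and the threshold are all hypothetical; a suitable threshold depends on your cluster's write volume.

```python
def journal_lag(max_journal_ids):
    """Given {fe_name: doris_fe_max_journal_id}, return each FE's gap to the highest ID seen."""
    highest = max(max_journal_ids.values())
    return {fe: highest - value for fe, value in max_journal_ids.items()}


def lagging_fes(max_journal_ids, max_gap):
    """Names of FEs whose journal ID trails the highest ID by more than max_gap."""
    lag = journal_lag(max_journal_ids)
    return sorted(fe for fe, gap in lag.items() if gap > max_gap)


# Hypothetical values scraped from three FE nodes at about the same time.
ids = {"fe1": 1000, "fe2": 990, "fe3": 400}
behind = lagging_fes(ids, max_gap=100)  # only fe3 trails by more than 100
```

Because the nodes are scraped at slightly different moments, a small non-zero gap is expected even in a healthy cluster.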

JVM Monitoring

Name | Label | Unit | Meaning | Description | Priority
jvm_heap_size_bytes | - | Bytes | JVM memory monitoring. The labels include max, used, and committed, corresponding to maximum, used, and committed memory respectively | Observe JVM memory usage | P0
jvm_non_heap_size_bytes | - | Bytes | JVM off-heap memory statistics | - | -
<GarbageCollector> | - | - | GC monitoring | GarbageCollector refers to a specific garbage collector | P0
- | {type="count"} | Num | Cumulative number of GCs | - | -
- | {type="time"} | Milliseconds | Cumulative GC time | - | -
jvm_old_size_bytes | - | Bytes | JVM old generation memory statistics | - | P0
jvm_thread | - | Num | JVM thread count statistics | Observe whether the JVM thread count is reasonable | P0
jvm_young_size_bytes | - | Bytes | JVM young generation memory statistics | - | P0
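
Because the GC count and GC time are cumulative, they become meaningful only when two samples are compared. The following Python sketch derives the average pause per collection and the share of wall-clock time spent in GC between two samples. `gc_interval_stats` is a hypothetical helper, and the inputs are assumed to be the {type="count"} and {type="time"} values of one garbage collector.

```python
def gc_interval_stats(prev_count, prev_time_ms, curr_count, curr_time_ms, interval_s):
    """Return (average pause in ms per GC, fraction of the interval spent in GC)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    gc_runs = curr_count - prev_count
    gc_time_ms = curr_time_ms - prev_time_ms
    # No collections in the interval means there is no pause to average.
    avg_pause_ms = gc_time_ms / gc_runs if gc_runs > 0 else 0.0
    gc_fraction = gc_time_ms / (interval_s * 1000.0)
    return avg_pause_ms, gc_fraction


# 4 collections taking 300 ms in total during a 15-second interval:
# 75 ms per collection on average, 2% of the interval spent in GC.
avg_pause, fraction = gc_interval_stats(10, 500, 14, 800, 15.0)
```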

Machine Monitoring

Name | Label | Unit | Meaning | Description | Priority
system_meminfo | - | Bytes | Memory monitoring of the FE node machine. Collected from /proc/meminfo. Includes buffers, cached, memory_available, memory_free, and memory_total | - | -
system_snmp | - | - | Network monitoring of the FE node machine. Collected from /proc/net/snmp | - | -
- | {name="tcp_in_errs"} | Num | Number of tcp packet receive errors | - | -
- | {name="tcp_in_segs"} | Num | Number of tcp packets received | - | -
- | {name="tcp_out_segs"} | Num | Number of tcp packets sent | - | -
- | {name="tcp_retrans_segs"} | Num | Number of tcp packet retransmissions | - | -

BE Monitoring Metrics

Process Monitoring

Name | Label | Unit | Meaning | Description | Priority
doris_be_active_scan_context_count | - | Num | Shows the number of scanners currently opened directly by external systems | - | -
doris_be_add_batch_task_queue_size | - | Num | Records the queue size of the thread pool that receives batches during load | If greater than 0, there is a backlog on the receiving side of load tasks | P0
agent_task_queue_size | - | Num | Shows the length of each Agent Task processing queue. For example, {type="CREATE_TABLE"} indicates the length of the CREATE_TABLE task queue | - | -
doris_be_brpc_endpoint_stub_count | - | Num | Number of created brpc stubs used for interaction between BEs | - | -
doris_be_brpc_function_endpoint_stub_count | - | Num | Number of created brpc stubs used for interaction with Remote RPC | - | -
doris_be_cache_capacity | - | - | Records the capacity of a specific LRU Cache | - | -
doris_be_cache_usage | - | - | Records the usage of a specific LRU Cache | Observe memory usage | P0
doris_be_cache_usage_ratio | - | - | Records the usage ratio of a specific LRU Cache | - | -
doris_be_cache_lookup_count | - | - | Records the number of lookups on a specific LRU Cache | - | -
doris_be_cache_hit_count | - | - | Records the hit count of a specific LRU Cache | - | -
doris_be_cache_hit_ratio | - | - | Records the hit ratio of a specific LRU Cache | Observe whether the cache is effective | P0
- | {name="DataPageCache"} | Num | DataPageCache caches the Data Page of data | Data cache, directly affects query efficiency | P0
- | {name="IndexPageCache"} | Num | IndexPageCache caches the Index Page of data | Index cache, directly affects query efficiency | P0
- | {name="LastSuccessChannelCache"} | Num | LastSuccessChannelCache caches the LoadChannel on the load receiver side | - | -
- | {name="SegmentCache"} | Num | SegmentCache caches opened Segments, such as index information | - | -
doris_be_chunk_pool_local_core_alloc_count | - | Num | Number of times memory is allocated from the memory queue of the bound core in the ChunkAllocator | - | -
doris_be_chunk_pool_other_core_alloc_count | - | Num | Number of times memory is allocated from the memory queue of other cores in the ChunkAllocator | - | -
doris_be_chunk_pool_reserved_bytes | - | Bytes | Size of memory reserved in the ChunkAllocator | - | -
doris_be_chunk_pool_system_alloc_cost_ns | - | Nanoseconds | Cumulative time spent allocating memory by the SystemAllocator | Observe the time cost of memory allocation via the slope | P0
doris_be_chunk_pool_system_alloc_count | - | Num | Number of times the SystemAllocator allocates memory | - | -
doris_be_chunk_pool_system_free_cost_ns | - | Nanoseconds | Cumulative time spent freeing memory by the SystemAllocator | Observe the time cost of memory release via the slope | P0
doris_be_chunk_pool_system_free_count | - | Num | Number of times the SystemAllocator frees memory | - | -
doris_be_compaction_bytes_total | - | Bytes | Cumulative volume of data processed by Compaction | Records the disk size of input rowsets in Compaction tasks. Observe the Compaction rate via the slope | P0
- | {type="base"} | Bytes | Cumulative data volume of Base Compaction | - | -
- | {type="cumulative"} | Bytes | Cumulative data volume of Cumulative Compaction | - | -
doris_be_compaction_deltas_total | - | Num | Cumulative number of rowsets processed by Compaction | Records the number of input rowsets in Compaction tasks | -
- | {type="base"} | Num | Cumulative number of rowsets processed by Base Compaction | - | -
- | {type="cumulative"} | Num | Cumulative number of rowsets processed by Cumulative Compaction | - | -
doris_be_disks_compaction_num | - | Num | Number of Compaction tasks currently running on a specific data directory. For example, {path="/path1/"} indicates the number of tasks currently running on the /path1 directory | Observe whether the number of Compaction tasks on each disk is reasonable | P0
doris_be_disks_compaction_score | - | Num | Number of Compaction tokens currently running on a specific data directory. For example, {path="/path1/"} indicates the number of tokens currently running on the /path1 directory | - | -
doris_be_compaction_used_permits | - | Num | Number of tokens already used by Compaction tasks | Reflects the resource consumption of Compaction | -
doris_be_compaction_waitting_permits | - | Num | Number of items waiting for Compaction tokens | - | -
doris_be_data_stream_receiver_count | - | Num | Number of data Receivers on the receiving side | FIXME: This metric is missing in the vectorized engine | -
doris_be_disks_avail_capacity | - | Bytes | Remaining space on the disk where a specific data directory resides. For example, {path="/path1/"} indicates the remaining space on the disk where the /path1 directory resides | - | P0
doris_be_disks_local_used_capacity | - | Bytes | Local used space on the disk where a specific data directory resides | - | -
doris_be_disks_remote_used_capacity | - | Bytes | Used space of the remote directory corresponding to the disk where a specific data directory resides | - | -
doris_be_disks_state | - | Boolean | Disk state of a specific data directory. 1 indicates normal, 0 indicates abnormal | - | -
doris_be_disks_total_capacity | - | Bytes | Total capacity of the disk where a specific data directory resides | Use together with doris_be_disks_avail_capacity to calculate disk usage | P0
doris_be_engine_requests_total | - | Num | Cumulative count of execution states of various tasks on the BE | - | -
- | {status="failed",type="xxx"} | Num | Cumulative number of failures for tasks of type xxx | - | -
- | {status="total",type="xxx"} | Num | Cumulative total number of executions for tasks of type xxx | Monitor the failure count of various task types as needed | P0
- | {status="skip",type="report_all_tablets"} | Num | Cumulative number of times tasks of type report_all_tablets were skipped | - | -
doris_be_fragment_endpoint_count | - | Num | Same as doris_be_data_stream_receiver_count | FIXME: Same count as doris_be_data_stream_receiver_count. Also missing in the vectorized engine | -
doris_be_fragment_request_duration_us | - | Microseconds | Cumulative execution time of all fragment instances | Observe instance execution time via the slope | P0
doris_be_fragment_requests_total | - | Num | Cumulative number of fragment instances executed | - | -
doris_be_load_channel_count | - | Num | Number of load channels currently open | The larger the value, the more load tasks are currently running | P0
doris_be_local_bytes_read_total | - | Bytes | Number of bytes read by LocalFileReader | - | P0
doris_be_local_bytes_written_total | - | Bytes | Number of bytes written by LocalFileWriter | - | P0
doris_be_local_file_reader_total | - | Num | Cumulative count of opened LocalFileReader instances | - | -
doris_be_local_file_open_reading | - | Num | Number of LocalFileReader instances currently open | - | -
doris_be_local_file_writer_total | - | Num | Cumulative count of opened LocalFileWriter instances | - | -
doris_be_mem_consumption | - | Bytes | Current memory consumption of a specific module. For example, {type="compaction"} indicates the current total memory consumption of the Compaction module | The value is taken from the MemTracker of the same type. FIXME | -
doris_be_memory_allocated_bytes | - | Bytes | BE process physical memory size, taken from the VmRSS field of /proc/self/status | - | P0
doris_be_memory_jemalloc | - | Bytes | Jemalloc stats, taken from je_mallctl | For the meaning, see: https://jemalloc.net/jemalloc.3.html | P0
doris_be_memory_pool_bytes_total | - | Bytes | Memory size currently occupied by all MemPools. A statistical value that does not represent real memory usage | - | -
doris_be_memtable_flush_duration_us | - | Microseconds | Cumulative time spent writing memtables to disk | Observe write latency via the slope | P0
doris_be_memtable_flush_total | - | Num | Cumulative number of memtables written to disk | Calculate the frequency of file writes via the slope | P0
doris_be_meta_request_duration | - | Microseconds | Cumulative time spent accessing meta in RocksDB | Observe the BE metadata read/write latency via the slope | P0
- | {type="read"} | Microseconds | Read time | - | -
- | {type="write"} | Microseconds | Write time | - | -
doris_be_meta_request_total | - | Num | Cumulative number of accesses to meta in RocksDB | Observe the BE metadata access frequency via the slope | P0
- | {type="read"} | Num | Number of reads | - | -
- | {type="write"} | Num | Number of writes | - | -
doris_be_fragment_instance_count | - | Num | Number of fragment instances currently received | Observe whether instances are accumulating | P0
doris_be_process_fd_num_limit_hard | - | Num | Hard limit on the number of file handles for the BE process. Collected via /proc/pid/limits | - | -
doris_be_process_fd_num_limit_soft | - | Num | Soft limit on the number of file handles for the BE process. Collected via /proc/pid/limits | - | -
doris_be_process_fd_num_used | - | Num | Number of file handles used by the BE process. Collected via /proc/pid/limits | - | -
doris_be_process_thread_num | - | Num | Number of threads in the BE process. Collected via /proc/pid/task | - | P0
doris_be_query_cache_memory_total_byte | - | Bytes | Bytes occupied by the Query Cache | - | -
doris_be_query_cache_partition_total_count | - | Num | Current number of entries in the Partition Cache | - | -
doris_be_query_cache_sql_total_count | - | Num | Current number of entries in the SQL Cache | - | -
doris_be_query_scan_bytes | - | Bytes | Cumulative volume of data read. Only counts data read from Olap tables | - | -
doris_be_query_scan_bytes_per_second | - | Bytes/Sec | Read rate calculated from doris_be_query_scan_bytes | Observe query rate | P0
doris_be_query_scan_rows | - | Num | Cumulative number of rows read. Only counts data read from Olap tables. This is RawRowsRead (some data rows may be skipped by indexes and not actually read, but are still counted in this value) | Observe query rate via the slope | P0
doris_be_result_block_queue_count | - | Num | Number of fragment instances currently in the query result cache | This queue is only used when external systems read directly. For example, Spark on Doris queries data via external scan | -
doris_be_result_buffer_block_count | - | Num | Number of queries currently in the query result cache | This value reflects how many query results in the current BE are waiting to be consumed by the FE | P0
doris_be_routine_load_task_count | - | Num | Number of routine load tasks currently running | - | -
doris_be_rowset_count_generated_and_in_use | - | Num | Number of newly added rowset IDs in use since the last startup | - | -
doris_be_s3_file_reader_total | - | Num | Cumulative number of times S3FileReader has been opened | - | -
doris_be_s3_file_open_reading | - | Num | Number of S3FileReader instances currently open | - | -
doris_be_s3_bytes_read_total | - | Bytes | Cumulative number of bytes read by S3FileReader | - | -
doris_be_scanner_thread_pool_queue_sizeNumCurrent queue size of the thread pool used for OlapScannerA value greater than zero indicates that Scanners are starting to accumulateP0
doris_be_segment_read{type="segment_read_total"}NumCumulative number of segments read
doris_be_segment_read{type="segment_row_total"}NumCumulative number of rows read across segmentsThis value also includes rows filtered by indexes. It is equivalent to the number of segments read multiplied by the total rows per segment
doris_be_send_batch_thread_pool_queue_sizeNumQueue size of the thread pool used to send data packets during loadA value greater than 0 indicates accumulationP0
doris_be_send_batch_thread_pool_thread_numNumNumber of threads in the thread pool used to send data packets during load
doris_be_small_file_cache_countNumNumber of small files currently cached on the BE
doris_be_streaming_load_current_processingNumNumber of stream load tasks currently runningOnly includes tasks sent via the curl command
doris_be_streaming_load_duration_msMillisecondsCumulative execution time of all stream load tasks
doris_be_streaming_load_requests_totalNumCumulative number of stream load tasksObserve task submission frequency via the slopeP0
doris_be_stream_load_pipe_countNumCurrent number of stream load data pipesIncludes both stream load and routine load tasks
doris_be_stream_load{type="load_rows"}NumCumulative number of rows finally loaded by stream loadIncludes both stream load and routine load tasksP0
doris_be_stream_load{type="receive_bytes"}BytesCumulative number of bytes received by stream loadIncludes data received by stream load over HTTP and data read from Kafka by routine loadP0
doris_be_tablet_base_max_compaction_scoreNumCurrent maximum Base Compaction ScoreThis value changes in real time and may miss peak data. The higher the value, the more severe the Compaction backlogP0
doris_be_tablet_cumulative_max_compaction_scoreNumSame as above. Current maximum Cumulative Compaction Score
doris_be_tablet_version_num_distributionNumHistogram of the number of tablet versionsReflects the distribution of tablet version countsP0
doris_be_thrift_connections_totalNumCumulative number of created thrift connections. For example, {name="heartbeat"} indicates the cumulative number of connections to the heartbeat serviceThis value is for the thrift server in which the BE acts as the server side
doris_be_thrift_current_connectionsNumCurrent number of thrift connections. For example, {name="heartbeat"} indicates the current number of connections to the heartbeat serviceSame as above
doris_be_thrift_opened_clientsNumCurrent number of opened thrift clients. For example, {name="frontend"} indicates the number of clients accessing the FE service
doris_be_thrift_used_clientsNumCurrent number of thrift clients in use. For example, {name="frontend"} indicates the number of clients currently used to access the FE service
| doris_be_timeout_canceled_fragment_count | | Num | Cumulative number of fragment instances canceled due to timeout | This value may be recorded multiple times. For example, some fragment instances may be canceled multiple times | P0 |
| doris_be_stream_load_txn_request | {type="begin"} | Num | Cumulative number of stream load transactions started | Includes both stream load and routine load tasks | |
| doris_be_stream_load_txn_request | {type="commit"} | Num | Cumulative number of stream load transactions successfully committed | Same as above | |
| doris_be_stream_load_txn_request | {type="rollback"} | Num | Cumulative number of stream load transactions that failed | Same as above | |
| doris_be_unused_rowsets_count | | Num | Number of currently deprecated rowsets | These rowsets are periodically deleted under normal circumstances | |
| doris_be_upload_fail_count | | Num | Cumulative number of times rowsets failed to upload to remote storage in the tiered storage feature | | |
| doris_be_upload_rowset_count | | Num | Cumulative number of times rowsets were successfully uploaded to remote storage in the tiered storage feature | | |
| doris_be_upload_total_byte | | Bytes | Cumulative data volume of rowsets successfully uploaded to remote storage in the tiered storage feature | | |
| doris_be_load_bytes | | Bytes | Cumulative number of bytes sent via tablet sink | Observe load data volume | P0 |
| doris_be_load_rows | | Num | Cumulative number of rows sent via tablet sink | Observe load data volume | P0 |
| fragment_thread_pool_queue_size | | Num | Current length of the wait queue for the query execution thread pool | If greater than zero, query threads are exhausted and queries are accumulating | P0 |
| doris_be_all_rowsets_num | | Num | Total number of all current rowsets | | P0 |
| doris_be_all_segments_num | | Num | Total number of all current segments | | P0 |
| doris_be_heavy_work_max_threads | | Num | Number of threads in the brpc heavy thread pool | | P0 |
| doris_be_light_work_max_threads | | Num | Number of threads in the brpc light thread pool | | P0 |
| doris_be_heavy_work_pool_queue_size | | Num | Maximum queue length of the brpc heavy thread pool. Submission of work is blocked when exceeded | | P0 |
| doris_be_light_work_pool_queue_size | | Num | Maximum queue length of the brpc light thread pool. Submission of work is blocked when exceeded | | P0 |
| doris_be_heavy_work_active_threads | | Num | Number of active threads in the brpc heavy thread pool | | P0 |
| doris_be_light_work_active_threads | | Num | Number of active threads in the brpc light thread pool | | P0 |
| routine_load_get_msg_latency | | Milliseconds | Latency of Routine Load retrieving Kafka messages | | |
| routine_load_get_msg_count | | Num | Number of times Routine Load retrieves Kafka messages | | |
| routine_load_consume_bytes | | Bytes | Volume of data consumed from Kafka by Routine Load | | |
| routine_load_consume_rows | | Num | Number of rows consumed from Kafka by Routine Load | | |
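
The brpc thread pool metrics can be combined into a simple saturation ratio (active threads divided by maximum threads). The following is a minimal sketch, assuming the metric values have already been scraped into a dictionary. The function name `pool_saturation` and the sample values are illustrative and not part of Doris.

```python
def pool_saturation(metrics: dict[str, float], pool: str) -> float:
    """Return the active/max thread ratio for a brpc pool ("heavy" or "light")."""
    active = metrics[f"doris_be_{pool}_work_active_threads"]
    max_threads = metrics[f"doris_be_{pool}_work_max_threads"]
    if max_threads <= 0:
        # Avoid dividing by zero if the pool size was not reported.
        return 0.0
    return active / max_threads


# Made-up sample values, for illustration only.
sample = {
    "doris_be_heavy_work_active_threads": 96.0,
    "doris_be_heavy_work_max_threads": 128.0,
    "doris_be_light_work_active_threads": 8.0,
    "doris_be_light_work_max_threads": 128.0,
}

heavy_ratio = pool_saturation(sample, "heavy")  # 0.75
light_ratio = pool_saturation(sample, "light")  # 0.0625
```

A ratio that stays close to 1.0 suggests the pool is running at capacity. Any alert threshold built on it is a judgment call for your own workload.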

Machine Monitoring

| Name | Label | Unit | Meaning | Description | Priority |
| --- | --- | --- | --- | --- | --- |
| doris_be_cpu | | Num | CPU-related monitoring metrics, collected from /proc/stat. Values are collected for each logical core. For example, {device="cpu0",mode="nice"} indicates the nice value of cpu0 | CPU usage can be calculated from this | P0 |
| doris_be_disk_bytes_read | | Bytes | Cumulative disk read volume. Collected from /proc/diskstats. Values are collected for each disk. For example, {device="vdd"} indicates the value of the vdd disk | | |
| doris_be_disk_bytes_written | | Bytes | Cumulative disk write volume. Collected in the same way as above | | |
| doris_be_disk_io_time_ms | | Milliseconds | Collected in the same way as above | IO Util can be calculated from this | P0 |
| doris_be_disk_io_time_weighted | | Milliseconds | Collected in the same way as above | | |
| doris_be_disk_reads_completed | | Num | Collected in the same way as above | | |
| doris_be_disk_read_time_ms | | Milliseconds | Collected in the same way as above | | |
| doris_be_disk_writes_completed | | Num | Collected in the same way as above | | |
| doris_be_disk_write_time_ms | | Milliseconds | Collected in the same way as above | | |
| doris_be_fd_num_limit | | Num | System file handle limit ceiling. Collected from /proc/sys/fs/file-nr | | |
| doris_be_fd_num_used | | Num | Number of file handles used by the system. Collected from /proc/sys/fs/file-nr | | |
| doris_be_file_created_total | | Num | Cumulative number of local file creations | Counts all files that called local_file_writer and were finally closed | |
| doris_be_load_average | | Num | Machine Load Avg metric monitoring. For example, {mode="15_minutes"} is the 15-minute Load Avg | Observe the overall machine load | P0 |
| doris_be_max_disk_io_util_percent | | Percentage | The calculated maximum IO UTIL value among all disks | | P0 |
| doris_be_max_network_receive_bytes_rate | | Bytes/Sec | The calculated maximum receive rate among all network interfaces | | P0 |
| doris_be_max_network_send_bytes_rate | | Bytes/Sec | The calculated maximum send rate among all network interfaces | | P0 |
| doris_be_memory_pgpgin | | Bytes | Volume of data written from disk to memory pages by the system | | |
| doris_be_memory_pgpgout | | Bytes | Volume of data written from system memory pages to disk | | |
| doris_be_memory_pswpin | | Bytes | Volume swapped in from disk to memory by the system | Normally, swap should be disabled, so this value should be 0 | |
| doris_be_memory_pswpout | | Bytes | Volume swapped out from memory to disk by the system | Normally, swap should be disabled, so this value should be 0 | |
| doris_be_network_receive_bytes | | Bytes | Cumulative receive bytes for each network interface. Collected from /proc/net/dev | | |
| doris_be_network_receive_packets | | Num | Cumulative receive packet count for each network interface. Collected from /proc/net/dev | | |
| doris_be_network_send_bytes | | Bytes | Cumulative send bytes for each network interface. Collected from /proc/net/dev | | |
| doris_be_network_send_packets | | Num | Cumulative send packet count for each network interface. Collected from /proc/net/dev | | |
| doris_be_proc | {mode="ctxt_switch"} | Num | Cumulative number of CPU context switches. Collected from /proc/stat | Observe whether there are abnormal context switches | P0 |
| doris_be_proc | {mode="interrupt"} | Num | Cumulative number of CPU interrupts. Collected from /proc/stat | | |
| doris_be_proc | {mode="procs_blocked"} | Num | Number of processes currently blocked in the system (for example, waiting for IO). Collected from /proc/stat | | |
| doris_be_proc | {mode="procs_running"} | Num | Number of processes currently running in the system. Collected from /proc/stat | | |
| doris_be_snmp_tcp_in_errs | | Num | Number of tcp packet receive errors. Collected from /proc/net/snmp | Observe network errors such as retransmissions and packet loss. Use together with other snmp metrics | P0 |
| doris_be_snmp_tcp_in_segs | | Num | Number of tcp packets received. Collected from /proc/net/snmp | | |
| doris_be_snmp_tcp_out_segs | | Num | Number of tcp packets sent. Collected from /proc/net/snmp | | |
| doris_be_snmp_tcp_retrans_segs | | Num | Number of tcp packet retransmissions. Collected from /proc/net/snmp | | |
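
The table notes that CPU usage can be calculated from doris_be_cpu and IO Util from doris_be_disk_io_time_ms. The sketch below shows one common way to do this from two samples of the cumulative counters. It is an illustration under assumptions, not Doris's own calculation: the mode names other than `nice` are assumed to follow the /proc/stat field names, the helper names are hypothetical, and the sample values are made up.

```python
def cpu_busy_fraction(prev: dict[str, float], curr: dict[str, float],
                      idle_modes: tuple[str, ...] = ("idle", "iowait")) -> float:
    """Fraction of CPU time spent outside the idle modes between two samples.

    prev and curr map a mode name to its cumulative counter for one core.
    """
    total_delta = sum(curr[mode] - prev[mode] for mode in curr)
    if total_delta <= 0:
        return 0.0
    idle_delta = sum(curr[m] - prev[m] for m in idle_modes if m in curr)
    return 1.0 - idle_delta / total_delta


def io_util_percent(prev_io_ms: float, curr_io_ms: float, interval_s: float) -> float:
    """Percentage of the interval during which the disk was busy with IO."""
    return 100.0 * (curr_io_ms - prev_io_ms) / (interval_s * 1000.0)


# Made-up samples for a single core (assumed mode names).
prev = {"user": 1000, "nice": 0, "system": 500, "idle": 8000, "iowait": 500}
curr = {"user": 1300, "nice": 0, "system": 600, "idle": 8500, "iowait": 600}

busy = cpu_busy_fraction(prev, curr)        # about 0.4
util = io_util_percent(1000.0, 4000.0, 15)  # 20.0
```

Whether iowait counts as idle time is a design choice. It is treated as idle here, which matches how many tools report CPU utilization.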

FAQ

Q: The monitoring endpoint returns 404. What should I do?

Confirm that you are accessing the http_port of the FE or the webserver_port of the BE, and verify the actual port values in fe.conf / be.conf.

Q: Counter metrics only increase and never decrease. Is this normal?

This is normal. A Counter represents a cumulative value. To obtain a rate, sample it at intervals and calculate the change per unit of time (for example, with PromQL's rate(), which returns the average per-second increase over the selected time window).
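
The idea of turning a cumulative Counter into a rate can be sketched as follows. This is a simplified two-sample illustration, not the full algorithm behind PromQL's rate(). The function name `counter_rate` is hypothetical, and the code assumes that a decrease between samples means the process restarted and the counter began again from zero.

```python
def counter_rate(v0: float, t0: float, v1: float, t1: float) -> float:
    """Average per-second increase of a counter between two samples.

    v0, v1 are counter values; t0, t1 are sample times in seconds.
    """
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = v1 - v0
    if delta < 0:
        # Assumption: the counter was reset (for example, a restart),
        # so everything counted since the reset is v1.
        delta = v1
    return delta / elapsed


steady = counter_rate(1000, 0, 1600, 60)       # 10.0 per second
after_restart = counter_rate(1000, 0, 300, 60)  # 5.0 per second
```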

Q: The doris_fe_max_journal_id values of different FEs differ too much. What should I do?

This indicates a metadata sync delay. Check the network connectivity and replay speed of the Follower / Observer FE relative to the Master FE.
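
To quantify the gap, you can compare the doris_fe_max_journal_id value scraped from each FE against the largest one. The following is a minimal sketch, assuming the FE with the largest value is the most up to date. The function name `journal_lag`, the host names, and the values are hypothetical.

```python
def journal_lag(max_journal_ids: dict[str, int]) -> dict[str, int]:
    """Return how many journal entries each FE is behind the most advanced FE."""
    head = max(max_journal_ids.values())
    return {fe: head - journal_id for fe, journal_id in max_journal_ids.items()}


# Made-up values keyed by hypothetical FE host names.
lag = journal_lag({"fe1": 10500, "fe2": 10498, "fe3": 9000})
# lag == {"fe1": 0, "fe2": 2, "fe3": 1500}
```

What counts as "too large" depends on the write load, so any threshold applied to this lag is a choice for your own environment.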

Q: Compaction Score remains too high. How do I troubleshoot?

Focus on doris_be_tablet_base_max_compaction_score and doris_be_tablet_cumulative_max_compaction_score, and combine them with disk IO and the number of Compaction tasks to identify the bottleneck.

Q: Queries or loads are accumulating. How do I troubleshoot?

Check thread pool queueing metrics such as fragment_thread_pool_queue_size, doris_be_scanner_thread_pool_queue_size, and doris_be_send_batch_thread_pool_queue_size.
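
A quick check of these queue metrics can be scripted against the text returned by the /metrics endpoint. The sketch below is a deliberately simple line parser, not a full Prometheus exposition-format parser: it assumes each sample line ends with the value and carries no trailing timestamp. The function name `backed_up_queues` and the sample text are made up for illustration.

```python
QUEUE_METRICS = (
    "fragment_thread_pool_queue_size",
    "doris_be_scanner_thread_pool_queue_size",
    "doris_be_send_batch_thread_pool_queue_size",
)


def backed_up_queues(metrics_text: str) -> dict[str, float]:
    """Return the queue metrics whose current value is greater than zero."""
    found = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop any {label="..."} suffix
        if name in QUEUE_METRICS and float(value) > 0:
            found[name] = float(value)
    return found


# Made-up response body, for illustration only.
sample_text = """\
# TYPE doris_be_scanner_thread_pool_queue_size gauge
doris_be_scanner_thread_pool_queue_size 12
doris_be_send_batch_thread_pool_queue_size 0
fragment_thread_pool_queue_size 3
doris_be_load_rows 1024
"""

queued = backed_up_queues(sample_text)
```

With this sample, `queued` contains the scanner and fragment queue metrics, because only those two are above zero.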