Monitoring Metrics

Both the FE and BE processes of Doris have built-in monitoring metrics, exposed by default in a Prometheus-compatible format. This document lists all observable metrics systematically and annotates their priority (Priority), so that you can quickly build a monitoring and alerting system after integrating with Prometheus + Grafana.

Applicable Scenarios

Scenario	Purpose
Cluster health monitoring	Quickly detect FE/BE node anomalies through P0 metrics
Alert configuration	Configure threshold-based alerts on metrics such as QPS, error rate, and Compaction Score
Performance troubleshooting	Locate bottlenecks using query latency, thread pool queueing, and memory usage metrics
Capacity planning	Evaluate cluster load using disk usage, tablet count, and connection count
Incident review	Analyze the root cause of failures by combining machine metrics (CPU, IO, network) with process metrics

Metric Categories

Doris monitoring metrics are divided into two categories by observation target:

Process monitoring: Shows the runtime metrics of the Doris process itself, such as query count, transaction count, and Compaction status.
Node monitoring: Shows the resource metrics of the machine on which the Doris process runs, such as CPU, memory, IO, and network.

Retrieving Monitoring Metrics

Prometheus-Compatible Format (Default)

Access the HTTP port of an FE or BE node directly:

curl http://fe_host:http_port/metrics
curl http://be_host:webserver_port/metrics

Sample response:

doris_fe_cache_added{type="partition"} 0
doris_fe_cache_added{type="sql"} 0
doris_fe_cache_hit{type="partition"} 0
doris_fe_cache_hit{type="sql"} 0
doris_fe_connection_total 2

JSON Format

Use the type=json parameter to retrieve metrics in JSON format:

curl http://fe_host:http_port/metrics?type=json
curl http://be_host:webserver_port/metrics?type=json

Monitoring Priority and Best Practices

The last column of each metric table is the priority (Priority). P0 is the most important, and the larger the value, the lower the importance. Integrate P0 metrics first.
The vast majority of monitoring metrics are of type Counter (cumulative value). You must sample at intervals (for example, every 15 seconds) and calculate the slope per unit time to obtain useful information.

For example, by calculating the slope of doris_fe_query_err, you can obtain the query error rate (errors per second).

You are welcome to contribute to this table to provide more comprehensive and useful monitoring metrics.

FE Monitoring Metrics

Process Monitoring

Name	Label	Unit	Meaning	Description	Priority
`doris_fe_cache_added`	{type="partition"}	Num	Cumulative number of newly added Partition Caches
	{type="sql"}	Num	Cumulative number of newly added SQL Caches
`doris_fe_cache_hit`	{type="partition"}	Num	Count of Partition Cache hits
	{type="sql"}	Num	Count of SQL Cache hits
`doris_fe_connection_total`		Num	Current number of MySQL port connections on the FE	Used to monitor query connection count. If the connection count exceeds the limit, new connections cannot be accepted	P0
`doris_fe_counter_hit_sql_block_rule`		Num	Number of queries blocked by SQL BLOCK RULE
`doris_fe_edit_log_clean`	{type="failed"}	Num	Number of failures when cleaning historical metadata logs	This should not fail. If it does, manual intervention is required	P0
	{type="success"}	Num	Number of successful historical metadata log cleanups
`doris_fe_edit_log`	{type="accumulated_bytes"}	Bytes	Cumulative value of metadata log write volume	Calculate the slope to obtain the write rate and observe whether metadata writes are delayed	P0
	{type="current_bytes"}	Bytes	Current value of metadata log	Used to monitor editlog size. If the size exceeds the limit, manual intervention is required	P0
	{type="read"}	Num	Count of metadata log reads	Observe whether the metadata read frequency is normal via the slope	P0
	{type="write"}	Num	Count of metadata log writes	Observe whether the metadata write frequency is normal via the slope	P0
	{type="current"}	Num	Current count of metadata logs	Used to monitor editlog count. If the count exceeds the limit, manual intervention is required	P0
`doris_fe_editlog_write_latency_ms`		Milliseconds	Percentile statistics of metadata log write latency. For example, {quantile="0.75"} indicates the 75th-percentile write latency
`doris_fe_image_clean`	{type="failed"}	Num	Number of failures when cleaning historical metadata image files	This should not fail. If it does, manual intervention is required	P0
	{type="success"}	Num	Number of successful historical metadata image file cleanups
`doris_fe_image_push`	{type="failed"}	Num	Number of failures when pushing metadata image files to other FE nodes
	{type="success"}	Num	Number of successful pushes of metadata image files to other FE nodes
`doris_fe_image_write`	{type="failed"}	Num	Number of failures when generating metadata image files	This should not fail. If it does, manual intervention is required	P0
	{type="success"}	Num	Number of successful metadata image file generations
`doris_fe_job`		Num	Count of jobs of different types and states. For example, {job="load", type="INSERT", state="LOADING"} indicates the number of INSERT-type load jobs in the LOADING state	Observe the number of jobs of different types in the cluster as needed	P0
`doris_fe_max_journal_id`		Num	The maximum metadata log ID on the current FE node. For the Master FE, this is the maximum ID currently written; for a non-Master FE, this is the maximum ID of the metadata logs currently replayed	Used to observe whether the IDs across multiple FEs diverge significantly. A large gap indicates a metadata sync problem	P0
`doris_fe_max_tablet_compaction_score`		Num	The maximum Compaction Score value across all BE nodes	Use this value to observe the maximum Compaction Score in the current cluster and determine whether it is too high. A high value may cause query or write latency	P0
`doris_fe_qps`		Num/Sec	Queries per second on the current FE (counts query requests only)	QPS	P0
`doris_fe_query_err`		Num	Cumulative number of failed queries
`doris_fe_query_err_rate`		Num/Sec	Failed queries per second	Observe whether query errors occur in the cluster	P0
`doris_fe_query_latency_ms`		Milliseconds	Percentile statistics of query request latency. For example, {quantile="0.75"} indicates the 75th-percentile query latency	Inspect query latency at each percentile in detail	P0
`doris_fe_query_latency_ms_db`		Milliseconds	Percentile statistics of query request latency for each DB. For example, {quantile="0.75",db="test"} indicates the 75th-percentile query latency for DB test	Inspect query latency at each percentile per DB in detail	P0
`doris_fe_query_olap_table`		Num	Count of query requests against internal tables (OlapTable)
`doris_fe_query_total`		Num	Cumulative count of all query requests
`doris_fe_report_queue_size`		Num	Length of the queue on the FE side for various periodic report tasks from BE	This value reflects the degree of blocking of report tasks on the Master FE node. The larger the value, the more limited the FE's processing capacity	P0
`doris_fe_request_total`		Num	All operation requests (including queries and other statements) received through the MySQL port
`doris_fe_routine_load_error_rows`		Num	Total number of error rows across all Routine Load jobs in the cluster
`doris_fe_routine_load_receive_bytes`		Bytes	Total volume of data received by all Routine Load jobs in the cluster
`doris_fe_routine_load_rows`		Num	Total number of data rows received by all Routine Load jobs in the cluster
`doris_fe_routine_load_get_meta_latency`		Milliseconds	Latency of metadata retrieval for all Routine Load Jobs in the cluster
`doris_fe_routine_load_get_meta_count`		Num	Number of metadata retrieval operations for all Routine Load Jobs in the cluster
`doris_fe_routine_load_get_meta_fail_count`		Num	Number of failed metadata retrievals for all Routine Load Jobs in the cluster
`doris_fe_routine_load_task_execute_time`		Milliseconds	Execution time of all Routine Load Tasks in the cluster
`doris_fe_routine_load_task_execute_count`		Num	Number of executions of all Routine Load Tasks in the cluster
`doris_fe_routine_load_lag`		Milliseconds	Consumption lag of all Routine Load Jobs in the cluster
`doris_fe_routine_load_progress`		Milliseconds	Consumption progress of all Routine Load Jobs in the cluster
`doris_fe_routine_load_abort_task_num`		Num	Number of failed Tasks across all Routine Load Jobs in the cluster
`doris_fe_rps`		Num	Requests per second on the current FE (includes queries and other statement types)	Use together with QPS to see the volume of requests processed by the cluster	P0
`doris_fe_scheduled_tablet_num`		Num	Number of tablets being scheduled by the Master FE node, including replicas being repaired and replicas being balanced	This value reflects the number of tablets currently being migrated in the cluster. A non-zero value over a long period indicates an unstable cluster	P0
`doris_fe_tablet_max_compaction_score`		Num	Compaction Score reported by each BE node. For example, {backend="172.21.0.1:9556"} indicates the value reported by the BE at "172.21.0.1:9556"
`doris_fe_tablet_num`		Num	Total number of tablets on each BE node. For example, {backend="172.21.0.1:9556"} indicates the current tablet count on the BE at "172.21.0.1:9556"	Check whether tablet distribution is uniform and whether the absolute value is reasonable	P0
`doris_fe_tablet_status_count`		Num	Cumulative count of tablets scheduled by the Tablet scheduler on the Master FE node
	{type="added"}	Num	Cumulative count of tablets scheduled by the Tablet scheduler on the Master FE node. "added" indicates the number of tablets that have been scheduled
	{type="in_sched"}	Num	Same as above. Indicates the number of tablets that have been repeatedly scheduled	If this value grows rapidly, some tablets have been in an unhealthy state for a long time, causing the scheduler to schedule them repeatedly
	{type="not_ready"}	Num	Same as above. Indicates the number of tablets that have not yet met the scheduling trigger conditions	If this value grows rapidly, a large number of tablets are unhealthy but cannot be scheduled
	{type="total"}	Num	Same as above. Indicates the cumulative number of tablets that have been checked (but not necessarily scheduled)
	{type="unhealthy"}	Num	Same as above. Indicates the cumulative number of unhealthy tablets that have been checked
`doris_fe_thread_pool`		Num	Statistics on the worker threads and queueing of various thread pools. `active_thread_num` indicates the number of tasks currently being executed. `pool_size` indicates the total number of threads in the pool. `task_in_queue` indicates the number of tasks currently queued
	{name="agent-task-pool"}	Num	Thread pool used by the Master FE to send Agent Tasks to BEs
	{name="connect-scheduler-check-timer"}	Num	Thread pool used to check whether MySQL idle connections have timed out
	{name="connect-scheduler-pool"}	Num	Thread pool used to receive MySQL connection requests
	{name="mysql-nio-pool"}	Num	Thread pool used by the NIO MySQL Server to process tasks
	{name="export-exporting-job-pool"}	Num	Scheduling thread pool for export jobs in the exporting state
	{name="export-pending-job-pool"}	Num	Scheduling thread pool for export jobs in the pending state
	{name="heartbeat-mgr-pool"}	Num	Thread pool used by the Master FE to handle heartbeats from each node
	{name="loading-load-task-scheduler"}	Num	Scheduling thread pool used by the Master FE to schedule loading tasks in Broker Load jobs
	{name="pending-load-task-scheduler"}	Num	Scheduling thread pool used by the Master FE to schedule pending tasks in Broker Load jobs
	{name="schema-change-pool"}	Num	Thread pool used by the Master FE to schedule schema change jobs
	{name="thrift-server-pool"}	Num	Worker thread pool of the FE-side ThriftServer. Corresponds to `rpc_port` in fe.conf and is used to interact with BEs
`doris_fe_txn_counter`		Num	Cumulative count of load transactions in each state	Observe the execution status of load transactions	P0
	{type="begin"}	Num	Number of begun transactions
	{type="failed"}	Num	Number of failed transactions
	{type="reject"}	Num	Number of rejected transactions (for example, when the current number of running transactions exceeds the threshold, new transactions are rejected)
	{type="success"}	Num	Number of successful transactions
`doris_fe_txn_status`		Num	Number of load transactions currently in each state. For example, {type="committed"} indicates the number of transactions in the committed state	Observe the number of load transactions in each state to determine whether there is a backlog	P0
`doris_fe_query_instance_num`		Num	Number of fragment instances currently being requested by a specific user. For example, {user="test_u"} indicates the number of instances currently being requested by user test_u	Use this value to observe whether a specific user is consuming too many query resources	P0
`doris_fe_query_instance_begin`		Num	Number of fragment instances for which a specific user has started requests. For example, {user="test_u"} indicates the number of instances for which user test_u has started requests	Use this value to observe whether a specific user has submitted too many queries	P0
`doris_fe_query_rpc_total`		Num	Number of RPCs sent to a specific BE. For example, {be="192.168.10.1"} indicates the number of RPCs sent to the BE at IP 192.168.10.1	Use this value to observe whether too many RPCs have been submitted to a specific BE
`doris_fe_query_rpc_failed`		Num	Number of failed RPCs sent to a specific BE. For example, {be="192.168.10.1"} indicates the number of failed RPCs sent to the BE at IP 192.168.10.1	Use this value to observe whether a specific BE has RPC issues
`doris_fe_query_rpc_size`		Num	RPC data size for a specific BE. For example, {be="192.168.10.1"} indicates the byte count of RPC data sent to the BE at IP 192.168.10.1	Use this value to observe whether oversized RPCs have been submitted to a specific BE
`doris_fe_txn_exec_latency_ms`		Milliseconds	Percentile statistics of transaction execution time. For example, {quantile="0.75"} indicates the 75th-percentile transaction execution time	Inspect transaction execution time at each percentile in detail	P0
`doris_fe_txn_publish_latency_ms`		Milliseconds	Percentile statistics of transaction publish time. For example, {quantile="0.75"} indicates the 75th-percentile transaction publish time	Inspect transaction publish time at each percentile in detail	P0
`doris_fe_txn_num`		Num	Number of transactions currently executing in a specific DB. For example, {db="test"} indicates the number of transactions currently executing in DB test	Use this value to observe whether a specific DB has submitted a large number of transactions	P0
`doris_fe_publish_txn_num`		Num	Number of transactions currently being published in a specific DB. For example, {db="test"} indicates the number of transactions currently being published in DB test	Use this value to observe the number of publish transactions in a specific DB	P0
`doris_fe_txn_replica_num`		Num	Number of replicas opened by transactions currently executing in a specific DB. For example, {db="test"} indicates the number of replicas opened by transactions currently executing in DB test	Use this value to observe whether a specific DB has opened too many replicas, which may affect the execution of other transactions	P0
`doris_fe_thrift_rpc_total`		Num	Number of RPC requests received by each method of the FE thrift interface. For example, {method="report"} indicates the number of RPC requests received by the report method	Use this value to observe the load of a specific thrift rpc method
`doris_fe_thrift_rpc_latency_ms`		Milliseconds	RPC request processing time for each method of the FE thrift interface. For example, {method="report"} indicates the RPC request processing time of the report method	Use this value to observe the load of a specific thrift rpc method
`doris_fe_external_schema_cache`	{catalog="hive"}	Num	Number of entries in the schema cache for a specific External Catalog
`doris_fe_hive_meta_cache`	{catalog="hive"}	Num
	`{type="partition_value"}`	Num	Number of entries in the partition value cache for a specific External Hive Metastore Catalog
	`{type="partition"}`	Num	Number of entries in the partition cache for a specific External Hive Metastore Catalog
	`{type="file"}`	Num	Number of entries in the file cache for a specific External Hive Metastore Catalog

JVM Monitoring

Name	Label	Unit	Meaning	Description	Priority
`jvm_heap_size_bytes`		Bytes	JVM memory monitoring. The labels include max, used, and committed, corresponding to maximum, used, and committed memory respectively	Observe JVM memory usage	P0
`jvm_non_heap_size_bytes`		Bytes	JVM off-heap memory statistics
`<GarbageCollector>`			GC monitoring	GarbageCollector refers to a specific garbage collector	P0
	{type="count"}	Num	Cumulative number of GCs
	{type="time"}	Milliseconds	Cumulative GC time
`jvm_old_size_bytes`		Bytes	JVM old generation memory statistics		P0
`jvm_thread`		Num	JVM thread count statistics	Observe whether the JVM thread count is reasonable	P0
`jvm_young_size_bytes`		Bytes	JVM young generation memory statistics		P0

Machine Monitoring

Name	Label	Unit	Meaning
`system_meminfo`		Bytes	Memory monitoring of the FE node machine. Collected from `/proc/meminfo`. Includes `buffers`, `cached`, `memory_available`, `memory_free`, and `memory_total`
`system_snmp`			Network monitoring of the FE node machine. Collected from `/proc/net/snmp`
	`{name="tcp_in_errs"}`	Num	Number of tcp packet receive errors
	`{name="tcp_in_segs"}`	Num	Number of tcp packets received
	`{name="tcp_out_segs"}`	Num	Number of tcp packets sent
	`{name="tcp_retrans_segs"}`	Num	Number of tcp packet retransmissions

BE Monitoring Metrics

Process Monitoring

Name	Label	Unit	Meaning	Description	Priority
`doris_be_active_scan_context_count`		Num	Shows the number of scanners currently opened directly by external systems
`doris_be_add_batch_task_queue_size`		Num	Records the queue size of the thread pool that receives batches during load	If greater than 0, there is a backlog on the receiving side of load tasks	P0
`agent_task_queue_size`		Num	Shows the length of each Agent Task processing queue. For example, `{type="CREATE_TABLE"}` indicates the length of the CREATE_TABLE task queue
`doris_be_brpc_endpoint_stub_count`		Num	Number of created brpc stubs used for interaction between BEs
`doris_be_brpc_function_endpoint_stub_count`		Num	Number of created brpc stubs used for interaction with Remote RPC
`doris_be_cache_capacity`			Records the capacity of a specific LRU Cache
`doris_be_cache_usage`			Records the usage of a specific LRU Cache	Observe memory usage	P0
`doris_be_cache_usage_ratio`			Records the usage ratio of a specific LRU Cache
`doris_be_cache_lookup_count`			Records the number of lookups on a specific LRU Cache
`doris_be_cache_hit_count`			Records the hit count of a specific LRU Cache
`doris_be_cache_hit_ratio`			Records the hit ratio of a specific LRU Cache	Observe whether the cache is effective	P0
	{name="DataPageCache"}	Num	DataPageCache caches the Data Page of data	Data cache, directly affects query efficiency	P0
	{name="IndexPageCache"}	Num	IndexPageCache caches the Index Page of data	Index cache, directly affects query efficiency	P0
	{name="LastSuccessChannelCache"}	Num	LastSuccessChannelCache caches the LoadChannel on the load receiver side
	{name="SegmentCache"}	Num	SegmentCache caches opened Segments, such as index information
`doris_be_chunk_pool_local_core_alloc_count`		Num	Number of times memory is allocated from the memory queue of the bound core in the ChunkAllocator
`doris_be_chunk_pool_other_core_alloc_count`		Num	Number of times memory is allocated from the memory queue of other cores in the ChunkAllocator
`doris_be_chunk_pool_reserved_bytes`		Bytes	Size of memory reserved in the ChunkAllocator
`doris_be_chunk_pool_system_alloc_cost_ns`		Nanoseconds	Cumulative time spent allocating memory by the SystemAllocator	Observe the time cost of memory allocation via the slope	P0
`doris_be_chunk_pool_system_alloc_count`		Num	Number of times the SystemAllocator allocates memory
`doris_be_chunk_pool_system_free_cost_ns`		Nanoseconds	Cumulative time spent freeing memory by the SystemAllocator	Observe the time cost of memory release via the slope	P0
`doris_be_chunk_pool_system_free_count`		Num	Number of times the SystemAllocator frees memory
`doris_be_compaction_bytes_total`		Bytes	Cumulative volume of data processed by Compaction	Records the disk size of input rowsets in Compaction tasks. Observe the Compaction rate via the slope	P0
	{type="base"}	Bytes	Cumulative data volume of Base Compaction
	{type="cumulative"}	Bytes	Cumulative data volume of Cumulative Compaction
`doris_be_compaction_deltas_total`		Num	Cumulative number of rowsets processed by Compaction	Records the number of input rowsets in Compaction tasks
	{type="base"}	Num	Cumulative number of rowsets processed by Base Compaction
	{type="cumulative"}	Num	Cumulative number of rowsets processed by Cumulative Compaction
`doris_be_disks_compaction_num`		Num	Number of Compaction tasks currently running on a specific data directory. For example, `{path="/path1/"}` indicates the number of tasks currently running on the `/path1` directory	Observe whether the number of Compaction tasks on each disk is reasonable	P0
`doris_be_disks_compaction_score`		Num	Number of Compaction tokens currently running on a specific data directory. For example, `{path="/path1/"}` indicates the number of tokens currently running on the `/path1` directory
`doris_be_compaction_used_permits`		Num	Number of tokens already used by Compaction tasks	Reflects the resource consumption of Compaction
`doris_be_compaction_waitting_permits`		Num	Number of items waiting for Compaction tokens
`doris_be_data_stream_receiver_count`		Num	Number of data Receivers on the receiving side	FIXME: This metric is missing in the vectorized engine
`doris_be_disks_avail_capacity`		Bytes	Remaining space on the disk where a specific data directory resides. For example, `{path="/path1/"}` indicates the remaining space on the disk where the `/path1` directory resides		P0
`doris_be_disks_local_used_capacity`		Bytes	Local used space on the disk where a specific data directory resides
`doris_be_disks_remote_used_capacity`		Bytes	Used space of the remote directory corresponding to the disk where a specific data directory resides
`doris_be_disks_state`		Boolean	Disk state of a specific data directory. 1 indicates normal, 0 indicates abnormal
`doris_be_disks_total_capacity`		Bytes	Total capacity of the disk where a specific data directory resides	Use together with `doris_be_disks_avail_capacity` to calculate disk usage	P0
`doris_be_engine_requests_total`		Num	Cumulative count of execution states of various tasks on the BE
	{status="failed",type="xxx"}	Num	Cumulative number of failures for tasks of type xxx
	{status="total",type="xxx"}	Num	Cumulative total number of executions for tasks of type xxx	Monitor the failure count of various task types as needed	P0
	`{status="skip",type="report_all_tablets"}`	Num	Cumulative number of times tasks of type xxx were skipped
`doris_be_fragment_endpoint_count`		Num	Same as `doris_be_data_stream_receiver_count`	FIXME: Same count as `doris_be_data_stream_receiver_count`. Also missing in the vectorized engine
`doris_be_fragment_request_duration_us`		Microseconds	Cumulative execution time of all fragment instances	Observe instance execution time via the slope	P0
`doris_be_fragment_requests_total`		Num	Cumulative number of fragment instances executed
`doris_be_load_channel_count`		Num	Number of load channels currently open	The larger the value, the more load tasks are currently running	P0
`doris_be_local_bytes_read_total`		Bytes	Number of bytes read by `LocalFileReader`		P0
`doris_be_local_bytes_written_total`		Bytes	Number of bytes written by `LocalFileWriter`		P0
`doris_be_local_file_reader_total`		Num	Cumulative count of opened `LocalFileReader` instances
`doris_be_local_file_open_reading`		Num	Number of `LocalFileReader` instances currently open
`doris_be_local_file_writer_total`		Num	Cumulative count of opened `LocalFileWriter` instances
`doris_be_mem_consumption`		Bytes	Current memory consumption of a specific module. For example, {type="compaction"} indicates the current total memory consumption of the Compaction module	The value is taken from the MemTracker of the same type. FIXME
`doris_be_memory_allocated_bytes`		Bytes	BE process physical memory size, taken from `/proc/self/status/VmRSS`		P0
`doris_be_memory_jemalloc`		Bytes	Jemalloc stats, taken from `je_mallctl`	For the meaning, see: https://jemalloc.net/jemalloc.3.html	P0
`doris_be_memory_pool_bytes_total`		Bytes	Memory size currently occupied by all MemPools. A statistical value that does not represent real memory usage
`doris_be_memtable_flush_duration_us`		Microseconds	Cumulative time spent writing memtables to disk	Observe write latency via the slope	P0
`doris_be_memtable_flush_total`		Num	Cumulative number of memtables written to disk	Calculate the frequency of file writes via the slope	P0
`doris_be_meta_request_duration`		Microseconds	Cumulative time spent accessing meta in RocksDB	Observe the BE metadata read/write latency via the slope	P0
	{type="read"}	Microseconds	Read time
	{type="write"}	Microseconds	Write time
`doris_be_meta_request_total`		Num	Cumulative number of accesses to meta in RocksDB	Observe the BE metadata access frequency via the slope	P0
	{type="read"}	Num	Number of reads
	{type="write"}	Num	Number of writes
`doris_be_fragment_instance_count`		Num	Number of fragment instances currently received	Observe whether instances are accumulating	P0
`doris_be_process_fd_num_limit_hard`		Num	Hard limit on the number of file handles for the BE process. Collected via `/proc/pid/limits`
`doris_be_process_fd_num_limit_soft`		Num	Soft limit on the number of file handles for the BE process. Collected via `/proc/pid/limits`
`doris_be_process_fd_num_used`		Num	Number of file handles used by the BE process. Collected via `/proc/pid/limits`
`doris_be_process_thread_num`		Num	Number of threads in the BE process. Collected via `/proc/pid/task`		P0
`doris_be_query_cache_memory_total_byte`		Bytes	Bytes occupied by the Query Cache
`doris_be_query_cache_partition_total_count`		Num	Current number of entries in the Partition Cache
`doris_be_query_cache_sql_total_count`		Num	Current number of entries in the SQL Cache
`doris_be_query_scan_bytes`		Bytes	Cumulative volume of data read. Only counts data read from Olap tables
`doris_be_query_scan_bytes_per_second`		Bytes/Sec	Read rate calculated from `doris_be_query_scan_bytes`	Observe query rate	P0
`doris_be_query_scan_rows`		Num	Cumulative number of rows read. Only counts data read from Olap tables. This is RawRowsRead (some data rows may be skipped by indexes and not actually read, but are still counted in this value)	Observe query rate via the slope	P0
`doris_be_result_block_queue_count`		Num	Number of fragment instances currently in the query result cache	This queue is only used when external systems read directly. For example, Spark on Doris queries data via external scan
`doris_be_result_buffer_block_count`		Num	Number of queries currently in the query result cache	This value reflects how many query results in the current BE are waiting to be consumed by the FE	P0
`doris_be_routine_load_task_count`		Num	Number of routine load tasks currently running
`doris_be_rowset_count_generated_and_in_use`		Num	Number of newly added rowset IDs in use since the last startup
`doris_be_s3_bytes_read_total`		Num	Cumulative number of times `S3FileReader` has been opened
`doris_be_s3_file_open_reading`		Num	Number of `S3FileReader` instances currently open
`doris_be_s3_bytes_read_total`		Bytes	Cumulative number of bytes read by `S3FileReader`
`doris_be_scanner_thread_pool_queue_size`		Num	Current queue size of the thread pool used for OlapScanner	A value greater than zero indicates that Scanners are starting to accumulate	P0
`doris_be_segment_read`	`{type="segment_read_total"}`	Num	Cumulative number of segments read
`doris_be_segment_read`	`{type="segment_row_total"}`	Num	Cumulative number of rows read across segments	This value also includes rows filtered by indexes. It is equivalent to the number of segments read multiplied by the total rows per segment
`doris_be_send_batch_thread_pool_queue_size`		Num	Queue size of the thread pool used to send data packets during load	A value greater than 0 indicates accumulation	P0
`doris_be_send_batch_thread_pool_thread_num`		Num	Number of threads in the thread pool used to send data packets during load
`doris_be_small_file_cache_count`		Num	Number of small files currently cached on the BE
`doris_be_streaming_load_current_processing`		Num	Number of stream load tasks currently running	Only includes tasks sent via the curl command
`doris_be_streaming_load_duration_ms`		Milliseconds	Cumulative execution time of all stream load tasks
`doris_be_streaming_load_requests_total`		Num	Cumulative number of stream load tasks	Observe task submission frequency via the slope	P0
`doris_be_stream_load_pipe_count`		Num	Current number of stream load data pipes	Includes both stream load and routine load tasks
`doris_be_stream_load`	{type="load_rows"}	Num	Cumulative number of rows finally loaded by stream load	Includes both stream load and routine load tasks	P0
`doris_be_stream_load`	{type="receive_bytes"}	Bytes	Cumulative number of bytes received by stream load	Includes data received by stream load over HTTP and data read from Kafka by routine load	P0
`doris_be_tablet_base_max_compaction_score`		Num	Current maximum Base Compaction Score	This value changes in real time and may miss peak data. The higher the value, the more severe the Compaction backlog	P0
`doris_be_tablet_cumulative_max_compaction_score`		Num	Same as above. Current maximum Cumulative Compaction Score
`doris_be_tablet_version_num_distribution`		Num	Histogram of the number of tablet versions	Reflects the distribution of tablet version counts	P0
`doris_be_thrift_connections_total`		Num	Cumulative number of created thrift connections. For example, `{name="heartbeat"}` indicates the cumulative number of connections to the heartbeat service	This value is for the thrift server in which the BE acts as the server side
`doris_be_thrift_current_connections`		Num	Current number of thrift connections. For example, `{name="heartbeat"}` indicates the current number of connections to the heartbeat service	Same as above
`doris_be_thrift_opened_clients`		Num	Current number of opened thrift clients. For example, `{name="frontend"}` indicates the number of clients accessing the FE service
`doris_be_thrift_used_clients`		Num	Current number of thrift clients in use. For example, `{name="frontend"}` indicates the number of clients currently used to access the FE service
`doris_be_timeout_canceled_fragment_count`		Num	Cumulative number of fragment instances canceled due to timeout	This value may be recorded multiple times. For example, some fragment instances may be canceled multiple times	P0
`doris_be_stream_load_txn_request`	{type="begin"}	Num	Cumulative number of stream load transactions started	Includes both stream load and routine load tasks
`doris_be_stream_load_txn_request`	{type="commit"}	Num	Cumulative number of stream load transactions successfully committed	Same as above
`doris_be_stream_load_txn_request`	{type="rollback"}	Num	Cumulative number of stream load transactions that failed	Same as above
`doris_be_unused_rowsets_count`		Num	Number of currently deprecated rowsets	These rowsets are periodically deleted under normal circumstances
`doris_be_upload_fail_count`		Num	Cumulative number of times rowsets failed to upload to remote storage in the tiered storage feature
`doris_be_upload_rowset_count`		Num	Cumulative number of times rowsets were successfully uploaded to remote storage in the tiered storage feature
`doris_be_upload_total_byte`		Bytes	Cumulative data volume of rowsets successfully uploaded to remote storage in the tiered storage feature
`doris_be_load_bytes`		Bytes	Cumulative number of bytes sent via tablet sink	Observe load data volume	P0
`doris_be_load_rows`		Num	Cumulative number of rows sent via tablet sink	Observe load data volume	P0
`fragment_thread_pool_queue_size`		Num	Current length of the wait queue for the query execution thread pool	If greater than zero, query threads are exhausted and queries are accumulating	P0
`doris_be_all_rowsets_num`		Num	Total number of all current rowsets		P0
`doris_be_all_segments_num`		Num	Total number of all current segments		P0
`doris_be_heavy_work_max_threads`		Num	Number of threads in the brpc heavy thread pool		P0
`doris_be_light_work_max_threads`		Num	Number of threads in the brpc light thread pool		P0
`doris_be_heavy_work_pool_queue_size`		Num	Maximum queue length of the brpc heavy thread pool. Submission of work is blocked when exceeded		P0
`doris_be_light_work_pool_queue_size`		Num	Maximum queue length of the brpc light thread pool. Submission of work is blocked when exceeded		P0
`doris_be_heavy_work_active_threads`		Num	Number of active threads in the brpc heavy thread pool		P0
`doris_be_light_work_active_threads`		Num	Number of active threads in the brpc light thread pool		P0
`routine_load_get_msg_latency`		Milliseconds	Latency of Routine Load retrieving Kafka messages
`routine_load_get_msg_count`		Num	Number of times Routine Load retrieves Kafka messages
`routine_load_consume_bytes`		Bytes	Volume of data consumed from Kafka by Routine Load
`routine_load_consume_rows`		Num	Number of rows consumed from Kafka by Routine Load

Machine Monitoring

Name	Label	Unit	Meaning	Description	Priority
`doris_be_cpu`		Num	CPU-related monitoring metrics, collected from `/proc/stat`. Values are collected for each logical core. For example, `{device="cpu0",mode="nice"}` indicates the nice value of cpu0	CPU usage can be calculated from this	P0
`doris_be_disk_bytes_read`		Bytes	Cumulative disk read volume. Collected from `/proc/diskstats`. Values are collected for each disk. For example, `{device="vdd"}` indicates the value of the vdd disk
`doris_be_disk_bytes_written`		Bytes	Cumulative disk write volume. Collected in the same way as above
`doris_be_disk_io_time_ms`		Milliseconds	Collected in the same way as above	IO Util can be calculated from this	P0
`doris_be_disk_io_time_weighted`		Milliseconds	Collected in the same way as above
`doris_be_disk_reads_completed`		Num	Collected in the same way as above
`doris_be_disk_read_time_ms`		Milliseconds	Collected in the same way as above
`doris_be_disk_writes_completed`		Num	Collected in the same way as above
`doris_be_disk_write_time_ms`		Milliseconds	Collected in the same way as above
`doris_be_fd_num_limit`		Num	System file handle limit ceiling. Collected from `/proc/sys/fs/file-nr`
`doris_be_fd_num_used`		Num	Number of file handles used by the system. Collected from `/proc/sys/fs/file-nr`
`doris_be_file_created_total`		Num	Cumulative number of local file creations	Counts all files that called `local_file_writer` and were finally closed
`doris_be_load_average`		Num	Machine Load Avg metric monitoring. For example, {mode="15_minutes"} is the 15-minute Load Avg	Observe the overall machine load	P0
`doris_be_max_disk_io_util_percent`		Percentage	The calculated maximum IO UTIL value among all disks		P0
`doris_be_max_network_receive_bytes_rate`		Bytes/Sec	The calculated maximum receive rate among all network interfaces		P0
`doris_be_max_network_send_bytes_rate`		Bytes/Sec	The calculated maximum send rate among all network interfaces		P0
`doris_be_memory_pgpgin`		Bytes	Volume of data written from disk to memory pages by the system
`doris_be_memory_pgpgout`		Bytes	Volume of data written from system memory pages to disk
`doris_be_memory_pswpin`		Bytes	Volume swapped in from disk to memory by the system	Normally, swap should be disabled, so this value should be 0
`doris_be_memory_pswpout`		Bytes	Volume swapped out from memory to disk by the system	Normally, swap should be disabled, so this value should be 0
`doris_be_network_receive_bytes`		Bytes	Cumulative receive bytes for each network interface. Collected from `/proc/net/dev`
`doris_be_network_receive_packets`		Num	Cumulative receive packet count for each network interface. Collected from `/proc/net/dev`
`doris_be_network_send_bytes`		Bytes	Cumulative send bytes for each network interface. Collected from `/proc/net/dev`
`doris_be_network_send_packets`		Num	Cumulative send packet count for each network interface. Collected from `/proc/net/dev`
`doris_be_proc`	`{mode="ctxt_switch"}`	Num	Cumulative number of CPU context switches. Collected from `/proc/stat`	Observe whether there are abnormal context switches	P0
`doris_be_proc`	`{mode="interrupt"}`	Num	Cumulative number of CPU interrupts. Collected from `/proc/stat`
`doris_be_proc`	`{mode="procs_blocked"}`	Num	Number of processes currently blocked in the system (for example, waiting for IO). Collected from `/proc/stat`
`doris_be_proc`	`{mode="procs_running"}`	Num	Number of processes currently running in the system. Collected from `/proc/stat`
`doris_be_snmp_tcp_in_errs`		Num	Number of tcp packet receive errors. Collected from `/proc/net/snmp`	Observe network errors such as retransmissions and packet loss. Use together with other snmp metrics	P0
`doris_be_snmp_tcp_in_segs`		Num	Number of tcp packets received. Collected from `/proc/net/snmp`
`doris_be_snmp_tcp_out_segs`		Num	Number of tcp packets sent. Collected from `/proc/net/snmp`
`doris_be_snmp_tcp_retrans_segs`		Num	Number of tcp packet retransmissions. Collected from `/proc/net/snmp`

FAQ

Q: The monitoring endpoint returns 404. What should I do?

Confirm that you are accessing the http_port of the FE or the webserver_port of the BE, and verify the actual port values in fe.conf / be.conf.

Q: Counter metrics only increase and never decrease. Is this normal?

This is normal. A Counter represents a cumulative value. You must sample at intervals and calculate the slope (for example, PromQL's rate()) to obtain the instantaneous rate.

Q: The gap of `doris_fe_max_journal_id` between multiple FEs is too large. What should I do?

This indicates a metadata sync delay. Check the network connectivity and replay speed of the Follower / Observer FE relative to the Master FE.

Q: Compaction Score remains too high. How do I troubleshoot?

Focus on doris_be_tablet_base_max_compaction_score and doris_be_tablet_cumulative_max_compaction_score, and combine them with disk IO and the number of Compaction tasks to identify the bottleneck.

Q: Queries or loads are accumulating. How do I troubleshoot?

Check thread pool queueing metrics such as fragment_thread_pool_queue_size, doris_be_scanner_thread_pool_queue_size, and doris_be_send_batch_thread_pool_queue_size.

Applicable Scenarios​

Metric Categories​

Retrieving Monitoring Metrics​

Prometheus-Compatible Format (Default)​

JSON Format​

Monitoring Priority and Best Practices​

FE Monitoring Metrics​

Process Monitoring​

JVM Monitoring​

Machine Monitoring​

BE Monitoring Metrics​

Process Monitoring​

Machine Monitoring​

FAQ​

Q: The monitoring endpoint returns 404. What should I do?​

Q: Counter metrics only increase and never decrease. Is this normal?​

Q: The gap of doris_fe_max_journal_id between multiple FEs is too large. What should I do?​

Q: Compaction Score remains too high. How do I troubleshoot?​

Q: Queries or loads are accumulating. How do I troubleshoot?​

Applicable Scenarios

Metric Categories

Retrieving Monitoring Metrics

Prometheus-Compatible Format (Default)

JSON Format

Monitoring Priority and Best Practices

FE Monitoring Metrics

Process Monitoring

JVM Monitoring

Machine Monitoring

BE Monitoring Metrics

Process Monitoring

Machine Monitoring

FAQ

Q: The monitoring endpoint returns 404. What should I do?

Q: Counter metrics only increase and never decrease. Is this normal?

Q: The gap of `doris_fe_max_journal_id` between multiple FEs is too large. What should I do?

Q: Compaction Score remains too high. How do I troubleshoot?

Q: Queries or loads are accumulating. How do I troubleshoot?