Skip to main content

Bitmap and HLL Data Format

This document targets developers who need to understand or parse the Apache Doris Bitmap / HLL binary format. It describes the byte layout and Flag meanings for each type, and points to the C++ and Java serialization / deserialization entry points.

Contents


Bitmap Format

Overview

Doris stores Bitmaps with Roaring Bitmap. The BE side uses CRoaring.

  • The Roaring serialization format is compatible across C++, Java, Go, and other languages.
  • The C++ Roaring64Map serialization result is not compatible with Java's Roaring64NavigableMap.

Doris Bitmap has five types in total. Each type is identified by a one-byte flag. The overall byte layout is:

 | flag     | data .....|
<--1Byte--><--n bytes-->

Flag Values

FlagNameData LayoutType Description
0EMPTYNo data. The whole serialization result is a single byte.Empty Bitmap
1SINGLE324 bytes representing one 32-bit unsigned integer.The Bitmap contains a single 32-bit integer value.
2BITMAP32The serialization result of roaring::Roaring.32-bit Bitmap. On the Java side this corresponds to org.roaringbitmap.RoaringBitmap, and on the C++ side to roaring::Roaring. You can deserialize it directly with these types.
3SINGLE648 bytes representing one 64-bit unsigned integer.The Bitmap contains a single 64-bit integer value.
4BITMAP64A 1-to-8-byte variable-length-encoded uint64 representing the size of the Bitmap, followed by repeated [4-byte high bits + 32-bit Roaring Bitmap serialization data] entries.64-bit Bitmap. On the Java side this corresponds to org.roaringbitmap.RoaringBitmap, and on the C++ side to Doris's Roaring64Map. The data structure matches the Roaring library, but the serialization / deserialization methods differ.
5SET1 byte representing the number of values, followed by all values (8 bytes per value).When the number of values in the Bitmap is between 1 and 32, the actual storage type is a hashset.

HLL Format

Overview

The HLL serialization format is implemented within Doris. Similar to Bitmap, the binary layout of HLL is a one-byte flag followed by multi-byte data:

 | flag     | data .....|
<--1Byte--><--n bytes-->

Flag Values

FlagNameData LayoutType Description
0HLL_DATA_EMPTYNo data. The whole serialization result is a single byte.Empty HLL
1HLL_DATA_EXPLICIT1 byte for the number of explicit data blocks, followed by the data blocks. Each data block consists of an 8-byte length and the data.Explicitly stores all distinct values.
2HLL_DATA_SPARSE4 bytes for the number of registers, followed by the registers. Each register consists of a 2-byte index and a 1-byte value.Sparse storage. Only non-zero values are stored.
3HLL_DATA_FULLA contiguous block of 16 * 1024 bytes of value data.Indicates that all 16 * 1024 registers have values.

Serialization / Deserialization Entry Points

TypeLanguageFileMethod
BitmapC++be/src/util/bitmap_value.hBitmapValue::write() / BitmapValue::deserialize()
BitmapJavafe/fe-common/src/main/java/org/apache/doris/common/io/BitmapValue.javaserialize() / deserialize()
HLLC++be/src/olap/hll.hserialize() / deserialize()
HLLJavafe/fe-common/src/main/java/org/apache/doris/common/io/hll.javaserialize() / deserialize()

FAQ

Q: Why can a 64-bit Bitmap written by the C++ side not be deserialized with Java's Roaring64NavigableMap?

A: The serialization format of Doris C++ Roaring64Map is not compatible with Java's Roaring64NavigableMap. When exchanging 64-bit Bitmaps across languages, use the format defined by Doris Bitmap itself (BITMAP64, Flag = 4) and parse it according to the byte layout above.

Q: How do you tell whether a piece of Bitmap binary is 32-bit or 64-bit?

A: Read the first byte as the Flag: 0/1/2 indicates 32-bit semantics (empty / single value / Roaring32), 3/4 indicates 64-bit semantics, and 5 indicates a hashset (with 64-bit values).

Q: Why is the total number of HLL registers 16 * 1024?

A: The Doris HLL implementation uses a fixed 16384 registers (precision 2^14). HLL_DATA_FULL is the dense representation where all registers have values.