Embedding

TL;DR Apache Doris ships a built-in EMBED() SQL function that sends text (or an image, video, or audio file) to an external embedding model and returns an ARRAY<FLOAT> ready for a column or a distance function. The model is configured once as an AI resource, then EMBED('your text') works anywhere a value is allowed. No UDFs, no client-side glue, no second service to operate.


Why use the EMBED function in Apache Doris?

The Apache Doris EMBED() function collapses the usual application-side embedding pipeline into one SQL call, so vectors are generated, stored, and queried inside the database without a separate ETL service. Vector search needs vectors, and vectors come from an embedding model. The usual setup looks like this: an application server pulls rows out of the database, sends them to OpenAI or a local model, gets vectors back, and writes them to a vector table. At query time the same round trip happens for the search string. You end up maintaining a small ETL service whose only job is to call an HTTP API and stash the result somewhere.

A few things tend to go wrong:

The pipeline drifts out of sync with the source table. New rows land without embeddings until the next batch run.
Backfills are jobs in their own right, often slower and more expensive than the original ingest.
Two teams own one logical step. The data team owns the rows, the application team owns the vectors, and nobody owns the gap.

EMBED() pulls the whole loop back into SQL. The model is an Apache Doris resource, the call is a function, and the result is a column you can persist with INSERT or compute on the fly.

What is the Apache Doris EMBED function?

The Apache Doris EMBED() function is a built-in AI function that takes text or a multimodal file reference, asks an external embedding provider for a vector, and returns an ARRAY<FLOAT>. Apache Doris reaches the provider through an AI resource you set up once with CREATE RESOURCE. The same function call works at load time, so you can persist embeddings into a column, and at query time, so you can embed the user's search string on the fly without leaving SQL.

Key terms

AI resource: a named connection to an embedding provider that stores the endpoint, model name, API key, and optional dimensions. EMBED() ships dedicated adapters for openai, gemini, voyageai, jina, qwen, and minimax, plus a local option for in-house model servers.
EMBED([resource_name], input): the SQL function. With one argument, Apache Doris uses the session default; with two, the first names the resource explicitly.
default_ai_resource: a session variable that picks a resource for the rest of the session.
Multimodal input: a JSON value describing a file (URI, content type, optional S3 credentials). Apache Doris presigns S3 URLs and forwards them to providers that support image, video, or audio embeddings.
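
A minimal sketch of the two call forms, assuming an AI resource named openai_embed already exists (the Quick start creates one). The input string is illustrative.

```sql
-- One-argument form: the resource comes from the session default.
SET default_ai_resource = "openai_embed";
SELECT EMBED('VPN setup guide') AS vec;

-- Two-argument form: the first argument names the resource explicitly.
SELECT EMBED('openai_embed', 'VPN setup guide') AS vec;
```

The explicit form is useful when a query has to use one particular model, for example the same model that produced the vectors already stored in a column.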

How does the Apache Doris EMBED function work?

The Apache Doris EMBED() function works in five stages: register the AI resource, plan the call on the FE, batch rows on the BE, retry on failures, and return one ARRAY<FLOAT> per row.

Register the model. CREATE RESOURCE stores the provider type, endpoint, model name, and credentials. Apache Doris validates the resource with a probe call to the provider, except when provider_type = "local", where the check is skipped.
Plan the call. When the planner sees EMBED(), it sends the work to the BE. The function processes the input column row by row, but it does not call the provider row by row.
Batch on the BE. The BE accumulates rows into a batch, capped by embed_max_batch_size (default 5 inputs) and by ai_context_window_size (default 128 KB of accumulated text). Each batch becomes one HTTP request, which keeps round trips and per-minute rate limits in check.
Retry and fail loudly. The provider call honors ai.max_retries and ai.retry_delay_second from the resource. If the API returns the wrong number of vectors, Apache Doris fails the query rather than risk misaligning rows with vectors.
Return a column. Every row gets its own ARRAY<FLOAT>. You can store it, feed it to cosine_distance, or load it into an ANN index for vector search.
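
The stages above can be sketched end to end as follows. This is an illustration under assumptions, not a reference configuration: the resource name openai_embed_retry, the staging table raw_notes, and the retry values are hypothetical. The property names ai.max_retries and ai.retry_delay_second are the ones named above, and notes is the table from the Quick start.

```sql
-- Stage 1: register the model, with retry settings stored on the resource.
-- The values 3 and 2 are illustrative, not recommended defaults.
-- ai.dimensions matches the Quick start so stored vectors stay comparable.
CREATE RESOURCE "openai_embed_retry" PROPERTIES (
  "type" = "ai", "ai.provider_type" = "openai",
  "ai.endpoint"   = "https://api.openai.com/v1/embeddings",
  "ai.model_name" = "text-embedding-3-small",
  "ai.api_key"    = "sk-xxx", "ai.dimensions" = "8",
  "ai.max_retries" = "3", "ai.retry_delay_second" = "2"
);

-- Stages 2 to 5: one set-based statement embeds a whole column.
-- raw_notes(id, body) is a hypothetical staging table.
INSERT INTO notes
SELECT id, body, EMBED('openai_embed_retry', body)
FROM raw_notes;
```

With the default embed_max_batch_size of 5, twelve short rows in raw_notes would be expected to reach the provider as three requests (5 + 5 + 2) rather than twelve, provided the ai_context_window_size cap is not hit first.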

Quick start

CREATE RESOURCE "openai_embed" PROPERTIES (
  "type" = "ai", "ai.provider_type" = "openai",
  "ai.endpoint"   = "https://api.openai.com/v1/embeddings",
  "ai.model_name" = "text-embedding-3-small",
  "ai.api_key"    = "sk-xxx", "ai.dimensions" = "8"
);

SET default_ai_resource = "openai_embed";

CREATE TABLE notes (id INT, body STRING, vec ARRAY<FLOAT>)
  DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1;

INSERT INTO notes VALUES
  (1, 'travel reimbursement policy', EMBED('travel reimbursement policy')),
  (2, 'VPN setup guide',             EMBED('VPN setup guide'));

SELECT id, body, cosine_distance(vec, EMBED('how do I expense a trip?')) AS d
FROM notes ORDER BY d ASC LIMIT 1;

Expected result

+----+-----------------------------+--------+
| id | body                        | d      |
+----+-----------------------------+--------+
|  1 | travel reimbursement policy | 0.4463 |
+----+-----------------------------+--------+

The INSERT precomputes one vector per row. The SELECT embeds the query string once at runtime, then ranks rows by cosine distance. Two API calls in total, both hidden inside SQL.

When should you use the Apache Doris EMBED function?

The Apache Doris EMBED() function fits RAG and semantic search over corpora that already live in Apache Doris, backfills that keep the work inside SQL, hybrid pipelines that pair vectors with keyword filters, and multimodal embeddings whose source files live in S3. It is not a fit for per-row embedding over millions of rows on the hot path, models outside the supported provider list, air-gapped clusters with no path to the provider, or columns whose ANN index dimension differs from the model's output dimension.

Good fit

RAG and semantic search where the corpus already lives in Apache Doris and you want one system to handle ingestion, embedding, and retrieval.
Backfills over existing tables: UPDATE t SET vec = EMBED(body) WHERE vec IS NULL keeps the work inside the database.
Hybrid pipelines that pair EMBED() with an ANN index for storage and a MATCH_* predicate for keyword filtering. See Hybrid Search.
Multimodal embeddings (image, video, audio) where the file lives in S3 and you would rather not write a download-then-upload script.
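
A sketch of the backfill pattern above. The assumptions: the table docs is hypothetical, it uses the Unique Key model because UPDATE in Doris targets Unique Key tables, and a default AI resource is already set for the session (for example openai_embed from the Quick start).

```sql
-- Hypothetical table; Unique Key model so that UPDATE is allowed.
CREATE TABLE docs (id INT, body STRING, vec ARRAY<FLOAT>)
  UNIQUE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1;

-- Backfill: embed only the rows that have no vector yet.
UPDATE docs SET vec = EMBED(body) WHERE vec IS NULL;

-- Count the rows that still lack a vector after the backfill.
SELECT COUNT(*) AS missing FROM docs WHERE vec IS NULL;
```

Because the WHERE clause skips rows that already have a vector, rerunning the statement after a partial failure only pays for the rows that are still missing.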

Not a good fit

Real-time queries that call EMBED() on every row of a multi-million-row scan. Rows are batched into provider requests, but every row is still new input to embed, so the bill and the latency scale with row count. Precompute once, reuse forever.
Workloads that need a specific embedding model not on the supported provider list. The providers with an embedding adapter are OpenAI, Gemini, VoyageAI, Jina, Qwen, and MiniMax, plus local for an in-house model server. Apache Doris's broader AI resource list also accepts DeepSeek, MoonShot, Anthropic, Zhipu, and Baichuan, but those route through the LLM SQL Functions (chat and completion only), not EMBED(). Any other model must expose an OpenAI- or Gemini-compatible API to plug in.
Air-gapped clusters with no path to the provider. Use provider_type = "local" against an in-house model server, or precompute vectors in your own pipeline.
Columns where the embedding model's dimension does not match the ANN index. The ARRAY length is fixed by the model (or by ai.dimensions for models that allow truncation); the index dimension is fixed at table creation. A mismatch fails at write time.
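
For the air-gapped and dimension-mismatch cases above, a hedged sketch. The resource name local_embed, the endpoint URL, and the model name are hypothetical placeholders for an in-house server, and whether that server also needs an API key depends on how it is deployed. array_size is used here only to inspect the vector length.

```sql
-- Hypothetical in-house model server. The validation probe is skipped for
-- provider_type = "local", so mistakes here may surface only at the first
-- EMBED() call.
CREATE RESOURCE "local_embed" PROPERTIES (
  "type" = "ai", "ai.provider_type" = "local",
  "ai.endpoint"   = "http://embed.internal:8000/v1/embeddings",
  "ai.model_name" = "my-embedding-model"
);

-- Check the vector length the model actually returns before fixing
-- the dimension of an ANN index at table creation.
SELECT array_size(EMBED('local_embed', 'dimension probe')) AS dim;
```

Running the probe once per resource is cheap, and it catches a dimension mismatch before any rows are written.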

Cost and throughput notes

EMBED() runs only as fast as the provider behind it. Before you put EMBED() on a hot path, a few things are worth knowing:

External APIs are rate limited. Raising embed_max_batch_size cuts round trips but pushes more tokens into each request, so tune it with both the provider's request limits and its token limits in mind.
A row whose input exceeds ai_context_window_size gets its own batch automatically, so one oversized document does not stall the rest of the query.
The provider bills you for input tokens, not for rows. Embedding the same text twice costs twice. If you will read a vector more than once, persist it.
local providers skip credential checks and run inside your network. That removes per-token billing but trades it for an inference server you have to operate.
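
A rough way to size a backfill before running it, assuming the default embed_max_batch_size of 5 and the notes table from the Quick start. The request count is a lower bound, since the ai_context_window_size cap can split batches further, and byte length is only a loose proxy for the tokens the provider bills.

```sql
SELECT
  COUNT(*)           AS rows_to_embed,
  CEIL(COUNT(*) / 5) AS min_requests,      -- 5 = default embed_max_batch_size
  SUM(LENGTH(body))  AS total_input_bytes  -- loose proxy for billed tokens
FROM notes
WHERE vec IS NULL;
```

Comparing min_requests against the provider's per-minute request limit gives a first estimate of how long the backfill will take.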
