
Apache Doris

What Is LLM Observability?

LLM observability is the ability to understand, monitor, and debug how a large language model behaves in a real-world application.

In practice, it focuses on making LLM systems more transparent by capturing what the model sees, what it produces, and how it arrives at those outputs across a full interaction.

It typically includes:

  • tracing LLM calls and multi-step workflows
  • monitoring inputs (prompts, context) and outputs
  • evaluating response quality and correctness
  • tracking latency, token usage, and cost
  • analyzing how system components (e.g., retrieval, tools) influence results

Unlike traditional monitoring, which focuses on system health (such as uptime or error rates), LLM observability focuses on model behavior and decision outcomes.

This distinction is important because LLM systems are not purely deterministic. Observability is not just about detecting failures—it is about understanding why a response was generated, whether it was appropriate, and how it could be improved.

In modern AI applications, LLM observability often spans the entire pipeline, including prompt construction, retrieval (in RAG systems), model inference, and post-processing. This broader scope helps teams debug issues such as hallucinations, irrelevant answers, or inconsistent behavior.

Why LLM Observability Matters (Beyond Traditional Monitoring)

LLM systems are fundamentally harder to monitor than traditional software systems.

The main reasons include:

  • Non-deterministic outputs: The same input can produce different responses, making issues difficult to reproduce and debug.
  • Prompt-driven behavior: Small changes in prompts or context can lead to large differences in output, even when the underlying model remains the same.
  • Hidden reasoning (black-box models): Most LLMs do not expose internal reasoning processes, so developers must rely on indirect signals to understand behavior.
  • Multi-step pipelines (RAG and agents): Many systems involve retrieval, tool usage, or chained model calls, where failures can originate from multiple points.

As a result, traditional monitoring signals—such as latency, uptime, or error rates—provide only a partial view of system performance.

LLM observability is designed to address this gap by providing visibility into how inputs are transformed into outputs across the entire system.

It helps answer questions such as:

  • Why did the model generate this response?
  • Was the retrieved context relevant?
  • Is the issue caused by the prompt, the model, or the data?
  • How does output quality change over time?

In practice, this deeper visibility is essential for:

  • debugging hallucinations and incorrect answers
  • improving prompt and system design
  • maintaining consistent user experience
  • controlling cost and performance at scale

Without observability, LLM systems can appear to work while silently degrading in quality or reliability. With observability, teams can move from reactive debugging to systematic improvement.

What to Monitor in LLM Systems (Key Signals)

Effective LLM observability tracks both system-level metrics and model-specific signals that reflect how the LLM behaves in real-world usage.

In practice, effective observability focuses not just on whether the system is running, but whether it is producing useful, reliable, and cost-efficient outputs.

Input and Prompt Monitoring

Tracking prompts and user inputs helps identify issues at the very beginning of the pipeline.

This includes:

  • prompt injection or unsafe inputs
  • unclear or poorly structured prompts
  • unexpected user behavior patterns

Because LLM outputs are highly sensitive to input phrasing, even small changes in prompts can lead to significantly different results. Monitoring inputs is often the fastest way to diagnose inconsistent behavior.

Output Quality and Evaluation

Evaluating outputs is one of the most important—and most challenging—parts of LLM observability.

Common evaluation dimensions include:

  • relevance (does the answer match the question?)
  • correctness (is the information accurate?)
  • consistency (does the model behave predictably?)
  • safety (does the output avoid harmful or biased content?)

In practice, most systems combine:

  • automated evaluation (e.g., scoring, heuristics)
  • human review or feedback loops

Since many LLM tasks are open-ended, output quality cannot be captured by a single metric and often requires context-aware evaluation.
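As a minimal sketch of the automated side, a crude heuristic can score how much of the question an answer echoes. A real system would pair something like this with model-based judges and human review; the function below is only an assumption-laden illustration:

```python
def keyword_overlap_score(question: str, answer: str) -> float:
    """Crude relevance heuristic: fraction of question terms echoed in the answer.

    This is a sketch, not a production evaluator; it ignores synonyms,
    word order, and meaning, which is exactly why real systems layer
    model-based and human evaluation on top of heuristics like this.
    """
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & a_terms) / len(q_terms)
```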

Latency and Cost

LLM systems often introduce a new category of operational constraints: cost per request.

Key signals include:

  • response time (end-to-end latency)
  • token usage (input and output tokens)
  • cost per query or per user

Monitoring these signals is essential not only for performance optimization but also for maintaining sustainable system design at scale.

In many cases, improving latency or reducing token usage can have a direct impact on both user experience and infrastructure cost.
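A sketch of per-request cost accounting from token counts; the per-1K-token prices below are assumed placeholders, not any provider's actual rates:

```python
# Assumed per-1K-token prices for illustration; real prices vary by
# provider and model and change over time.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request from its token counts."""
    return (input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"])
```

Aggregating this per user or per feature is what makes cost-per-query dashboards possible.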

Retrieval Quality (RAG Systems)

In systems that use Retrieval-Augmented Generation (RAG), many failures originate from the retrieval step rather than the model itself.

Important signals include:

  • whether relevant documents are retrieved
  • how well retrieved context matches the user query
  • whether the model actually uses the retrieved information

Poor retrieval can lead to hallucinations or irrelevant answers, even when the underlying model performs well, which is why retrieval monitoring is a critical part of LLM observability. In systems that rely heavily on retrieval, analyzing retrieval logs and query patterns becomes especially important. This often requires systems capable of handling large volumes of structured and semi-structured data, where analytical databases such as Apache Doris may be used to support query analysis and debugging workflows.
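One way to quantify the first signal is context recall: the fraction of known-relevant documents the retriever actually returned. This sketch assumes you have labeled relevant IDs for a sample of queries:

```python
def context_recall(retrieved_ids, relevant_ids) -> float:
    """Fraction of known-relevant documents present in the retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)
```

Tracking this metric over time surfaces retrieval regressions before they show up as hallucinated answers.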

Errors, Failures, and Edge Cases

LLM failures often look different from traditional system errors.

Instead of explicit crashes, issues may appear as:

  • incomplete or vague responses
  • hallucinated or fabricated information
  • incorrect tool usage in agent systems
  • unexpected or off-topic outputs

These edge cases are often harder to detect because they may not trigger standard error signals. Observability systems therefore need to capture both explicit failures and subtle quality degradations.

A Practical Insight

No single metric can fully capture LLM performance.

Most production systems rely on a combination of:

  • quantitative metrics (latency, token usage, error rates)
  • qualitative evaluation (human feedback, relevance scoring)
  • system-level signals (retrieval quality, workflow traces)

Effective LLM observability is not about tracking more metrics—it is about tracking the right signals and understanding how they interact.

How LLM Observability Works (System-Level View)

In a modern AI system, observability is not a single component—it spans the entire pipeline.

A typical LLM-powered workflow looks like this:

[Architecture diagram: user input → prompt construction → retrieval → model inference → post-processing → response]

Observability works by capturing signals at each step of this pipeline.

For example:

  • tracing how a request flows through multiple components
  • capturing prompts and generated outputs
  • logging retrieval results and context
  • measuring latency and token usage
  • evaluating output quality

This allows teams to reconstruct what happened during a specific interaction and identify where issues originate—whether in the prompt, retrieval step, or model response.

In practice, observability data is often analyzed across many interactions, helping identify recurring failure patterns, performance bottlenecks, or cost inefficiencies.

LLM Observability vs Monitoring vs AI Observability

These terms are often used interchangeably, but they represent different levels of system visibility and serve different purposes in practice.

At a high level:

  • Monitoring focuses on detecting issues through metrics and alerts
  • Observability focuses on understanding system behavior
  • LLM observability focuses specifically on how language models behave in real-world applications
  • AI observability covers broader machine learning systems beyond just LLMs

The main differences include:

  • Monitoring: tracks system metrics such as latency, uptime, and errors
  • Observability: provides deeper insight into system behavior using logs, traces, and metrics
  • LLM observability: focuses on prompts, outputs, and model behavior in LLM systems
  • AI observability: covers broader machine learning systems, including training and inference

A Practical Way to Think About the Differences

A useful way to understand the relationship between these concepts is:

  • Monitoring tells you when something is wrong
  • Observability helps you understand why it is wrong
  • LLM observability explains how the model contributed to the problem
  • AI observability provides a broader view across all ML systems

These layers are not mutually exclusive—they are often used together in production systems.

Common Challenges in LLM Observability

In practice, implementing LLM observability is far from trivial.

Unlike traditional systems, many issues in LLM applications are not clearly defined as “failures,” which makes them harder to detect and diagnose.

Key challenges include:

Evaluating subjective outputs

Many LLM responses do not have a single correct answer. A response can be technically correct but still irrelevant, incomplete, or poorly phrased. This makes evaluation highly context-dependent and difficult to standardize.

Lack of ground truth

In many use cases—such as open-ended Q&A or conversational systems—there is no definitive reference answer. As a result, it can be difficult to measure accuracy or track improvements over time.

High cost of logging and storage

Capturing prompts, outputs, traces, and intermediate steps at scale can quickly become expensive. Teams often need to balance observability depth with storage and processing costs.

Debugging multi-step pipelines

Modern LLM systems often include retrieval (RAG), tools, or chained model calls. When something goes wrong, the root cause may lie in any part of the pipeline, making debugging more complex.

Noisy signals (false positives)

Metrics do not always reflect real user experience. For example, a response may pass automated evaluation but still be unhelpful to users, or vice versa.

A common pattern is that collecting observability data is relatively easy, but interpreting it correctly—and turning it into actionable improvements—is significantly harder.

LLM Observability Tools (And How to Choose)

LLM observability tools generally fall into a few categories, each addressing a different part of the problem.

Tracing-focused tools

These tools capture how requests flow through the system, including prompts, model calls, and intermediate steps. They are useful for debugging workflows and understanding execution paths.

Evaluation-focused tools

These tools focus on measuring output quality using automated scoring, benchmarks, or human feedback. They help assess whether the system is producing useful and accurate results.

Full-stack observability platforms

These platforms combine tracing, evaluation, and monitoring, providing a more complete view of system behavior across the entire pipeline.

Choosing the right approach depends on several factors:

  • the complexity of the application (simple chat vs multi-step AI systems)
  • whether the system includes RAG or agents
  • the need for real-time monitoring versus offline analysis
  • scalability, data volume, and cost constraints

In practice, many production systems use a combination of tools rather than relying on a single solution.

A useful way to think about this is that tracing helps you understand what happened, evaluation helps you understand how good the result was, and monitoring helps you track system performance over time.

Best Practices for LLM Monitoring and Observability

Common best practices include:

Start with tracing before optimization

Before improving performance or quality, it is important to understand how the system behaves end-to-end. Tracing provides the foundation for identifying bottlenecks and failure points.

Evaluate outputs, not just system metrics

Latency and cost are important, but they do not reflect whether the system is actually useful. Output quality—relevance, correctness, and clarity—should be treated as a first-class signal.

Combine automated and human evaluation

Automated metrics can scale, but they may miss subtle issues in language quality. Human feedback helps capture real-world usefulness and edge cases.

Monitor retrieval in RAG systems

In many cases, issues attributed to the model are actually caused by poor retrieval. Monitoring retrieval quality is essential for diagnosing these problems.

Design for cost visibility early

Token usage and infrastructure costs can increase rapidly as usage grows. Tracking cost-related metrics early helps prevent unexpected scaling issues.

In practice, effective observability is not about collecting more data, but about focusing on the signals that directly impact system behavior and user experience.

The Future of LLM Observability

LLM observability is evolving as AI systems become more complex and move into production environments.

Several trends are emerging:

Agent observability

As AI agents become more common, observability is expanding to cover multi-step reasoning, tool usage, and decision chains rather than single model calls.

Real-time evaluation

Systems are shifting from offline analysis to continuous, real-time feedback, allowing faster iteration and adaptation.

AI-native monitoring approaches

New approaches are being developed specifically for generative AI workloads, where traditional monitoring methods are not sufficient.

Feedback-driven improvement loops

User interactions, feedback signals, and evaluation results are increasingly used to continuously improve prompts, retrieval strategies, and system behavior.

Overall, LLM observability is becoming a core part of how AI systems are designed, operated, and improved over time.

FAQ

Why is observability critical for LLMs?

LLM observability helps control costs, reduce the risk of hallucinations or harmful outputs, and continuously improve prompt quality and system performance.

What are traces in LLM observability?

Traces record the full sequence of events in an LLM system—from user input to final output—including prompt construction, retrieval steps, API calls, and model responses. They are essential for debugging and understanding system behavior.


Apache Doris

What Is Vector Search?

Vector search is a search method that retrieves results based on semantic similarity rather than exact keyword matches.

Instead of matching words directly, vector search converts data—such as text, images, or logs—into numerical representations called embeddings (vectors). It then compares these vectors in a high-dimensional space to find the most similar results.

The key idea behind vector search is that similar meanings are represented by vectors that are close to each other.

The main characteristics of vector search include:

  • Understanding user intent rather than exact wording
  • Supporting unstructured data such as text and images
  • Enabling AI applications like semantic search and RAG

How Vector Search Works (Step-by-Step)

Vector search follows a simple but powerful pipeline. Instead of matching exact words, it converts both the data and the query into numerical representations and then compares them based on similarity.

1. Convert Data into Embeddings

The first step is to convert raw data into embeddings.

Embeddings are numerical vectors generated by machine learning models that capture the semantic meaning of the input. These inputs can include:

  • text documents
  • product descriptions
  • images
  • logs or events

For example, two sentences with similar meanings may produce vectors that are located close to each other in vector space, even if they do not share the same keywords.

2. Store Vectors in a Vector Database

Once generated, these embeddings are stored in a vector database or another system that supports vector indexing.

Unlike traditional databases that are optimized for exact filtering, vector search systems are designed to store high-dimensional vectors and retrieve the nearest matches efficiently. This is especially important when dealing with millions or billions of embeddings.

In production systems, vector data is often stored alongside metadata such as:

  • document ID
  • timestamp
  • category
  • status

This allows vector search to be combined with structured filtering.
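A brute-force, in-memory sketch of this idea — vectors stored next to their metadata, with optional structured filtering before similarity ranking. It is an illustration only, not a real vector database:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


class TinyVectorStore:
    """Brute-force store keeping vectors alongside metadata (a sketch;
    real vector databases use ANN indexes instead of a linear scan)."""

    def __init__(self):
        self.rows = []  # list of (vector, metadata) pairs

    def add(self, vector, **metadata):
        self.rows.append((vector, metadata))

    def search(self, query, top_k=3, **filters):
        # Apply metadata filters first, then rank survivors by similarity.
        candidates = [
            (vec, meta) for vec, meta in self.rows
            if all(meta.get(k) == v for k, v in filters.items())
        ]
        ranked = sorted(candidates, key=lambda r: cosine(query, r[0]), reverse=True)
        return [meta for _, meta in ranked[:top_k]]
```

The `search(..., category="db")` style of call is what "vector search combined with structured filtering" looks like in miniature.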

3. Convert the Query into a Vector

When a user submits a search query, the system applies the same embedding model to the query itself.

This produces a query vector that can be compared directly against the stored vectors. Because both the query and the data are represented in the same vector space, the system can search for semantic similarity rather than exact wording.

For example, a query like:

How to reduce database latency

may still retrieve content containing phrases such as:

improve query performance
speed up data access

even if the exact words do not match.

4. Search for the Nearest Vectors

After the query vector is created, the system searches for the nearest vectors in the database.

This is typically done using similarity metrics such as:

  • Cosine similarity: Measures how close two vectors are in direction.
  • Euclidean distance: Measures the geometric straight-line distance between vectors.
  • Dot product: Often used in embedding-based retrieval systems to measure magnitude and direction.

Because exact nearest-neighbor search can be expensive at large scale, most production systems use Approximate Nearest Neighbor (ANN) algorithms to speed up retrieval while keeping results highly relevant.
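The three metrics can be written directly from their definitions. This sketch operates on plain Python lists; production systems compute the same quantities inside optimized ANN indexes rather than with brute-force loops:

```python
import math


def cosine_similarity(a, b):
    """How close two vectors are in direction, independent of magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line geometric distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Combines magnitude and direction; common in embedding retrieval."""
    return sum(x * y for x, y in zip(a, b))
```

Note the conventions differ: higher is better for cosine similarity and dot product, while lower is better for Euclidean distance.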

5. Re-rank and Return the Best Results

In many real-world AI systems, vector search is only the first retrieval step.

The initial results may then be:

  • filtered using metadata
  • combined with keyword search
  • re-ranked by another model

This improves precision and ensures that the final results are both semantically relevant and contextually useful.

This is why modern AI search systems often use hybrid search, combining vector similarity with structured filters or keyword relevance.

In essence, vector search retrieves similar meaning, not just exact matches.
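A sketch of the re-ranking/blending step described above, assuming candidates already carry normalized vector and keyword scores; the weight `alpha` is a hypothetical tuning knob, not a standard value:

```python
def rerank(candidates, alpha=0.7, top_k=3):
    """Blend semantic and lexical relevance, then keep the best results.

    candidates: dicts with 'vector_score' and 'keyword_score' in [0, 1]
    (assumed to be pre-normalized). alpha weights the semantic signal.
    """
    scored = sorted(
        candidates,
        key=lambda c: alpha * c["vector_score"] + (1 - alpha) * c["keyword_score"],
        reverse=True,
    )
    return scored[:top_k]
```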

Vector Search vs Keyword Search vs Semantic Search

To understand where vector search fits in, it helps to compare it with two other commonly used approaches: keyword search and semantic search. While these terms are often used interchangeably, they actually represent different ways of thinking about search.

  • Keyword search: matches exact keywords; built on inverted indexes (BM25); high precision but low flexibility; typical tools: Elasticsearch
  • Semantic search: matches meaning at the conceptual level; built on NLP models; moderate accuracy and medium flexibility; typical tools: search engines
  • Vector search: matches meaning via vector similarity; built on embeddings + ANN; high, context-aware accuracy and high flexibility; typical tools: vector databases

Traditional keyword search is based on matching exact words. Systems like Elasticsearch use techniques such as inverted indexes and BM25 to find documents that contain the same terms as the query.

This approach works well when users know exactly what they are looking for. For example, if someone searches for a specific error code or product name, keyword search can return highly precise results very quickly.

However, keyword search struggles when the wording changes. If a user searches for “how to fix database performance,” a keyword-based system may miss relevant content that uses different phrasing like “optimize query latency” or “improve database speed.”

This is where vector search becomes useful.

Instead of matching words, vector search matches meaning. It converts both the query and the data into embeddings, and then retrieves results that are semantically similar—even if they don’t share the same keywords.

In practice, this means vector search can handle:

  • synonyms
  • paraphrased queries
  • natural language input

But it also comes with trade-offs. While vector search is more flexible, it can sometimes return results that are less precise, especially without additional filtering or re-ranking.

The difference between vector search and semantic search is more subtle, and often misunderstood.

Semantic search is not a specific technology—it’s a goal. It refers to the idea of understanding the intent behind a query, rather than just matching words. For example, recognizing that “cheap laptop” and “affordable notebook” mean the same thing.

Vector search is one of the most common ways to implement semantic search in modern systems.

By representing text as embeddings, vector search turns meaning into something that can be computed mathematically. This allows systems to compare concepts at scale and retrieve relevant results efficiently.

In modern AI systems, the two are closely connected:

  • semantic search defines what the system is trying to do (understand meaning)
  • vector search defines how the system actually does it (compute similarity using vectors)

You can see this clearly in applications like RAG (Retrieval-Augmented Generation). The system first interprets the user’s intent (semantic understanding), and then uses vector search to retrieve the most relevant pieces of information.

Vector Search Example

Vector search is already used in many real-world applications, often without users realizing it. Here are a few common examples that show how it works in practice.

Retrieval-Augmented Generation (RAG)

In AI systems such as ChatGPT-style assistants, vector search plays a critical role in retrieving relevant information.

When a user asks a question, the system:

  • converts the query into an embedding
  • searches for similar documents using vector search
  • passes the retrieved context into the LLM

For example, if a user asks:

How do I fix high database latency?

The system may retrieve documents about:

  • query optimization
  • indexing strategies
  • caching techniques

—even if those exact words do not appear in the query.

This can help reduce hallucinations and improve answer grounding, especially when the retrieval pipeline is well-designed.

Image Similarity Search

Vector search is also widely used in image-based applications.

When you upload an image, it is converted into a vector representation that captures visual features such as shapes, colors, and patterns. The system then finds images with similar vectors.

This is commonly used in:

  • e-commerce (“find similar products”)
  • visual search engines
  • design and inspiration tools

Instead of matching metadata, the system understands visual similarity directly.

Recommendation Systems

Streaming platforms like Netflix or Spotify rely heavily on vector search to power recommendations.

Users and content are represented as vectors based on behavior, preferences, and attributes. The system then recommends items that are “close” in vector space.

For example:

  • users who watch similar content → similar vectors
  • movies with similar themes → similar vectors

This allows platforms to recommend content that feels relevant, even if users cannot explicitly describe what they want.

Vector Search Architecture (for AI Systems)

To understand why vector search is so important, it helps to look at how it fits into a modern AI system.

A typical architecture looks like this:

[Architecture diagram: user query → embedding model → vector database → metadata filtering → re-ranking → LLM]

When a user submits a query, the system does not search it directly as text.

Instead:

  1. The query is converted into an embedding using a model (e.g., OpenAI, BERT, or other embedding models)
  2. The vector database retrieves the most similar items based on vector similarity
  3. The system may apply additional filtering (e.g., time range, metadata)
  4. A re-ranking step improves precision by selecting the most relevant results
  5. The final results are passed into an LLM to generate a response
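The five steps above can be wired together as pluggable stages. Every name in this sketch is illustrative — the lambdas in the usage example stand in for a real embedding model, vector database, re-ranker, and LLM:

```python
def rag_pipeline(query, embed, vector_search, rerank, generate,
                 metadata_filter=None):
    """Wire the five pipeline stages together (illustrative, not a framework).

    Each argument is a callable supplied by the caller; metadata_filter
    is an optional predicate applied to each retrieved item.
    """
    query_vec = embed(query)                      # 1. embed the query
    hits = vector_search(query_vec)               # 2. nearest-neighbor retrieval
    if metadata_filter is not None:               # 3. optional structured filtering
        hits = [h for h in hits if metadata_filter(h)]
    hits = rerank(query, hits)                    # 4. precision re-ranking
    return generate(query, hits)                  # 5. grounded generation
```

With stub stages this runs end to end, which is also a convenient way to test pipeline wiring without calling real models:

```python
answer = rag_pipeline(
    "q",
    embed=lambda q: [1.0],
    vector_search=lambda v: ["doc1", "doc2"],
    rerank=lambda q, hits: hits[:1],
    generate=lambda q, hits: f"{q}:{hits[0]}",
)
```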

Why This Architecture Matters

Traditional keyword-based systems were originally optimized for lexical matching rather than semantic retrieval, so they may require additional components to support this kind of pipeline effectively.

They struggle with:

  • natural language queries
  • unstructured data
  • semantic understanding

Vector search, on the other hand, enables:

  • semantic retrieval at scale
  • real-time AI applications
  • integration with LLM workflows (RAG)

A Practical Insight

In real-world systems, vector search is rarely used alone.

Most production systems combine:

vector search + keyword search + structured filtering

This is known as hybrid search, and it balances:

  • precision (keyword search)
  • flexibility (vector search)

The Role of a Vector Search Database (And How to Choose)

Vector search is not just an algorithm—it requires a system that can store and retrieve high-dimensional vectors efficiently at scale. This is where vector search databases come in.

What Is a Vector Search Database?

A vector search database is a system specifically designed to handle embeddings and perform fast similarity search.

Unlike traditional databases that focus on exact matching and filtering, vector databases are optimized for:

  • storing high-dimensional vectors (embeddings)
  • performing nearest-neighbor search efficiently
  • scaling to millions or billions of data points

In practical terms, a vector search database allows you to take a query, convert it into an embedding, and quickly find the most similar items—even in very large datasets.

Vector Search vs. Elasticsearch (and Traditional Search Systems)

Elasticsearch and similar systems were originally built for keyword-based search using inverted indexes.

This makes them extremely effective for:

  • exact matches
  • filtering and aggregation
  • structured queries

However, their original strength lies in lexical retrieval, filtering, and aggregation rather than vector-native similarity search.

Modern versions of Elasticsearch now support vector search, but there is still a conceptual difference:

  • Keyword search (Elasticsearch classic) → matches exact terms
  • Vector search → matches semantic similarity

In real-world systems, these approaches are often combined.

For example, a system might use keyword search for precision and vector search for semantic relevance.

Dedicated Vector Databases vs. Integrated Databases

As vector search has grown, two main approaches have emerged.

Dedicated Vector Databases

Examples include Pinecone, Milvus, and Qdrant.

These systems are built specifically for vector similarity search and are typically easy to adopt for AI use cases.

They work well when:

  • the primary requirement is vector retrieval
  • the system is relatively simple
  • structured filtering is minimal

However, they may require additional systems to handle analytics, filtering, or complex queries.

Integrated Analytical Databases

In real-world applications, vector search rarely exists in isolation.

Most production systems need to combine:

  • vector search (for semantic similarity)
  • metadata filtering (time, status, user, etc.)
  • aggregation and analytics
  • real-time data ingestion

For example, a real query might look like:

“Find logs similar to this error, from yesterday, where status = failed”

This is not just a vector search problem—it is a hybrid query that requires both semantic understanding and structured filtering. Some analytical databases, such as Apache Doris, follow this integrated approach by supporting vector similarity search together with real-time analytics, filtering, and aggregation in a single system. This allows teams to simplify architecture when building AI applications that require both semantic retrieval and structured queries.

How to Choose the Right Approach

Choosing between different types of vector search systems depends on your use case.

Choose a dedicated vector database if:

  • your workload is primarily similarity search
  • you are building a prototype or early-stage AI feature

An integrated analytical database may be a good fit if:

  • you need vector retrieval together with filtering, analytics, and real-time data ingestion
  • your workload involves logs, events, or operational analytics
  • you want to reduce the number of systems used in a production pipeline

Limitations of Vector Search

Despite its advantages, vector search is not a perfect solution and comes with several practical limitations.

One of the main trade-offs is between accuracy and performance. Most production systems rely on Approximate Nearest Neighbor (ANN) algorithms to achieve fast retrieval at scale, but this means the results may not always be the exact closest matches.

Another challenge is computational cost. Generating embeddings and performing similarity search—especially across large datasets—can be resource-intensive, requiring optimized infrastructure and indexing strategies.

As data volume grows, latency can also become an issue. Maintaining low response times while searching millions or billions of vectors requires careful system design.

In addition, vector search alone may lack precision in certain scenarios. Because it focuses on semantic similarity, it can sometimes return results that are related but not strictly relevant. This is why many systems introduce a re-ranking step or combine vector search with structured filters.

In practice, most production systems use hybrid search, combining vector search with keyword search and filtering to balance relevance and precision.

The Future of Vector Search

Vector search is evolving rapidly as AI systems become more complex and data-driven.

One clear trend is the rise of hybrid search, where vector similarity is combined with keyword matching and structured filtering. This approach allows systems to balance semantic understanding with precision, and is quickly becoming the default in production environments.

Another major shift is the adoption of Retrieval-Augmented Generation (RAG). As LLM-based applications become more common, vector search is increasingly used to retrieve external knowledge and improve model accuracy.

We are also seeing the emergence of AI agents and memory systems, where vector search is used to store and retrieve past interactions or contextual information. In this setting, vector databases effectively act as a form of long-term memory for AI systems.

At the infrastructure level, real-time vector analytics is becoming more important. Instead of working only on static datasets, modern systems need to handle streaming data, logs, and events while still supporting fast similarity search.

Overall, vector search is moving from a niche technique to a core component of modern AI and data infrastructure.

FAQ

Can traditional databases support vector search?

Yes, many modern databases support vector search either through extensions (such as pgvector) or built-in vector data types. However, performance and scalability depend on how well the system is optimized for high-dimensional similarity search.

What is hybrid search?

Hybrid search combines keyword search (for precision) with vector search (for semantic understanding). This approach is widely used in modern AI systems because it provides both accuracy and flexibility.