4 posts tagged "Top News"

Blog/Top News

Zaki Lu

OpenAI dropped a bomb on the data world by announcing its acquisition of Rockset, a cloud-based, fully managed analytical database. Amid all the congratulations, one question stands out: why Rockset?

[Figure: OpenAI acquisition of Rockset]

Founded in 2016 by Venkat Venkataramani, a former Engineering Director at Meta, Rockset focuses on real-time search and data analytics. Compared to other DBMSs, Rockset stands out for its:

  • Real-time data updates: Rockset keeps data fresh by fetching and delivering the latest data. It supports real-time updates at the granularity of individual data fields, which can be performed within milliseconds.

  • Converged index: It combines the benefits of an inverted index, columnar storage, and row-oriented storage to provide efficient and flexible data querying.

  • Native support for semi-structured data: Rockset is well-suited to the growing demand for semi-structured data processing, hash joins, and nested loop joins.

  • SQL and JOIN compatibility: The Search Index of Rockset is optimized for various join queries.

The news also handed all Rockset users a ticking time bomb: they have to find an appropriate alternative to Rockset for their own use case within three months. This, of course, presents an opportunity for other analytical databases on the market. However, of all the self-proclaimed alternatives, only a few cover all of the above-mentioned key features of Rockset. Among them, Apache Doris is worth looking into.

As an open-source real-time data warehouse, Apache Doris is trusted by over 4000 enterprise users worldwide. Its key capabilities include:

  • Real-time data updates: Apache Doris supports not only real-time updates and deletion, but also real-time partial column updates, making it particularly useful in cases involving frequent data updates.

  • Row/column hybrid storage: Apache Doris is a column-oriented data warehouse that achieves world-leading OLAP performance on ClickBench. Additionally, it supports row-oriented storage to serve high-concurrency point query scenarios, which allows it to respond to almost a million query requests within milliseconds.

  • Inverted index and full-text searches: Apache Doris provides high efficiency and flexibility in keyword searching. It allows index creation on all fields and a flexible combination of data fields for multi-dimensional data analysis.

  • Native support for semi-structured data: Apache Doris has introduced the VARIANT data type to accommodate semi-structured data. It enables flexible data schema and high query speed on top of cost-efficient data storage. Compared to traditional JSON methods, VARIANT can bring a 10x performance improvement.

  • Support for various SQL and join operations: Apache Doris is highly compatible with MySQL syntax and interfaces. It supports INNER JOIN, CROSS JOIN, and all types of OUTER JOIN. The best part is its ability to auto-optimize based on data types to guarantee optimal performance under different circumstances.

As a Top-Level Project of the Apache Software Foundation, Apache Doris is supported by a robust and fast-growing community. It has accumulated over 11.8K GitHub stars and 636 contributors so far.

Apache Doris is the best open-source alternative to Rockset. Feel free to contact dev@doris.apache.org for more assistance.

Apache Doris

In cranberry and pumpkin season, we held the unforgettable Apache Doris Summit Asia 2023 with our remarkable committers, users, and community partners, to honor what we have achieved in the past year and to preview where we are going next.

The past year marked a breakthrough for Apache Doris, an open-source real-time data warehouse that has just undergone a comprehensive upgrade after a long run of steady incremental optimizations:

More

Thanks to the hard work of 275 committers, the Apache Doris 2.0 milestone has merged over 4100 pull requests, representing a 70% increase from version 1.2 last year and a 10-fold increase from 1.1.

Faster

This year, Apache Doris has attained a 10-fold performance increase in blind benchmarking and single-table queries, a 13-fold increase in multi-table joins, and a 20-fold increase in concurrent point queries. The high query performance is supported by the smart design of Apache Doris, including a vectorized execution engine, Merge-on-Write mechanism, the Light Schema Change feature, a self-adaptive parallel execution model, and a new query optimizer.

Wider

We have built Apache Doris into not just a powerful OLAP engine but also a data warehouse for a wider range of use cases, including log analysis and high-concurrency data services. To expand the data warehousing capabilities of Apache Doris, we have introduced Multi-Catalog to connect Doris to a wide array of data sources.

One of the most active open source big data projects

Apache Doris has become one of the world's most active open-source big data projects in all aspects:

  • It has hit 10K stars on GitHub, a year-on-year growth of 70%, and the momentum keeps going.
  • The community now includes almost 600 contributors and welcomes new faces every week.
  • With 120 monthly active contributors, Apache Doris has become a more active project than Apache Spark, Elasticsearch, Trino, and Apache Druid.
  • Over 160 pull requests are created every week. Meanwhile, we have established a mature code review pipeline, making sure that every pull request stands the test of 3000 use cases. This is how we guarantee stability in the midst of agile iteration.

[Figure: Apache Doris monthly active contributors]

Along with such growth, we've also witnessed higher diversity among contributors. They are engineers from tech giants and database unicorns, like VeloDB, which is the commercial company based on Apache Doris. Many cloud service providers, including Alibaba Cloud, Tencent Cloud, Huawei Cloud, AWS and GCP (coming soon), have also jumped on the bandwagon and provided Doris-based data warehouse cloud hosting services.

Fast-expanding user base

Apache Doris now has a user base of over 30,000 data engineers from more than 4000 enterprises, including those from the tech sector, finance, telecom, manufacturing, logistics, and retail. The great majority of them keep in close touch with the Apache Doris developers, committing code, getting involved in tests, and sharing experience and feedback with the community.

Fruits that have been reaped

We aim to make Apache Doris the first choice for real-time data analysis. What we have done in the past year can be summed up in three keywords:

  • Real-time: We have realized high-throughput real-time data writing and updates, as well as low query latency.
  • Unified: As we've been trying to make Doris an all-in-one platform that can undertake most of the analytic workloads for users, we have expanded and enhanced the data lakehousing capabilities of Doris, enabled faster log analysis, faster ELT/ETL, and faster response to point queries.
  • Cloud-native: This is a leap towards cloud infrastructure. Apache Doris can now be deployed and run on Kubernetes to reduce storage and computation costs.

Real-time response to queries

As mentioned, Apache Doris 2.0 delivers 10 times faster query speed than previous versions. What is the key accelerator behind such high performance? It is the cost-based query optimizer and the self-adaptive pipeline parallel execution model of Apache Doris.

In traditional data reporting, data is often arranged in flat tables. The idea of flat tables and pre-aggregated tables is to trade storage space for query speed. In these cases, the key to high performance is to accelerate data scanning and aggregation. However, since today's analytic workloads involve more complex computations with more and larger batch processing, data engineers often have to fine-tune the database and rewrite the SQL before they can enjoy satisfactory query speeds. That's why we have refactored the query optimizer in Apache Doris. The new query optimizer can figure out the most efficient query execution plan for a thousand-line SQL statement or a join query that relates dozens of tables, saving engineers a lot of effort.

Similarly, the new version of Doris has automated another engineering-intensive process: adjusting the compute instance execution concurrency in the backend. What bothered our users was that when queries of different sizes happened concurrently, these queries tended to fight for resources and thus required human intervention. To solve that, we have introduced a pipeline execution model. It automatically decides the execution concurrency for the current situation to make sure queries of all sizes are executed smoothly. As a result, Doris now has more efficient CPU usage and higher system stability during query execution.

For high-concurrency point queries, Apache Doris 2.0 reached a throughput of 30,000 QPS. It is a 20-fold improvement driven by optimizations in data storage, reading, and query execution. As a column-oriented DBMS, Apache Doris has relatively low row reading efficiency, so we have introduced row/column hybrid storage and a row cache to make up for that. We have also enabled the short circuit plan and prepared statements in Apache Doris. The former allows simple queries to skip the query planner for faster execution, and the latter allows users to reuse SQL for similar queries and thus reduce frontend overhead.

[Figure: hybrid column/row storage]
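
To make the trade-off between the two layouts concrete, here is a minimal Python sketch. It is a toy model and not Doris's storage format: the table, the column names, and both in-memory layouts are hypothetical and chosen only for illustration. It shows that a point query against a columnar layout has to touch every column, while a row-oriented copy answers it with one keyed read, and that an aggregation is the opposite case.

```python
# Toy comparison of columnar vs. row-oriented layouts (illustrative only,
# not Doris's actual storage format; table and column names are made up).

# Columnar layout: one list per column, good for scans and aggregations.
columns = {
    "user_id": [1, 2, 3],
    "city": ["Beijing", "Paris", "Lima"],
    "spend": [10.0, 25.5, 7.25],
}

# Row layout: one record per key, good for fetching a whole row by key.
rows = {
    uid: {name: values[i] for name, values in columns.items()}
    for i, uid in enumerate(columns["user_id"])
}


def point_query_columnar(user_id):
    """Locate the row position, then read every column separately."""
    i = columns["user_id"].index(user_id)
    return {name: values[i] for name, values in columns.items()}


def point_query_row(user_id):
    """A single keyed read returns the whole row."""
    return rows[user_id]


def total_spend():
    """An aggregation only needs to touch one column in the columnar layout."""
    return sum(columns["spend"])
```

Both point-query functions return the same row; the difference is how many separate structures each one has to visit, which is the cost a hybrid layout is meant to avoid for lookups while keeping columnar scans cheap.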

For multi-dimensional data analysis, we introduced the inverted index to accelerate fuzzy keyword queries, equivalence queries, and range queries.
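
As a rough illustration of why an inverted index speeds up keyword lookups, the following Python sketch builds one over a few hypothetical log lines. It is a simplified model under stated assumptions (whitespace tokenization, exact-match keywords, in-memory sets) and does not reflect how Doris implements or stores its index.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_all(index, *keywords):
    """Return ids of documents that contain every keyword (AND semantics)."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*postings) if postings else set()


# Hypothetical log lines used only for this example.
logs = {
    1: "connection timeout on backend node",
    2: "query finished on frontend node",
    3: "connection reset by peer",
}
index = build_inverted_index(logs)
```

A keyword query then becomes a set lookup plus an intersection, for example `search_all(index, "connection", "node")`, instead of a scan over every log line.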

Real-time data writing and update

Data writing is the other side of the real-time story, so we also put great effort into improving the data ingestion speed of Apache Doris. After optimizations like Memtable parallel flushing and single-copy ingestion, Apache Doris is now 2 to 8 times faster in data writing.

[Figure: data writing efficiency]

The Merge-on-Write mechanism has been upgraded in version 2.0. It enables an upsert throughput of nearly 1 million rows per second, and it now supports a wider range of updating operations, including partial column updates.

[Figure: Merge-on-Write]
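
The idea of resolving updates at write time, including partial column updates, can be sketched in a few lines of Python. This is a conceptual toy and not the Doris implementation: the `MergeOnWriteTable` class, its methods, and the sample order data are all hypothetical, and a dictionary stands in for the real storage layer.

```python
class MergeOnWriteTable:
    """Toy unique-key table: conflicts are resolved at write time,
    so a read is a plain lookup with no merge step."""

    def __init__(self):
        self._rows = {}  # primary key -> latest full row

    def upsert(self, key, **columns):
        """Insert a new row, or overwrite only the given columns
        of an existing row (a partial column update)."""
        current = self._rows.get(key, {})
        self._rows[key] = {**current, **columns}

    def get(self, key):
        """Return the latest version of the row, or None if absent."""
        return self._rows.get(key)


orders = MergeOnWriteTable()
orders.upsert(1001, status="created", amount=59.9)
orders.upsert(1001, status="paid")  # amount is left untouched
```

The design choice this illustrates is that the write path pays the cost of finding and merging the existing row, so readers always see one up-to-date row per key.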

Support for more use cases

For data lakehousing, our last big move was to introduce Multi-Catalog for auto-mapping and auto-synchronization of heterogeneous data sources. In 2.0, we have further enhanced it. It now supports even more data sources, and it is also much faster in various production environments. With Multi-Catalog, users can ingest their multi-source data into Doris using a simple INSERT INTO SELECT operation.

For log analysis, Doris 2.0 provides native support for semi-structured data, which can be arranged in data types like JSON, Array, and Map. On the basis of Light Schema Change, it allows Schema Evolution. In addition to the aforementioned inverted index, Doris 2.0 comes with a high-performance text analysis algorithm. Built on its large-scale data writing and low-cost storage capabilities, Apache Doris is 10 times more cost-effective than the common log analytic solutions on the market.

For different analytic workloads in one single cluster, the Doris solution to resource isolation is Workload Group. As the name implies, it divides workloads into groups and thus allows more flexible use of memory and CPU resources. Users can limit the number of queries that a workload group can handle concurrently, so when there are too many query requests, the excess ones wait in a queue. This relieves the burden on the system.

[Figure: resource isolation with Workload Group]
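
The queueing behavior described above can be modeled with a short Python sketch. It only illustrates the concept of a per-group concurrency limit with a waiting queue; the `WorkloadGroup` class, its `max_concurrency` parameter, and the query ids are hypothetical names, not Doris's configuration options or internals.

```python
from collections import deque


class WorkloadGroup:
    """Toy admission control: at most max_concurrency queries run at once,
    and the rest wait in FIFO order."""

    def __init__(self, name, max_concurrency):
        self.name = name
        self.max_concurrency = max_concurrency
        self.running = set()
        self.waiting = deque()

    def submit(self, query_id):
        """Run the query if a slot is free, otherwise queue it."""
        if len(self.running) < self.max_concurrency:
            self.running.add(query_id)
            return "running"
        self.waiting.append(query_id)
        return "queued"

    def finish(self, query_id):
        """Release a slot and promote the oldest waiting query, if any."""
        self.running.discard(query_id)
        if self.waiting and len(self.running) < self.max_concurrency:
            self.running.add(self.waiting.popleft())


reports = WorkloadGroup("reports", max_concurrency=2)
states = [reports.submit(q) for q in ("q1", "q2", "q3")]
reports.finish("q1")  # frees a slot, so q3 leaves the queue
```

In this model the third query is queued rather than rejected, and it starts as soon as one of the first two finishes.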

Low cost and high availability

Apache Doris provides tiered storage. The less frequently accessed data, namely, cold data, will be put into object storage to reduce costs. Moreover, since object storage only requires a single copy of data, the storage costs will be further cut by 2/3 compared to 3-replica storage. Calculation based on AWS pricing shows that tiered storage can save you 70% of your cloud disk expenditure.

[Figure: tiered storage]
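
The replica arithmetic behind the 2/3 figure can be checked with a small Python calculation. The function names and all prices below are placeholders chosen for illustration; they are not AWS prices and do not reproduce the calculation behind the 70% figure.

```python
def monthly_storage_cost(total_gb, cold_fraction, disk_price, object_price,
                         disk_replicas=3, tiered=True):
    """Monthly cost when cold data optionally moves to single-copy object storage.

    disk_price and object_price are per GB per month (placeholder units).
    """
    cold_gb = total_gb * cold_fraction if tiered else 0.0
    hot_gb = total_gb - cold_gb
    return hot_gb * disk_replicas * disk_price + cold_gb * object_price


def saving_ratio(total_gb, cold_fraction, disk_price, object_price):
    """Fraction of the untiered bill saved by tiering."""
    before = monthly_storage_cost(total_gb, cold_fraction, disk_price,
                                  object_price, tiered=False)
    after = monthly_storage_cost(total_gb, cold_fraction, disk_price,
                                 object_price, tiered=True)
    return 1 - after / before


# Replica effect alone: same unit price, all data cold, 1 copy instead of 3.
replica_only = saving_ratio(1000, 1.0, disk_price=1.0, object_price=1.0)

# Hypothetical mix: 80% cold data and made-up unit prices.
mixed = saving_ratio(1000, 0.8, disk_price=0.10, object_price=0.02)
```

With equal unit prices, keeping one copy instead of three saves exactly 1 - 1/3 = 2/3 of the cost of the cold data. Any cheaper per-GB price for object storage adds to that, and the overall saving depends on how much of the data is cold.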

To facilitate Kubernetes deployment, we have built a Kubernetes Operator. With it, users can easily deploy, scale, inspect, and maintain all Apache Doris nodes (frontends, backends, compute nodes, brokers) on Kubernetes. A compute node is a variant of the backend node that does not store any data, which is why it is a good fit for auto-scaling of clusters. During computation peaks, compute nodes can flexibly join the cluster and share the burden. Auto-scaling has been under active testing and will be released in an upcoming version of Apache Doris.

[Figure: Kubernetes Operator for Apache Doris]

For service availability guarantee, Apache Doris 2.0 supports Cross-Cluster Replication (CCR). As a disaster recovery solution, it supports read-write separation and multi-data center backup.

Reach for the stars

In the foreseeable future, Apache Doris will go further in the three aforementioned directions: real-time, unified, and cloud-native.

Get even faster

In the upcoming Apache Doris 2.1, the cost-based query optimizer (CBO) will be able to automatically collect execution statistics and provide support for hint syntax. It will also allow users to adjust the optimizing rules. To fully demonstrate the performance of our CBO, we will release TPC-DS benchmark results.

In addition, Doris 2.1 will support multi-table materialized views and writing intermediate results to disks. Meanwhile, a Union All operator will be added to accelerate the ETL process in Apache Doris. That means users will experience higher performance and stability when processing large batches of data. You can also expect a new Join algorithm that can double the execution speed of multi-table join queries.

In terms of data writing, we are working to make it simpler and more intuitive, with efforts in three areas.

  1. In future versions, data streams, local files, and data from relational databases or data lakes will all be mapped to relational tables, and they can all be written into Doris using a simple INSERT INTO statement.
  2. We will simplify the data writing pipeline. Data writing will be implemented by the built-in job scheduling mechanism, so users won't need an extra data synchronization component.
  3. When there is frequent data writing, Doris will wait until the data accumulates into a sizable batch at the server end, so as to reduce the pressure caused by small file merging.
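
The third point, accumulating small writes into a sizable batch on the server side, can be sketched as follows. This is a toy model under stated assumptions: the `BatchingWriter` class and its `flush_rows` threshold are hypothetical, flushing is triggered by row count only, and a list stands in for the files written to storage.

```python
class BatchingWriter:
    """Buffer small writes and flush them as one larger batch."""

    def __init__(self, flush_rows):
        self.flush_rows = flush_rows
        self.buffer = []
        self.flushed_batches = []  # stands in for files written to storage

    def write(self, rows):
        """Accept a small write; flush once enough rows have accumulated."""
        self.buffer.extend(rows)
        if len(self.buffer) >= self.flush_rows:
            self.flush()

    def flush(self):
        """Persist the buffered rows as a single batch."""
        if self.buffer:
            self.flushed_batches.append(self.buffer)
            self.buffer = []


writer = BatchingWriter(flush_rows=4)
for i in range(10):
    writer.write([{"id": i}])  # ten single-row writes
writer.flush()  # flush the remainder
```

Ten single-row writes end up as three batches instead of ten tiny files, which is the effect that reduces small-file merging pressure.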

In terms of data updating, as the Merge-on-Write mechanism advances towards maturity, it will be enabled in Doris by default. Users will be able to flexibly update or modify any columns in tables as they want. Also, based on Merge-on-Write, we will build a one-size-fits-all data model, so users don't have to rack their brains choosing the right data model for various use cases.

Apache Doris 2.1 will have enhanced observability. It will provide a brand new Profile for users to monitor operator execution, and visualize the query execution status with the aid of Doris Manager.

[Figure: Doris Manager]

Support more analytic scenarios

The above-mentioned multi-table materialized view and built-in job scheduling mechanism will also benefit the data lakehousing capability of Doris. From heterogeneous data sources to the data warehouse, users won't need a second component to do ETL and data warehouse layering.

In version 2.0, we support data writeback to JDBC sources, and we are going to expand that functionality to more data sources, including Apache Iceberg, Apache Hudi, Delta Lake and Apache Paimon.

[Figure: Apache Doris data warehouse layers]

For data ingestion from data lakes, Apache Doris currently adopts the MySQL protocol. In large-scale data reading or data science use cases (like those involving Pandas), this might be a throughput bottleneck. Thus, what we are doing is introducing an Arrow Flight-based high-speed reading interface, which transfers data via the Doris backends directly. In our tests, the new interface delivers a writing throughput that is 100 times higher.

[Figure: writing throughput]

For log analysis, the inverted index will support more complicated data types, such as Array, Map, and GEO. We will also introduce a new data type named Variant to provide schema-free support. This means users can not only put JSON data of any shape and type in the table fields, but also easily handle schema changes without any DDL operations.

[Figure: schemaless Variant data type]
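
One way to picture schema-free storage is a column that discovers its fields from the data it receives. The Python sketch below is an assumption-laden illustration, not a description of how Variant works internally: the `SchemalessColumn` class, the dotted-path naming, and the sample events are all hypothetical.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


class SchemalessColumn:
    """Toy column that stores each JSON path as its own list of values."""

    def __init__(self):
        self.subcolumns = {}  # dotted path -> list of values, one per row
        self.row_count = 0

    def append(self, record):
        flat = flatten(record)
        for path in flat.keys() - self.subcolumns.keys():
            # New path: backfill earlier rows with None, no DDL needed.
            self.subcolumns[path] = [None] * self.row_count
        for path, values in self.subcolumns.items():
            values.append(flat.get(path))
        self.row_count += 1


col = SchemalessColumn()
col.append({"user": {"id": 7}, "event": "login"})
col.append({"user": {"id": 8, "plan": "pro"}, "event": "upgrade"})
```

The second record introduces a new field, `user.plan`, and the column absorbs it by backfilling earlier rows with `None` rather than requiring a schema change up front.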

For workload management, we will enable higher flexibility. Users will be able to use SQL to create, manage, and allocate resources for their Workload Groups. We will continue to maximize resource utilization while ensuring resource isolation between workload groups.

Cloud-nativeness and storage-compute separation

When Apache Doris 2.0 was released, we previewed the merging of the SelectDB Cloud storage-compute separation solution into the Apache Doris project. After some intense code refactoring and compatibility building, this functionality will be ready in Apache Doris 2.2, and users will be able to experience the elastic computation capability.

[Figure: storage-compute separation]

Stick to Innovation

As Apache Doris is on the rise, we look back on its ten-year development and ask ourselves: what injects vitality into this great project and keeps it vibrant for this long? The answer is that we have been working with innovators.

Back when SQL on Hadoop gained currency, Apache Doris chose to stay outside the Hadoop ecosystem. It does not rely on HDFS for data storage, nor on Zookeeper for distributed monitoring, but insists on providing high availability through its own scalable processes. While the major databases on the market went by their own syntaxes, Apache Doris adopted standard SQL and the MySQL protocol in order to lower the threshold for users.

From the self-developed pre-aggregation storage engine, materialized views, and the MPP framework, to inverted index, row/column hybrid storage, Light Schema Change, Merge-on-Write, and the Variant data type, Apache Doris never stops breaking new ground to provide better performance and user experience, which is also what we are going to do next:

  • We want to work with more open-source enthusiasts to make a difference to the world.
  • We want to keep inspiring the data world by presenting more use cases.
  • We want to provide more and better choices for users by collaborating with partners along the data pipeline and cloud service providers.

By choosing Apache Doris, you choose to stay in the heartbeat of innovation. The Apache Doris community awaits newcomers.

Mingyu Chen

Hello everyone, welcome to the Doris Summit 2022, the first summit of Apache Doris since it was open-sourced. In this lecture, you will go through the development of Doris in 2022 and look into the new trends that Doris is exploring in 2023. My name is Mingyu Chen and I am the PMC Chair of Apache Doris. I have been developing Doris since 2014 and have witnessed its whole journey from open-sourcing to graduation from the Apache Incubator. My talk will cover the following aspects. Let's get started.

To begin, I will briefly introduce what Doris is and why we should choose it, in case you are new to Apache Doris. In 2022, Doris became one of the most active open-source big data analysis engine projects in the world, and the Doris community became one of the most active open-source communities in China, which may interest you. Moreover, cutting-edge features such as the vectorized execution engine, cloud-native capabilities, efficient semi-structured data analysis, real-time processing, and Lakehouse will be the focus of my lecture today. Also, since it is important to prioritize tasks at the beginning of the year, I will go through our job list with you shortly.

About Doris

Briefly speaking, Apache Doris is an easy-to-use, high-performance, and unified analytical database. The enterprise data flow chart below shows where Apache Doris stands. Data from various upstream data sources, such as transactional databases, log systems, and event tracking, as well as data from ETL components such as Flink, Spark, and Hive, is ingested into Doris through data processing and integration tools.

[Figure: enterprise data flow chart]

As a complete database system, Doris can directly serve various query workloads, including report analysis, multi-dimensional analysis, log analysis, user profiling, and lakehouse analysis. Thanks to its MPP SQL distributed query engine, Doris can also be used to query external data sources such as Hive, Iceberg, Hudi, Elasticsearch, and various transactional database systems connected through JDBC, without importing data or maintaining the schema of those data sources. Several core features help users solve practical problems:

  • No. 1 is ease of use. Doris supports ANSI SQL syntax, including single-table aggregation, sorting, filtering, multi-table joins, and subqueries. It also supports complex SQL syntax such as window functions and grouping sets. At the same time, users can extend system functions through UDFs and UDAFs. In addition, Apache Doris is compatible with the MySQL protocol, which allows users to access Doris through various BI tools.
  • No. 2 is high performance. Doris is equipped with an efficient column storage engine, which not only reduces the amount of data scanned, but also achieves an ultra-high data compression ratio. At the same time, Doris uses various indexing techniques to speed up data reading and filtering. Using partition and bucket pruning, Doris can support ultra-high concurrency for online services, and a single node can support up to thousands of QPS. Further, Apache Doris uses the vectorized execution engine to take full advantage of the parallel computing power of modern CPUs. Doris also supports materialized views. In terms of the optimizer, Doris uses a combination of CBO and RBO, with RBO supporting constant folding, subquery rewriting, predicate pushdown, etc.
  • No. 3 is a unified data warehouse. Thanks to its well-designed architecture, Doris can easily handle both low-latency, high-concurrency scenarios and high-throughput scenarios.
  • No. 4 is federated query analysis. With its complete distributed query engine, Doris can access data lakes such as Hive, Iceberg, and Hudi, and run high-speed queries against external data sources such as Elasticsearch and MySQL.
  • No. 5 is a rich ecosystem. Doris provides rich data ingestion methods, supports fast loading of data from localhost, Hadoop, Flink, Spark, Kafka, SeaTunnel, and other systems, and can also directly access data in MySQL, PostgreSQL, Oracle, S3, Hive, Iceberg, Elasticsearch, and other systems without data replication. At the same time, the data stored in Doris can also be read by Spark and Flink, and can be output to the upstream data application for display and analysis.

Next, we will review the remarkable achievements of the Doris community in 2022.

How should we look back on 2022?

In 2022, the world witnessed unprecedented changes, and countless magical moments became reality. Thankfully, the power of technology and open source kept us on the right path, and 2022 was a fruitful year for Apache Doris. Let's review the development of Apache Doris in the past year from several angles:

Important Indicators of the Community

[Figure: community indicators]

In the past year:

  • The number of cumulative community contributors has increased from 200 to nearly 420, a year-on-year increase of more than 100%, and it is still rising.
  • The number of monthly active contributors has doubled from 50 to 100.
  • The number of GitHub Stars has increased from 3.6k to 6.8k, and has been on the daily/weekly/monthly GitHub Trending list many times.
  • The number of all Commits increased from 3.7k to 7.6k. The amount of newly submitted code in the past year exceeded the total of previous years.

![community 2](/images/summit/en/community 2.png)

From these figures, we can see that Apache Doris grew explosively in 2022. Indicators across all dimensions grew by nearly 100%. This effort has also made Apache Doris one of the most active open-source communities in the big data and database world. As the GitHub contribution trend above shows, users and developers have made tremendous contributions to the community. Memorably, in June 2022, Apache Doris graduated from the Apache Incubator and became a Top-Level Project, the biggest milestone since it was open-sourced.

![top level](/images/summit/en/top level.png)

Open Source User Scale

Thanks to the voluntary technical support from the developers of SelectDB, a commercial company funding Apache Doris, communication with users became smoother in 2022, and we were able to interact with users more directly and listen to their real voices. Last year, Apache Doris was applied in dozens of industries, such as the Internet, fintech, telecommunications, education, automobiles, manufacturing, logistics, energy, and government affairs, and especially in the Internet industry, which is known for massive data. 80% of the top 50 Chinese Internet companies have long been using Apache Doris to solve data analysis problems in their own business, including Baidu, Meituan, Xiaomi, Tencent, JD.com, ByteDance, NetEase, Sina, 360 Total Security, MiHoYo, ZHIHU.COM, etc.

![logo wall](/images/summit/en/logo wall.png)

Globally, Apache Doris has served thousands of enterprise users, and this number is still growing rapidly. Most enterprise users are glad to contact the community and participate in community building through various means. Moreover, many of the enterprise users took part in Doris Summit, giving talks on their practical experience in real business scenarios.

Releases

Early versions frequently emphasized ease of use. The versions released in 2022 focused on performance, stability, and ease of use together, making for a comprehensive evolution.

  • In April, the community released Apache Doris V1.0.0, the first time the major version changed from 0 to 1 (V0.15 to V1.0.0) since the project was open-sourced. Version 1.0 introduced the vectorized execution engine, marking the entry of Apache Doris into the era of ultra-high-speed data analysis.
  • In version 1.1, released in June, we further improved and optimized the vectorized engine and enabled it by default. At the same time, the community began publishing LTS (Long-Term Support) releases for version 1.1 on a monthly basis to quickly fix bugs and optimize functions, aiming to ensure the higher stability required by the growing user community.
  • Launched in early December, version 1.2 not only introduced many important functions, such as Merge-on-Write for the Unique Key model, Multi-Catalog, Java UDF, the Array type, and the JSONB type, but also improved query performance by nearly ten times. These features make Apache Doris adaptable to more data analysis scenarios.
  • In version 1.2, stability and quality assurance were strongly stressed. On the one hand, using automated testing tools such as SQL Smith and test cases from various well-known open source projects, we have built test suites with millions of test cases. On the other hand, the community's admission pipeline and regression testing framework ensure the quality of merged code.

Evolution of Core Features

In 2022, the community's research and development focused mainly on four aspects: high performance, real-time processing, semi-structured data support, and Lakehouse.

  • Query performance improvement. From version 1.0 to 1.2, Apache Doris made remarkable progress in performance. In single-table tests, Apache Doris won 3rd place on the ClickBench database performance list launched by ClickHouse. In multi-table joins, thanks to the vectorized execution engine and various query optimizations, Apache Doris was 10 times faster on the standard SSB and TPC-H test data sets than version 0.15, released at the end of 2021, which makes Apache Doris one of the best databases in the world!

[Figure: performance]

  • Real-time processing optimization. In version 1.2, we implemented the Merge-on-Write data update method on the original Unique Key model, improving query performance by 5-10 times during high-frequency updates and enabling low-latency real-time analytics on updatable data. In addition, lightweight Schema Change makes adding and dropping columns easier, so users no longer need to convert historical data. Tools such as Flink CDC can be used to synchronize DML and DDL operations from transactional databases instantly, making data synchronization smoother and more unified.

[Figure: real-time processing]

  • Semi-structured data analysis. At present, Apache Doris supports the Array and JSONB types. The Array type can not only store complex data structures, but also support user behavior analysis through Array functions. JSONB is a binary JSON storage type, which not only offers 4 times faster access than text JSON, but also consumes less memory. Log data in various JSON structures can be ingested efficiently through JSONB.

[Figure: semi-structured data]

  • Lakehouse. In version 1.2.0, through multiple performance optimizations for external data sources such as Native Format Reader, late materialization, asynchronous IO, data prefetching, high-performance execution engine and query optimizer, Apache Doris can easily access external data sources, for instance, Hive, Iceberg and Hudi. And the speed of access is 3-5 times faster than Trino/Presto and 10-100 times faster than Hive.

[Figure: lakehouse]

2023 RoadMap

In 2023, the Apache Doris community will dive deep into developing new features. The 2023 RoadMap and the specific plan for the year are shown below:

[Figure: 2023 roadmap]

In 2023, we will start iterating on Apache Doris 2.x on a quarterly basis. At the same time, for each two-digit version, bug fixes and upgrades will be released on a monthly basis. From a functional point of view, the follow-up research and development will focus on the following main directions:

High Performance

High performance is the goal that Apache Doris constantly pursues. Doris' excellent performance on public test datasets such as ClickBench and TPC-H has proved that it has become industry-leading. In the future, we will further enhance performance, including:

  • More complex SQL: The new query optimizer will be available in the first quarter of 2023. It combines RBO and CBO strategies, handles complex queries more efficiently, and can fully execute all 99 TPC-DS queries.
  • Higher-concurrency point queries: High concurrency has always been a strength of Apache Doris. In 2023 we will further strengthen this capability through a series of features such as Short-Circuit Plan, Prepared Statement, and Query Cache, to support ultra-high concurrency of 10,000 QPS on a single node, with even higher concurrency when scaling out.
  • More flexible multi-table materialized views: In previous versions, Apache Doris accelerated the analysis of fixed-dimensional data through single-table materialized views. The new multi-table materialized view will decouple the lifecycle of the base table from that of the MV table. In this way, Doris can easily handle multi-table JOINs and pre-computation for more complex SQL queries, with asynchronous refresh and flexible incremental calculation methods. This feature will be available in the first quarter of 2023.

Cost-effective

Cost efficiency is the key to winning market competition for enterprises, which is true for databases as well. In the past, Apache Doris helped users greatly save the cost in computing and storage resources with many designs of ease of use. In the future, we will introduce a series of cloud-native capabilities to further reduce costs without affecting business efficiency, including:

  • Lower storage costs: We will explore combining object storage systems and file systems on the cloud to help users further reduce storage costs, including better separation of hot and cold data and migration of cold data to cheaper object storage or file systems. By combining techniques such as a single remote replica, cold data caching, and hot/cold data conversion, we can keep query efficiency unaffected while saving storage costs. This feature will be released in the first quarter of 2023.
  • More elastic computing resources: We plan to separate storage and compute state and adopt Elastic Compute Nodes for computing. Since they store no data, Elastic Compute Nodes scale faster, which makes it convenient for users to scale out quickly during peak business periods and further improves analysis efficiency in massive data computing, such as lakehouse analysis. This function will be released shortly.

Hybrid Workload

Many users are now building unified enterprise analysis platforms on Apache Doris. On the one hand, Apache Doris is required to handle larger-scale data processing and analysis. On the other hand, it is required to handle a wider range of analytical workloads, from real-time reports and ad-hoc queries to ELT/ETL, log retrieval, and more. To better adapt to these cases, new features will be released in 2023, including:

  • Pipeline execution engine: Compared with the traditional volcano model, the Pipeline model does not require concurrency to be set manually. Instead, it runs different pipelines in parallel, makes full use of the CPUs, and schedules execution more flexibly, which improves overall performance under mixed workloads.
  • Workload Manager: Resource isolation and division capabilities also urgently need improvement. Based on the Pipeline execution engine, we will launch features such as flexible workload management, resource queues, and isolation in shared services to balance query performance and stability under mixed workloads.
  • Lightweight fault tolerance: Doris will keep the high efficiency of the MPP architecture while tolerating errors, to better meet the challenges users face in ETL/ELT.
  • Function compatibility and UDFs in multiple languages: We will improve compatibility with Hive/Trino/Spark functions and support UDFs in multiple languages to help users process data more flexibly. Data migration to Apache Doris will also be easier than before.
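
The contrast between the volcano model and the Pipeline model in the list above can be sketched in a few lines of Python. This is a toy illustration, not the Doris engine: the operator names and the chunking scheme are assumptions, and Python threads do not give real CPU parallelism, so the code only shows the shape of the scheduling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int]) -> Iterator[int]:
    # Leaf operator: hands rows to its parent one at a time.
    for row in rows:
        yield row


def filter_even(child: Iterator[int]) -> Iterator[int]:
    # Pull-based operator: asks its child for the next row on demand.
    for value in child:
        if value % 2 == 0:
            yield value


def volcano_sum(rows: List[int]) -> int:
    # Volcano style: one chain of operators, driven row by row from the top.
    return sum(filter_even(scan(rows)))


def pipeline_task(chunk: List[int]) -> int:
    # One pipeline task: scan -> filter -> partial aggregate over one chunk.
    return sum(value for value in chunk if value % 2 == 0)


def pipeline_sum(rows: List[int], chunk_size: int = 4) -> int:
    # Split the input into chunks and let the executor schedule the tasks.
    # The pool size is left to the executor's default, not set by the caller.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(pipeline_task, chunks))
    # Final merge of the partial aggregates.
    return sum(partials)
```

Both functions return the same answer. The difference is that the pipeline version exposes independent units of work that a scheduler can place on any free core.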

Multi-model Data Analysis

In the past, Apache Doris was strong at structured data analysis. As demand for semi-structured and unstructured data analysis grew, we added the Array and JSONB types in version 1.2 to support such data natively. In future releases, we will continue to provide more cost-effective and better-performing solutions for log analysis, including:

  • Richer complex data types: In addition to the Array/JSONB types, we will add support for the Map/Struct types in the first quarter of 2023, including efficient writing, storage, and analysis functions, to better perform multi-model data analysis. In the future, more data types will be supported, such as IP, GEO (geographic information), and more time series data.
  • More efficient text analysis algorithms: For text data, we will introduce text analysis algorithms including adaptive Like, high-performance substring matching, high-performance regular expression matching, predicate pushdown of Like statements, Ngram Bloomfilter, etc. Full-text search is based on the inverted index and offers higher performance and better cost-effectiveness than Elasticsearch in log analysis. These features will come out in early 2023.
  • Dynamic Schema table: In other databases, the schema is relatively static, and DDL needs to be executed manually when the schema changes. In many recent cases the table structure changes all the time, so we plan to launch Dynamic Table, which automatically adapts the schema to the data being written, without DDL execution or manual adjustment. This feature will be released in the first quarter of 2023.
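
As an illustration of what a dynamic schema has to do, the following Python sketch grows a column map as new JSON-like records arrive. It is a simplified assumption of the idea, not the Dynamic Table design: `infer_type`, `evolve_schema`, the type names, and the widening rules are all hypothetical.

```python
from typing import Any, Dict


def infer_type(value: Any) -> str:
    # Map a Python value to an illustrative column type name.
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, list):
        return "ARRAY"
    if isinstance(value, dict):
        return "JSONB"
    return "STRING"


def evolve_schema(schema: Dict[str, str], record: Dict[str, Any]) -> Dict[str, str]:
    """Return a copy of `schema` extended to fit `record`, with no manual DDL."""
    updated = dict(schema)
    for name, value in record.items():
        if value is None:
            continue  # a null carries no type information
        new_type = infer_type(value)
        old_type = updated.get(name)
        if old_type is None:
            updated[name] = new_type  # new column discovered on write
        elif old_type != new_type:
            # Assumed widening rule: mixed numerics become DOUBLE,
            # any other conflict falls back to STRING.
            if {old_type, new_type} == {"BIGINT", "DOUBLE"}:
                updated[name] = "DOUBLE"
            else:
                updated[name] = "STRING"
    return updated
```

Feeding the records `{"id": 1, "msg": "a"}`, `{"id": 2, "latency": 1.5}`, and `{"latency": 3, "tags": ["x"]}` through `evolve_schema` one after another yields four columns, with `latency` kept as `DOUBLE`.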

Lakehouse

With the development of data lake technology, analysis performance has become the biggest constraint on data mining. Building analysis services on top of data lakes with an easy-to-use, high-performance query engine has become a new trend. Over the last year, through many performance optimizations for the data lake, a high-performance execution engine, and the query optimizer, Apache Doris has become extremely fast and easy to use on the data lake, with performance 3-5 times higher than that of Presto/Trino. In 2023, we will continue to go deeper, including:

  • Easier data access: In version 1.2, we released Multi-Catalog, which supports automatic metadata mapping and synchronization across multiple heterogeneous data sources and is used for accessing data lakes. Delta Lake, Iceberg, and Hudi will be better supported.
  • More complete data lake capabilities: We will provide incremental updates and queries on data in the data lake. Analysis results can be written back to the data lake, and data from external tables can be ingested into internal tables. Doris will also support reading and deleting multi-version snapshots, as well as materialized views.

Real-time and storage engine optimization

The value of data decreases over time, so real-time performance is very important for users. The Merge-on-Write data update in version 1.2 makes Apache Doris fast in both real-time updates and queries. In 2023, we will upgrade the storage engine with the following:

  • More stable data writing: A series of optimizations to compaction and batch processing will reduce resource costs, and a new memory management framework will improve the stability of the writing process.
  • More mature data-updating mechanism: In the past, column updates were implemented through Replace_if_not_null on the Agg model. In the future, we will add support for partial column updates on the Unique Key model, as well as data updates such as Delete, Update, and Merge.
  • A unified data model: Currently, the three data models of Apache Doris are widely used in various cases. In the future, we will try to unify the existing data models to provide a better user experience.
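
The Replace_if_not_null behavior mentioned above can be sketched as follows. This is only an illustration of the semantics in Python, under the assumption that a null in an incoming row means "leave the stored value alone"; `apply_partial_updates` and the in-memory table layout are hypothetical, not Doris internals.

```python
from typing import Dict, Iterable, Optional, Tuple

Columns = Dict[str, Optional[object]]


def apply_partial_updates(
    table: Dict[int, Columns],
    updates: Iterable[Tuple[int, Columns]],
) -> Dict[int, Columns]:
    """Merge rows by key; a None value never overwrites a stored value."""
    for key, columns in updates:
        row = table.setdefault(key, {})
        for name, value in columns.items():
            if value is not None:
                row[name] = value  # replace with the newer non-null value
            else:
                row.setdefault(name, None)  # keep whatever was stored before
    return table


table: Dict[int, Columns] = {}
apply_partial_updates(table, [
    (1, {"name": "a", "city": "x"}),
    (1, {"name": None, "city": "y"}),  # updates only `city` for key 1
    (2, {"name": "b", "city": None}),
])
```

After these three writes, key 1 holds `name = "a"` and `city = "y"`, so the second write behaved as a partial column update.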

Ease of use and stability

In addition to improving functionality, simplicity, ease of use, and stability are also goals that Apache Doris has long pursued. In 2023, we will dive deeper into the following:

  • Simplified table creation: Currently, Apache Doris already supports time functions in table partitioning. In the future, we will further simplify Bucket settings to help users build models easily.
  • Security: A permission management mechanism based on the RBAC model has already been launched, making user permissions more secure and reliable. Functions such as ID federation, row- and column-level permissions, and data desensitization will be further improved in the future.
  • Observability: Profile is an important means of locating query performance problems. In the future, we will strengthen Profile monitoring and provide visualized Profile tools to help users locate problems faster.
  • Better BI compatibility and data migration solutions: Currently, various BI tools can connect to Apache Doris through the MySQL protocol, and we will further adapt to mainstream BI software to ensure a better query experience. With the rise of data integration and migration tools such as dbt and Airbyte, more and more users synchronize data to Apache Doris in this way, so we will provide support for these users in the future.

How to join the community

Last but not least, we hope that more developers will join the community and help build a powerful database together. There are three ways to participate. First, you can subscribe to our developer mailing list, dev@doris.apache.org, which is the channel recommended by the Apache Way, and send any topic you want to discuss with the community. Second, you can join the developers' biweekly online meeting, held on Wednesdays at 8 pm (UTC+8); topics include introductions to new features, development progress, and more. Third, you can follow the DSIPs (Doris Improvement Proposals). The designs of Doris features are recorded in these documents, and both users and developers can follow the detailed design and development of important functions on the Wiki.

Apache Doris Repository

https://github.com/apache/doris

Apache Doris Website

https://doris.apache.org

morningman

Apache Doris is a modern, high-performance, real-time analytical database based on an MPP architecture. It is well known for its high performance and ease of use, and it can return query results over massive data sets with sub-second latency. It supports both high-concurrency point queries and high-throughput complex analysis. As a result, Apache Doris is well suited to many business scenarios, such as multi-dimensional reporting, user profiling, ad-hoc queries, and real-time dashboards.

Apache Doris was born as the Palo project within Baidu's advertising report business and was officially open-sourced in 2017. Baidu donated it to the Apache Software Foundation for incubation in July 2018, and it was then incubated and operated by members of the podling project management committee (PPMC) under the guidance of Apache Incubator mentors.

We are very proud that Doris has successfully graduated from the Apache Incubator. It is an important milestone. Throughout the incubation period, guided by the Apache Way and helped by our incubator mentors, we learned how to develop our project and community in the Apache Way, and we grew a great deal in the process.

At present, the Apache Doris community has gathered more than 300 contributors from nearly 100 enterprises in different industries, and the number of active contributors per month is close to 100. During the incubation period, Apache Doris released a total of 8 major versions, completed many major functions, including a storage engine upgrade and a vectorized execution engine, and released version 1.0. It is the strength of these open source contributors that has made today's results possible.

At the same time, Apache Doris now has a wide range of users in China and even around the world. Up to now, Apache Doris has been applied in the production environment of more than 500 enterprises around the world. Among the top 50 Internet companies in China by market value or valuation, more than 80% are long-term users of Apache Doris, including Baidu, Meituan, Xiaomi, JD, ByteDance, Tencent, Kwai, Netease, Sina, 360 and other well-known companies. It also has rich applications in some traditional industries, such as finance, energy, manufacturing, telecommunications and other fields.

You can quickly build a simple, easy-to-use, and powerful data analysis platform on Apache Doris. It is easy to get started with, and the learning cost is very low. In addition, the distributed architecture of Apache Doris is very simple, which greatly reduces the workload of system operation and maintenance. This is a key reason why more and more users choose Apache Doris.

As a mature analytical database project, Apache Doris has the following advantages:

  • Excellent performance: It is equipped with an efficient columnar storage engine, which not only reduces the amount of data scanned but also achieves an ultra-high data compression ratio. Doris also provides rich index structures to speed up data reading and filtering. Using partition and bucket pruning, Doris can support the ultra-high concurrency of online service workloads, and a single node can support up to thousands of QPS. Further, Apache Doris uses a vectorized execution engine to take full advantage of the parallel computing power of modern CPUs, is supplemented by intelligent materialized views that accelerate pre-aggregation, and performs both rule-based and cost-based query optimization through its query optimizer. Together, these techniques allow Doris to reach very high query performance.

  • Easy to use: It supports ANSI SQL syntax, including single-table aggregation, sorting, filtering, multi-table joins, subqueries, etc. It also supports complex SQL syntax such as window functions and grouping sets. Users can extend the system through UDFs, UDAFs, and other user-defined functions. In addition, Apache Doris is compatible with the MySQL protocol, so users can access Doris through various client tools and connect BI tools seamlessly.

  • Streamlined architecture: The system has only two modules: frontend (FE) and backend (BE). The FE node is responsible for receiving user requests, parsing and planning queries, metadata storage, and cluster management, while the BE node is responsible for data storage and the execution of query plans. It is a complete distributed database management system. Users can run an Apache Doris cluster without installing any third-party management and control components, and the deployment and upgrade processes are very simple. Both modules scale horizontally, and a cluster can be expanded to hundreds of nodes, supporting the storage of more than 10 PB of data.

  • Scalability and reliability: It stores multiple replicas of data, and the cluster is self-healing. Its own distributed management framework automatically manages the distribution, repair, and balancing of data replicas. When replicas are damaged, the system automatically detects and repairs them. Adding a node takes only one SQL command, and the data replicas are automatically rebalanced among nodes without manual intervention. Whether scaling out, scaling in, recovering from a single-node failure, or upgrading, the system does not need to stop running and continues to provide stable and reliable online services.

  • Rich ecosystem: It provides rich data synchronization methods, supports fast loading of data from local files, Hadoop, Flink, Spark, Kafka, SeaTunnel, and other systems, and can also directly access data in MySQL, PostgreSQL, Oracle, S3, Hive, Iceberg, Elasticsearch, and other systems without data replication. The data stored in Doris can also be read by Spark and Flink and output to upper-layer data applications for display and analysis.
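
The partition and bucket pruning mentioned under "Excellent performance" can be illustrated with a small Python sketch. It assumes a table partitioned by day and hash-bucketed by a user ID; `NUM_BUCKETS`, `bucket_of`, and `tablets_to_scan` are hypothetical names, and the CRC32 hash stands in for whatever hash function a real system uses.

```python
import zlib
from datetime import date
from typing import List, Tuple

NUM_BUCKETS = 4  # assumed number of hash buckets per partition


def bucket_of(user_id: int) -> int:
    # Stable hash of the bucketing column; CRC32 is only a stand-in.
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_BUCKETS


def tablets_to_scan(
    partitions: List[date], day: date, user_id: int
) -> List[Tuple[date, int]]:
    # Partition pruning: keep only the partition that matches the date filter.
    # Bucket pruning: within it, keep only the bucket the key hashes to.
    return [(p, bucket_of(user_id)) for p in partitions if p == day]


partitions = [date(2022, 6, 1), date(2022, 6, 2), date(2022, 6, 3)]
selected = tablets_to_scan(partitions, date(2022, 6, 2), user_id=42)
total = len(partitions) * NUM_BUCKETS
print(f"scanning {len(selected)} of {total} data shards")
```

A point query with equality filters on both columns touches a single shard instead of all twelve, which is why this kind of pruning matters for high-concurrency workloads.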

Graduation is not the ultimate goal; it is the starting point of a new journey. Our goal in launching Doris was to provide more people with better data analysis tools and to solve their data analysis problems. Becoming an Apache top-level project is not only an affirmation of the hard work of all contributors to the Apache Doris community, but also means that we have established a strong, prosperous, and sustainable open source community under the guidance of the Apache Way. In the future, we will continue to operate the community in the Apache Way. I believe we will attract more excellent open source contributors, and the community will continue to grow with the help of all contributors.

Apache Doris will take on more challenging and meaningful work in the future, including a new query optimizer, support for Lakehouse integration, and architecture evolution for cloud infrastructure. More open source technology enthusiasts are welcome to join the Apache Doris community and grow together.

Once again, we sincerely thank all contributors who participated in the construction of Apache Doris community and all users who use Apache Doris and constantly put forward improvement suggestions. At the same time, we also thank our incubator mentors, IPMC members and friends in various open source project communities who have continuously encouraged, supported and helped us all the way.

Apache Doris GitHub:

https://github.com/apache/doris

Apache Doris website:

http://doris.apache.org

Please contact us via:

dev@doris.apache.org.

See How to subscribe:

https://doris.apache.org/community/subscribe-mail-list