
Apache Druid is an analytical data platform that combines the features of a time series database, a data warehouse, and a full-text search system. This article walks through Druid's features, usage scenarios, technical characteristics, and architecture, to help you choose a data storage solution and deepen your understanding of Druid and of time series storage in general.

Overview

  • A modern cloud-native, stream-native analytical database

    Druid is designed for workflows that need fast queries and fast data ingestion. Its strengths are a powerful UI, runtime query operability, and high-performance concurrent processing. Druid can be seen as an open-source alternative to a data warehouse for a wide range of use cases.

  • Easy integration with existing data pipelines

    Druid can fetch data from message bus streams (such as Kafka, Amazon Kinesis) or batch load files from data lakes (such as HDFS, Amazon S3, and other similar data sources).

  • Performance up to 100 times faster than traditional solutions

    Druid’s benchmark performance tests for data ingestion and data query significantly outperform traditional solutions.

    Druid’s architecture combines the best features of data warehousing, time series databases, and retrieval systems.

  • Unlock new workflows

    Druid unlocks new query methods and workflows for clickstream, APM, supply chain, network telemetry, digital marketing, and other event-driven scenarios. Druid is built for quick ad hoc queries over both real-time and historical data.

  • Deployable on AWS/GCP/Azure, hybrid clouds, Kubernetes, and leased servers

    Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deploying Druid is easy: scale up or down simply by adding or removing services.

Usage scenarios

Apache Druid is suitable for scenarios that require real-time data ingestion, high-performance queries, and high availability. Therefore, Druid is often used as the backend of an analytics system with a rich GUI, or of a highly concurrent API that needs fast aggregations. Druid works best with event-oriented data.

Common usage scenarios:

  • Clickstream analytics (web and mobile analytics)

  • Risk control analysis

  • Network telemetry analytics (network performance monitoring)

  • Server metrics storage

  • Supply chain analytics (manufacturing metrics)

  • Application performance metrics

  • Business intelligence / OLAP, real-time online analytics

These usage scenarios are examined in detail below:

User activities and behaviors

Druid is often used for clickstream, access-stream, and activity-stream data. Scenarios include measuring user engagement, tracking A/B test data for product releases, and understanding usage patterns. Druid can compute user metrics both exactly and approximately, including count-distinct metrics. This means a metric such as daily active users can be approximated within a second (with roughly 98% average accuracy) to see the overall trend, or computed exactly for reporting to stakeholders. Druid can also be used for funnel analysis, measuring how many users performed one action but not another, which is useful for tracking user sign-ups through a product funnel.
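
For instance, a minimal sketch of an approximate daily-active-users query issued through Druid's SQL-over-HTTP API might look like the following. The host/port (a router at localhost:8888) and the datasource and column names (user_events, user_id) are illustrative assumptions, not details from this article.

```python
import requests

# Druid's SQL API accepts a JSON body containing the query text.
# Endpoint, datasource, and column names below are assumptions for illustration.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"

sql = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT(user_id) AS daily_active_users
FROM user_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY 1
ORDER BY 1
"""

resp = requests.post(DRUID_SQL_URL, json={"query": sql})
resp.raise_for_status()
for row in resp.json():
    print(row["day"], row["daily_active_users"])
```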

Network flow

Druid is often used to collect and analyze network flow data, managing flow records that are sliced and grouped by arbitrary attributes. Druid's ability to ingest large volumes of network flow records and to quickly group and rank across dozens of attributes at query time is what makes it useful for network flow analysis. These attributes include core attributes such as IP addresses and port numbers, as well as enriched attributes such as geographic location, service, application, device, and ASN. Druid can handle non-fixed schemas, which means you can add whatever attributes you want.

Digital marketing

Druid is often used to store and query online advertising data. This data, typically coming from ad-serving providers, is crucial for measuring and understanding advertising campaign effectiveness, click-through rates, conversion rates (burn rates), and more.

Druid was originally designed to power a user-facing analytics application for advertising data. Druid has seen extensive production use for advertising data, with users around the world storing petabytes of data across thousands of servers.

Application Performance Management

Druid is often used to track operational data generated by applications. Similar to the user-activity scenario, this data can describe how users interact with the application, or it can be metrics reported by the application itself. Druid can be used to drill down into how different components of an application are performing, locate bottlenecks, and diagnose problems.

Unlike many traditional solutions, Druid offers a smaller storage footprint, lower complexity, and higher data throughput. It can quickly analyze application events with thousands of attributes and compute complex load, performance, and utilization metrics, for example the 95th-percentile query latency of an API endpoint. Data can be organized and sliced by any ad hoc attribute, such as by day, by user profile, or by data center location.
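
As a sketch of the kind of percentile query described above, the following assumes a hypothetical app_events datasource with endpoint and latency_ms columns, and that the druid-datasketches extension is loaded (it provides APPROX_QUANTILE_DS in Druid SQL):

```python
import requests

# p95 latency per API endpoint over the last hour -- a sketch, not the article's exact setup.
# Datasource and column names (app_events, endpoint, latency_ms) are assumptions.
sql = """
SELECT
  endpoint,
  APPROX_QUANTILE_DS(latency_ms, 0.95) AS p95_latency_ms
FROM app_events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY endpoint
ORDER BY p95_latency_ms DESC
LIMIT 20
"""

resp = requests.post("http://localhost:8888/druid/v2/sql/", json={"query": sql})
resp.raise_for_status()
print(resp.json())
```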

IoT and device metrics

Druid can be used as a time series database solution to store and process server and device metrics. It ingests machine-generated data in real time and supports quick ad hoc analysis to measure performance, optimize hardware resources, and locate problems.

Unlike many traditional time series databases, Druid is essentially an analysis engine. Druid combines the concepts of a time series database, a column analysis database, and a retrieval system. It supports time-based partitioning, column storage, and search indexes in a single system. This means that time-based queries, digital aggregation, and search-filtered queries are particularly fast.

You can include millions of unique dimension values in your metrics and freely group and filter by any dimension (dimensions in Druid are similar to tags in a time series database). You can compute many complex metrics over tag groups and rankings, and you can search and filter on tags much faster than in a traditional time series database.

OLAP and business intelligence

Druid is often used in business intelligence scenarios. Companies deploy Druid to accelerate queries and power analytics applications. Unlike Hadoop-based SQL engines such as Presto or Hive, Druid is designed for high concurrency and sub-second queries, enhancing interactive data exploration through a UI. This makes Druid well suited for truly interactive visual analytics.

Technical features

Apache Druid is an open-source distributed data storage engine. Druid's core design combines concepts from OLAP/analytic databases, time series databases, and search systems to create a unified system for a broad range of use cases. Druid brings the key characteristics of these three kinds of systems together in its ingestion layer, storage format, query layer, and core architecture.

Druid’s key features include:

  • Columnar storage

    Druid stores and compresses each column individually, and reads only the columns needed for a particular query. This supports fast scans, rankings, and groupBy operations.

  • Native search index

    Druid creates inverted indexes for string values to enable fast searching and filtering.

  • Streaming and batch ingestion

    Out-of-the-box connectors for Apache Kafka, HDFS, and AWS S3, as well as stream processors.

  • Flexible data schema

    Druid gracefully adapts to changing data schemas and nested data types.

  • Optimized partitioning based on time

    Druid intelligently partitions data based on time. As a result, Druid’s time-based queries will be significantly faster than traditional databases.

  • SQL statement support

    In addition to native JSON-based queries, Druid supports SQL over both HTTP and JDBC (see the sketch after this list).

  • Horizontal scalability

    Ingestion rates of millions of records per second, massive data retention, and sub-second query latencies.

  • Easy operations

    You can add or remove servers to expand or shrink the capacity. Druid supports automatic rebalancing and failover.
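
To make the SQL support mentioned above concrete, here is a small sketch of Druid's SQL-over-HTTP API with a result format and query context specified. The host/port, datasource, and column names are assumptions; the `resultFormat`, `header`, and `context` fields are standard parts of the SQL API request body.

```python
import requests

# Sketch: Druid SQL over HTTP, asking for CSV output with a header row.
# The datasource name (user_events) and host/port are illustrative assumptions.
payload = {
    "query": "SELECT page, COUNT(*) AS views FROM user_events "
             "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
             "GROUP BY page ORDER BY views DESC LIMIT 10",
    "resultFormat": "csv",      # other formats include "object", "array", "objectLines"
    "header": True,             # include column names in the result
    "context": {"sqlTimeZone": "Etc/UTC"},
}

resp = requests.post("http://localhost:8888/druid/v2/sql/", json=payload)
resp.raise_for_status()
print(resp.text)
```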

Ingestion

Druid supports both streaming and batch ingestion. Druid typically connects to raw data sources through a message bus such as Kafka (for streaming data) or through a distributed file system such as HDFS (for batch data).

Through an indexing process, Druid converts raw data into segments, a query-optimized data structure, and stores them on data nodes.
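
As an illustration of streaming ingestion, the sketch below submits a minimal Kafka supervisor spec to Druid's supervisor API. The Kafka broker address, topic, datasource name, and column names are all assumptions, and the spec is trimmed to the essentials rather than being a production configuration.

```python
import requests

# Minimal Kafka supervisor spec (a sketch; topic, columns, and addresses are assumptions).
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "ioConfig": {
            "type": "kafka",
            "topic": "user_events",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "inputFormat": {"type": "json"},
            "useEarliestOffset": True,
        },
        "dataSchema": {
            "dataSource": "user_events",
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user_id", "page", "country"]},
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "none",
                "rollup": False,
            },
        },
        "tuningConfig": {"type": "kafka"},
    },
}

# In a typical quickstart setup the router (or the Overlord directly) serves the supervisor API.
resp = requests.post(
    "http://localhost:8888/druid/indexer/v1/supervisor", json=supervisor_spec
)
resp.raise_for_status()
print(resp.json())
```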

Data storage

Like most analytical databases, Druid uses column storage. Druid compresses and encodes columns differently depending on their data type (string, number, and so on). Druid also builds different types of indexes for different column types.

Like a search system, Druid creates inverted indexes for string columns for faster searching and filtering. Like a time series database, Druid intelligently partitions data by time for faster time-based queries.

Unlike most traditional systems, Druid can pre-aggregate data at ingestion time. This pre-aggregation, called rollup, can significantly reduce storage costs.
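
To illustrate what rollup does, here is a plain-Python sketch (independent of Druid itself, with made-up data) that pre-aggregates raw events by truncating the timestamp to the hour and grouping by the remaining dimensions; the number of stored rows shrinks while the aggregated metrics remain queryable.

```python
from collections import defaultdict
from datetime import datetime

# Raw events: (timestamp, page, country, bytes) -- illustrative data only.
events = [
    ("2024-01-01T10:05:00", "/home", "US", 512),
    ("2024-01-01T10:17:00", "/home", "US", 256),
    ("2024-01-01T10:42:00", "/cart", "DE", 128),
    ("2024-01-01T11:03:00", "/home", "US", 640),
]

# Rollup: truncate time to the hour, group by (hour, page, country),
# and keep only aggregates (row count and summed bytes).
rolled_up = defaultdict(lambda: {"count": 0, "bytes": 0})
for ts, page, country, nbytes in events:
    hour = datetime.fromisoformat(ts).replace(minute=0, second=0)
    key = (hour.isoformat(), page, country)
    rolled_up[key]["count"] += 1
    rolled_up[key]["bytes"] += nbytes

for key, aggs in sorted(rolled_up.items()):
    print(key, aggs)
# 4 raw rows become 3 stored rows; with more repetition the savings grow.
```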

Querying

Druid supports JSON-over-HTTP and SQL queries. In addition to standard SQL operations, Druid offers a set of unique operators and a suite of approximate algorithms for fast counting, ranking, and quantile computation.
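
As a sketch of the native JSON-over-HTTP interface mentioned above, the following posts a topN query to the native query endpoint. The datasource, dimension, and interval are illustrative assumptions.

```python
import requests

# Native topN query: top 10 pages by row count over one week (all values assumed).
query = {
    "queryType": "topN",
    "dataSource": "user_events",
    "intervals": ["2024-01-01/2024-01-08"],
    "granularity": "all",
    "dimension": "page",
    "metric": "views",
    "threshold": 10,
    "aggregations": [{"type": "count", "name": "views"}],
}

resp = requests.post("http://localhost:8888/druid/v2/", json=query)
resp.raise_for_status()
print(resp.json())
```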

Architecture

Druid has a microservices architecture and can be thought of as a database disassembled into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed separately or together on commodity hardware.

Druid names each service explicitly so that operators can tune each one according to its usage and load. For example, when the workload demands it, operators can give more resources to the ingestion service and fewer to the query service.

Druid services can fail independently without affecting the other services.

Operations

Druid is designed to be a robust system that runs 24/7. It has the following features to ensure long-term uptime and prevent data loss.

  • Data replication

    Druid creates multiple replicas of the data according to the configured replication factor, so a single-node failure does not affect Druid queries.

  • Independent service

    Druid names each of its main services explicitly, and each service can be tuned for its role. A service can fail independently without affecting the normal operation of the others. For example, if the ingestion service fails, no new data is loaded into the system, but existing data can still be queried.

  • Automatic data backup

    Druid automatically backs up all indexed data to a file system, which can be a distributed file system such as HDFS. You can lose all of the Druid cluster's local data and quickly reload it from this backup.

  • Rolling updates

    With rolling updates, you can update a Druid cluster without downtime, in a way users do not notice. Druid versions are backward compatible.

To learn about time series databases and how they compare, see this other article:

Understanding and Selecting a Time Series Database (TSDB)
