Small knowledge, big challenge! This article is participating in the creation activity of “Essential Tips for Programmers”. This article has participated in the “Digitalstar Project” and won a creative gift package to challenge the creative incentive money.


Array and table JOIN

Using array functions directly to locate and process label values can be cumbersome, especially when working across several columns of an array. Fortunately, ClickHouse has a very convenient **ARRAY JOIN** that makes it easy to “unroll” ARRAY values into a name-value pair table. Here is an example of using **ARRAY JOIN** :

SELECT date, vm_id, vm_type, name, value
FROM vm_data
ARRAY JOIN tags_name AS name, tags_value AS value
ORDER BY date, vm_id, name
Copy the code

ARRAY JOIN works as follows:

The left vm_data columns (date, vm_id, vm_type) are “joined” with the values in the ARRAY listed after ARRAY JOIN (tags_name, tags_value). ClickHouse creates a column for each listed array and fills in the values from each array in the same order. The result looks something like this.

┌ ─ ─ ─ ─ ─ ─ ─ the date ─ ┬ ─ vm_id ─ ┬ ─ vm_type ─ ─ ─ ─ ─ ┬ ─ name ─ ─ ┬ value ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ │ │ 6220 2020-09-03 m5. Large │ group │ RTB │ │ │ 2020-09-03 │ 6220 m5. Large │ name │ SFG prod - 01 │ │ │ │ 6221 2020-09-03 m5ad. Xlarge │ group │ marketing │ │ │ │ 6221 2020-09-03 M5ad. xlarge │ name │ MT-PROD-65 │ 2020-09-03 │ 6221 │ m5ad.xlarge │ owner │ Casey │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘Copy the code

ClickHouse’s documentation includes an article on ARRAY JOIN that illustrates its flexibility.

Here’s an example: The following query adds a sequence number and sorts the rows by array sequence order using the convenient **arrayEnumerate()** function, which returns the array index value in ascending order.

SELECT date, vm_id, vm_type, name, value, seq
FROM vm_data
ARRAY JOIN
  tags_name AS name,
  tags_value AS value,
  arrayEnumerate(tags_name) AS seq
ORDER BY date, vm_id, seq

/* sql answer*/┌ ─ ─ ─ ─ ─ ─ ─date─ ┬ ─ vm_id ─ ┬ ─ vm_type ─ ─ ─ ─ ─ ┬ ─ name ─ ─ ┬ ─value─ ─ ─ ─ ─ ─ ─ ┬ ─ seq ─ ┐ │2020- 09- 03 │  6220│ m5. Large │ name │ SFG-prod- 01 │   1 │
│ 2020- 09- 03 │  6220│ m5. Large │group │ rtb         │   2 │
│ 2020- 09- 03 │  6221│ │ m5ad.xlarge │ name │ MT-prod- 65.  │   1 │
│ 2020- 09- 03 │  6221│ m5ad. Xlarge │group │ marketing   │   2 │
│ 2020- 09- 03 │  6221│ m5ad. Xlarge │ owner │ Casey │3│ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┘Copy the code

ARRAY JOIN is useful for rendering output, because the result of a query containing an ARRAY would be difficult for a human to read and might require specialized deserialization logic in the client application. It also helps to reduce query complexity.

With ARRAY JOIN, we can minimize or even eliminate ARRAY function expressions. The following example is a rewrite of the previous example to find the type of virtual machine used by “group RTB “:

SELECT distinct vm_type FROM (
  SELECT date, vm_id, vm_type, name, value
  FROM vm_data
  ARRAY JOIN tags_name AS name, tags_value AS value
  WHERE name = 'group' AND value = 'rtb'
)
Copy the code

We cannot conclude our introduction to data modeling using arrays without mentioning arrayJoin(). This function can be added to the SELECT list to produce unscrolled results, as shown in the following example:

SELECT 1.2, arrayJoin(['a'.'b']) AS a1

/* sql answer*/┌ ─1─ ┬ ─2─ ─ ┬ ─ a1 ┐ │12 │ a  │
│ 12│ │ b └ ─ ─ ─ ┴ ─ ─ ─ ┴ ─ ─ ─ ─ ┘Copy the code

This is exactly the same as the following query with ARRAY JOIN:

SELECT 1.2 FROM system.one ARRAY JOIN ['a'.'b'] AS a1
Copy the code

However, there is one key difference.

As we saw above, ARRAY JOIN allows multiple arrays to expand values in parallel across all arrays. ArrayJoin () behaves differently. If there are multiple arrayJoin() calls, they produce the following result:

SELECT  1.2, 
  arrayJoin(['a'.'b']) AS a1, arrayJoin(['i'.'ii']) AS a2

/* sql answer*/┌ ─1─ ┬ ─2─ ─ ┬ ─ a1 ┬ ─ a2 ─ ┐ │12 │ a  │ i  │
│ 12 │ a  │ ii │
│ 12 │ b  │ i  │
│ 12│ b │ II │ ├ ──── ──── ──── ──── ──── ──── ──── ──── ──── our r companyCopy the code

As you can see, the result is a Cartesian product of array values, which is probably not what you want. For the rest of this article, we’ll focus on ARRAY JOIN, because it allows us to work with arrays with related values. This behavior is critical for more advanced uses of arrays.

conclusion

The article you just read describes the basic use of arrays in ClickHouse. We showed how to use paired arrays to represent variable data, how to use ARRAY functions to extract data, and how to JOIN arrays and table rows using ARRAY JOIN and arrayJoin().

The array capabilities we’re dealing with are beyond the capabilities of many SQL databases. For ClickHouse, this is just the beginning. In the next article, we’ll show how arrays and SQL GROUP BY are closely linked. Integration between arrays and aggregations enables users to identify sequences of events, as well as build funnels to track progress toward desired goals in marketing, sales, and other areas. This is an important analysis tool for a wide range of interesting applications.