Speaking of the application of big data, many students may immediately think of user portrait. User Profile, called User Profile in English, can be perfectly abstracted from a User’s information, through the User portrait data can accurately analyze the User’s various behavior habits, such as consumption habits, interests, ability and other important User information. This article introduces how to use json data type of PostgreSQL to build user portrait database.

The label model

To illustrate the specific method, we build a simple two-level label system:

  • Occupation: farmer, worker, IT engineer, hairdresser, doctor, teacher, artist, lawyer, civil servant, official
  • Hobbies: swimming, table tennis, badminton, tennis, mountain climbing, golf, skiing, mountain climbing, travel
  • Education: No education, primary school, junior high school, senior high school, technical secondary school, junior college, undergraduate, master, doctor
  • Personality: outgoing, introverted, cautious, steady, careful, careless, impetuous, confident

Of course, you can also set up three or more levels of complex labeling system according to the actual situation.

Build tables and data

Here is a concrete example of how to build a user tag database using PostgreSQL JSON data types. Create user label table:


CREATE TABLE user_tag(uid serial primary key, tag jsonb);
Copy the code

The table has only two fields, uid representing the user ID, and the tag field containing all of the user’s tags. The tag field is of type JSONb. To illustrate how to use this scheme, you need to create some label data for the table. Create a function that will be called later in the INSERT SQL to generate some random tags:


CREATE OR REPLACE FUNCTION f_random_attr(attr text[], max_attr int)
RETURNS text[] AS ?
DECLARE
     i integer := 0; 
     r integer := 0;
     res  text[];
     v text;
     l integer;
     num integer;
 BEGIN
     num := (random()*max_attr)::int;
     IF num < 1 THEN
         num := 1;
     END IF;
     l := array_length(attr, 1);
     WHILE i < num LOOP
          r := round(random()*l)::int + 1;
          v := attr[r];
          IF res @> array[v] THEN
             continue;
          ELSE 
             res := array_append(res, v);
             i := i + 1;
          END IF;
     END LOOP;
     return res;
 END;
? LANGUAGE plpgsql;
Copy the code

We can create some random data with the following INSERT statement:

INSERT INTO user_tag(uid, tag) SELECT seq, json_build_object(' occupation ', F_random_attr (array [' farmers', 'workers',' IT engineers', 'the barber', 'doctor', 'teacher,' art ', 'lawyers',' civil servants', 'officials'], 1),' hobby ', F_random_attr (array [' swimming ', 'table tennis',' badminton, tennis, mountain climbing, 'golf', 'skiing', 'climbing the mountain', 'travel'], 5), 'degree', F_random_attr (array [' no qualifications', 'primary', 'middle school', 'high school', 'school', 'specialist', 'bachelor', 'master,' Dr '], 1), the 'character', F_random_attr (array [' outside ', 'to', 'cautious' and' and ', 'careful', 'careless', 'fickleness,' confidence '], 3)) : : jsonb FROM generate_series (1, 10000) as the t (seq);Copy the code

View the generated data:

osdba=# select * from user_tag limit 5; uid | tag -----+-------------------------------------------------------------------------------------------------- 1 | {" degree ": [" Dr"], "personality" : (null, "outgoing"], "hobby" : [" badminton ", "skiing" and "climbing"], "professional" : [" teacher "]} 2 | {" degree ": [" secondary"], "personality" : [" impetuous ", "shy", "self-confidence"], "hobby" : / "tourism", "professional" : [" artist "]} 3 | {" degree ": /" degree ", "personality" : [" careful "], "hobby" : [" badminton ", null, "mountain climbing"], "professional" : [" lawyer "]} 4 | {" degree ": (null)," personality ": [" caution", "impetuous"], "hobby" : / "climbing", "professional" : [" lawyer "]} | {5 "degree" : [" Dr "], "personality" : / "impetuous", "stable", "love" : [" I "], "I ": [" I "]} (5 rows)Copy the code

Build indexes and queries

To speed up queries, we build GIN indexes on tag columns. GIN index is a special type of index in PostgreSQL that performs functions similar to full-text search:


CREATE INDEX idx_user_tag_tag on user_tag using gin(tag);
Copy the code

If you want to query the teacher who is outgoing and careful, then SQL statement:

Osdba = # select * from user_tag where tag @ > '{" personality ": [" outgoing" and "careful"]}' and tag @ > '{" career ": [" teacher"]}'; uid | tag ------+----------------------------------------------------------------------------------------------------------------- -- -- -- -- -- 881 | {" degree ": [" secondary"], "personality" : [" carefully ", "shy", "outgoing"], "hobby" : [] "golf", "professional" : [" teacher "]} 1031 | {" degree ": (null)," personality ": ] [" outgoing "and" careful ", "love" : [" climbing "], "professional" : [" teacher "]} 3313 | {" degree ": (null)," personality ": [" outgoing" and "careful"], "hobby" : / "table tennis", "professional" : [" teacher "]} 4053 | {" degree ": [" undergraduate"], "personality" : [" carefully ", "outgoing"], "hobby" : [" climbing ", "ski", "swimming"], "professional" : [" teacher "]} 4085 | {" degree ": [" junior"], "personality" : ] [" outgoing "and" careful ", "love" : [] "tennis", "professional" : [" teacher "]} 4332 | {" degree ": /" master ", "personality" : [" outgoing "and" careful "], "hobby" : [" tennis ", "badminton"], "professional" : [" teacher "]} 4997 | {" degree ": /" primary school ", "personality" : [" outgoing ", "care", "impetuous"], "hobby" : / "table tennis", "professional" : [" teacher "]} 5231 | {" degree ": [" undergraduate"], "personality" : ] [" outgoing "and" careful ", "love" : [] "golf", "professional" : 5360 | [" teacher "]} {" degree ": /" degree ", "personality" : [" impetuous ", "outgoing" and "careful"], "love" : [" climbing ", "tennis", "travel", "badminton"], "professional" : [" teacher "]} 6281 | {" degree ": [" specialist"], "personality" : [" carefully ", "outgoing", "self-confidence"], "hobby" : [" ski ", null], "professional" : [" teacher "]} 7681 | {" degree ": /" primary school ", "personality" : [" careless ", "outgoing" and "careful"], "hobby" : / "table tennis", "professional" : [" teacher "]} 8246 | {" degree ": [" master's"], "personality" : ] [" outgoing "and" careful ", "love" : [" ski ", "tennis", "golf"], "professional" : [" teacher "]} 8531 | {" degree ": /" master ", "personality" : [" carefully ", "outgoing", "careless"], "love" : ] [" skiing "and" climbing ", "professional" : 8618 | [" teacher "]} {" degree ": /" primary school ", "personality" : [" carefully ", "outgoing" and "impetuous"], "hobby" : / "table tennis", "professional" : [" teacher "]} 9508 | {" degree ": ] [" primary school ", "personality" : [" impetuous ", "outgoing" and "careful"], "hobby" : [" climbing ", "travel", "golf"], "professional" : [" teacher "]} (15 rows) Time: 1.495 msCopy the code

Tag @> ‘{” personality “:[” outgoing “,” careful “]}’ is a PostgreSQL jSONB operator meaning “contain”. If you want to query a doctor who is outgoing and careful, and likes skiing and swimming, then you can use the following SQL statement:

Osdba = # select * from user_tag where tag @ > '{" personality ": [" outgoing" and "careful"]}' and tag @ > '{" career ": [" doctors"]}' and tag @ > '{" hobby ": [" ski", "Swimming"]} '; uid | tag ------+-------------------------------------------------------------------------------------------------- 4469 | {" degree ": /" primary school ", "personality" : [" outgoing ", "care", "stable"], "hobby" : / "ski", "swimming", "professional" : [" doctors "]} (row 1) Time: 2.139 msCopy the code

As you can see from the above, it only takes 1~2ms to get the result. The reason is that we use GIN index, which can be checked by looking at the execution plan:

Osdba = # explain the select * from user_tag where tag @ > '{" personality ": [" outgoing" and "careful"]}' and tag @ > '{" career ": [" doctors"]}' and the tag @>'{" hobby ":[" skiing ", "swimming "]}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Bitmap Heap Scan on user_tag (cost = 48.00.. 52.02 rows = 1 width = 153) Recheck Cond: ((tag @ > '{" personality ": [" outgoing" AND "careful"]}' : : jsonb) AND the tag @ > '{" career ": [" doctors "]} ': : jsonb) AND (tag @ >' {" hobby ": [" ski", "swimming"]} ': : jsonb)) - > Bitmap Index Scan on idx_user_tag_tag (cost = 0.00.. 48.00 rows = 1 width = 0) Index Cond: ((tag @ > '{" personality ": [" outgoing" AND "careful"]}' : : jsonb) AND the tag @ > '{" career ": [" doctors "]} ': : jsonb) AND (tag @ >' {" hobby ": [" ski", "swimming"]} ': : jsonb)) (4 rows) Time: 1.736 msCopy the code

GIN index is the nuclear weapon in PostgreSQL, and you can see how powerful GIN index is. GIN can be used with json data types as well as array types. Jsonb provides two types of index operators (you can think of two types of indexes) for GIN indexes:

  • Jsonb_ops, this is the default, this is the type of index we built above,
  • jsonb_path_ops

Jsonb_ops supports @> and? And “? & “, “? | “operators such as query, the meaning of these operators, please see: official documentation and jsonb_path_ops only supports the” @ “> operator, of course jsonb_path_ops index are much smaller than json_ops index, specific can see:


osdba=# select pg_relation_size('idx_user_tag_tag');
 pg_relation_size
------------------
           245760
(1 row)

Time: 0.522 ms
osdba=# CREATE INDEX idx_user_tag_tag2 on user_tag using gin(tag jsonb_path_ops);
CREATE INDEX
Time: 46.947 ms
osdba=# select pg_relation_size('idx_user_tag_tag2');
 pg_relation_size
------------------
           147456
(1 row)
Copy the code

The jSONb_PATH_OPS index uses basically the same method as jSONb_OPS:

osdba=# drop index idx_user_tag_tag; DROP INDEX Time: 2.833 ms osdba = # select * from user_tag where tag @ > '{" personality ": [" outgoing" and "careful"]}' and tag @ > '{" career ": [" doctors"]}' and the tag @>'{" hobby ":[" skiing ", "swimming "]}'; uid | tag ------+-------------------------------------------------------------------------------------------------- 4469 | {" degree ": /" primary school ", "personality" : [" outgoing ", "care", "stable"], "hobby" : / "ski", "swimming", "professional" : [" doctors "]} (row 1) Time: 1.401ms osdba=# explain select * from user_tag where tag @> 'and tag @>' and tag @> ' @>'{" hobby ":[" skiing ", "swimming "]}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- Bitmap Heap Scan on user_tag (cost = 24.00.. 28.02 rows = 1 width = 153) Recheck Cond: ((tag @ > '{" personality ": [" outgoing" AND "careful"]}' : : jsonb) AND the tag @ > '{" career ": [" doctors "]} ': : jsonb) AND (tag @ >' {" hobby ": [" ski", "swimming"]} ': : jsonb)) - > Bitmap Index Scan on idx_user_tag_tag2 (cost = 0.00.. 24.00 rows = 1 width = 0) Index Cond: ((tag @ > '{" personality ": [" outgoing" AND "careful"]}' : : jsonb) AND the tag @ > '{" career ": [" doctors "]} ': : jsonb) AND (tag @ >' {" hobby ": [" ski", "swimming"]} ': : jsonb)) (4 rows) Time: 0.634 msCopy the code

At the end

You can write more SQL to do data analysis of user portraits, which I won’t go into here. It is also possible to use PostgreSQL array types to build tags, and array types also run GIN indexes, but in general it is not as intuitive as using JSON.