
MySQL: extracting fields from JSON and dumping them as numbers

This is just a small data-migration statistics job. The data volume is not large; the tricky part is the intermediate processing steps and the thinking behind them.

There is no SQL optimization or index optimization content here, so please go easy on me.

At best, it shows how to work with JSON in MySQL.

Background

The user ophthalmology property table has about 986W (9.86 million) records. The goal is to parse eight fields out of the JSON property values of roughly 29W (290,000) target records into numbers and dump them into a statistics table for chart analysis.

The structures and data below are mostly made up by me and should not be taken too seriously.

The user ophthalmology property table is structured as follows:

CREATE TABLE `property` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'type. 0: record 1: template ',
  `recorderId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'Recorder ID',
  `userId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'user ID',
  `roleId` bigint(20) NOT NULL DEFAULT '0' COMMENT 'character ID',
  `type` tinyint(4) NOT NULL COMMENT 'field type. 0: text 1: alternative 2: time 3: picture 4: ICD10 9: New picture ',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE,
  KEY `idxUserIdRoleIdRecorderIdName` (`userId`,`roleId`,`recorderId`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='properties';

Problem analysis

1. Property values are in JSON format and need to be processed with JSON functions

The property values are JSON documents like the one below: fairly large, but only 8 field values are needed; these are extracted and mapped to different statistical indicators.

{..."sight": {
        "nakedEye": {
            "left": "0.9"."right": "0.6"
        },
        "correction": {
            "left": "1"."right": "1"}},..."axialLength": {
        "left": "21"."right": "12"
    },
    "korneaRadius": {
        "left": "34"."right": "33"},... }Copy the code

So, use the JSON function: json_extract(value, '$.key1.key2').

Note, however, that the extracted value keeps its double quotes. For example, json_extract(value, '$.sight.nakedEye.left') on the sample above returns "0.9", not 0.9; and if the field value is an empty string, the result is "".

So wrap the extraction in REPLACE to strip the quotes: REPLACE(json_extract(value, '$.sight.nakedEye.left'), '"', '').

If the field does not exist, the result is NULL, whether the missing key is the outer sight or the inner left.
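A minimal check of this behavior, run against literal JSON (MySQL 5.7+, where the JSON functions are available):

SELECT json_extract('{"sight": {"nakedEye": {"left": "0.9"}}}', '$.sight.nakedEye.left');
-- "0.9"  (quoted)
SELECT json_extract('{"sight": {"nakedEye": {"left": ""}}}', '$.sight.nakedEye.left');
-- ""     (empty value, still quoted)
SELECT json_extract('{"sight": {}}', '$.sight.nakedEye.left');
-- NULL   (key missing)
SELECT REPLACE(json_extract('{"sight": {"nakedEye": {"left": "0.9"}}}', '$.sight.nakedEye.left'), '"', '');
-- 0.9    (quotes stripped)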

2. The field content is non-standard and messy

Ideally, if everyone had filled in clean numbers, the data could be imported into the new table directly after the step above.

Reality, however, is crueler: what people actually filled in is all over the place. For example:

  • Number + remark: 1(poor cooperation), 1-/1+ (I guess this means slightly low or high)
  • Number + unit: similar to the above, e.g. 1mm
  • Multiple values or a range: 22.52/42.45, 1 to 5
  • Plain text description: doesn't cooperate, unable to record
  • Mixed text and numbers: up 10 from last time, <1, less than 1, BD234/KD23

No way around it but to check with product and the business side. Fortunately there are not many such records, a little over 4,000, and a rough scan gives a good idea of them. The rules we settled on:

  • Starts with a number: the leading number is the correctly recorded data; drop the text description
  • Multiple values or a range: take the first number
  • Plain text: means no data; exclude it
  • Mixed text and numbers: handle case by case; deal with the rest first and see how many are left

So how do we do that?

Step 1: Exclude the data that is already a normal number, as well as the empty data

WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 -- NULL is already excluded by this condition
  AND `nakedEyeLeft` != ''
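A quick sanity check of how the pattern [^0-9.] (any character that is not a digit or a dot) classifies the kinds of values listed earlier:

SELECT '0.9' REGEXP '[^0-9.]';                 -- 0: already a clean number, filtered out
SELECT '1(poor cooperation)' REGEXP '[^0-9.]'; -- 1: number + remark, needs fixing
SELECT '22.52/42.45' REGEXP '[^0-9.]';         -- 1: multiple values, needs fixing
SELECT 'unable to record' REGEXP '[^0-9.]';    -- 1: plain text, needs fixing
SELECT NULL REGEXP '[^0-9.]';                  -- NULL, so the "= 1" check drops NULLs as well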

Step 2: If the value contains no digit at all, set it to NULL or an empty string

SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]', '', nakedEyeLeft)

Step 3: For data that starts with a number, extract the leading number

-- string + 0 keeps the leading number of the string, or gives 0 if it does not start with one
SET nakedEyeLeft = IF(nakedEyeLeft + 0 = 0, nakedEyeLeft, nakedEyeLeft + 0)
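This step relies on MySQL's implicit string-to-number conversion: string + 0 converts the leading numeric prefix and ignores the rest, producing 0 when there is no leading number. A few illustrations:

SELECT '1(poor cooperation)' + 0;  -- 1     : remark dropped
SELECT '22.52/42.45' + 0;          -- 22.52 : first value of a multi-value entry
SELECT '1mm' + 0;                  -- 1     : unit dropped
SELECT 'unable to record' + 0;     -- 0     : no leading number, falls into the "= 0" branch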

Put it all together

SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]', '',
                      IF(nakedEyeLeft + 0 = 0, nakedEyeLeft, nakedEyeLeft + 0))
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 -- NULL is already excluded by this condition
  AND `nakedEyeLeft` != ''

PS: the SQL for a single field looks simple, but since all 8 fields are handled in one statement, the combined statement gets quite long.

Be careful not to miswrite the fields.

Finally there is the fourth category, the mix of text and numbers: a little over 40 records.

Some of the simpler ones can be handled automatically with regexes, such as <1 and less than 1.

Records expressed as an increase need the previous record to calculate, e.g. "up 10 from last time".

The rest are more complicated and need manual processing to extract usable data, such as BD234/KD23.

You can probably feel how tedious this is getting.

I was ready to grit my teeth and do it, but then the business side said to just treat these as 0; whenever a 0 turns up later, the correct value can simply be re-saved through the page.

That simplifies things: there is no need to check whether the value starts with a number, just add 0. If it starts with a number, the leading number is kept; otherwise the result is 0.

The final data formatting SQL:

UPDATE property 
SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]', '', nakedEyeLeft + 0)
WHERE `nakedEyeLeft` REGEXP '[^0-9.]' = 1 -- NULL is already excluded by this condition
  AND `nakedEyeLeft` != '';
3. Extracting, formatting, and importing: scanning 900W+ rows is too slow

The rows we need are filtered by ownerType, name, and type, and the existing indexes on the property table cannot serve that combination of conditions.

Querying directly means a full table scan, on top of which come the JSON extraction and formatting; moreover, other tables need to be joined to fill in the remaining fields of the statistical indicators.

Importing straight into the statistics table this way would keep the two tables plus the joined tables locked for a long time, during which nothing could be updated or inserted, which is not acceptable.

Reduce the number of scanned rows

Approach 1: add an index on name, ownerType, and type, reducing the scanned records to about 20W.

The problem is that this means building an index over 900W rows and then dropping it once it has served its purpose (the business does not otherwise need it), which causes two rounds of load fluctuation.

Add the lock time of the subsequent processing and the impact is still considerable.
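For reference, approach 1 would look roughly like this (the index name is made up):

-- Add a temporary index covering the filter columns ...
ALTER TABLE property ADD INDEX idxNameOwnerTypeType (`name`, `ownerType`, `type`);
-- ... run the extraction / import ...
-- ... then drop it again, causing a second round of load on the 900W-row table:
ALTER TABLE property DROP INDEX idxNameOwnerTypeType;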

Approach 2: use a table with fewer records as the driving table and join it to the target table.

CREATE TABLE `property` (
  `ownerId` int(11) NOT NULL COMMENT 'Record ID or template ID',
  `ownerType` tinyint(4) NOT NULL COMMENT 'type. 0: record 1: template ',
  `type` tinyint(4) NOT NULL COMMENT 'field type. 0: text 1: alternative 2: time 3: picture 4: ICD10 9: New picture ',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Field name',
  `value` mediumtext NOT NULL COMMENT 'Field value',
  -- other fields omitted
  UNIQUE KEY `idxOwnerIdOwnerTypeNameType` (`ownerType`,`ownerId`,`name`,`type`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='properties';

The property table's ownerId can be joined to the record table (medicalrecord, below); combined with the earlier conditions on name, ownerType, and type, the query can hit the unique key `idxOwnerIdOwnerTypeNameType` (ownerType, ownerId, name, type).

CREATE TABLE `medicalrecord` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) NOT NULL DEFAULT '' COMMENT 'Record Name',
  `type` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'Record type',
  -- other fields omitted
  KEY `idxName` (`name`) USING BTREE
) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4 COMMENT='record';

With the condition name = 'ophthalmology record', the query hits idxName and scans only about 2W rows in medicalrecord; the join then touches about 29W rows in the property table, roughly 30W rows in total, about 30 times fewer than a full scan of the property table!
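A sketch of what the join-driven import into the intermediate table might look like; the record name, the property field name, and the type value are hypothetical here:

-- medicalrecord is the driving table (hits idxName, ~2W rows),
-- property is joined on the prefix of idxOwnerIdOwnerTypeNameType (ownerType, ownerId, name, type):
INSERT INTO propertytmp (`value`, `updated`)
SELECT p.`value`, NOW()
FROM medicalrecord m
JOIN property p
  ON p.ownerType = 0                       -- 0: record
 AND p.ownerId   = m.id
 AND p.`name`    = 'eyeExamination'        -- hypothetical property field name
 AND p.type      = 0                       -- 0: text
WHERE m.`name` = 'ophthalmology record';   -- hypothetical record name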

Avoid long table locks during data extraction and formatting

There are eight fields, each of which needs extraction and formatting with conditional logic in between; doing all of it in a single SQL statement would repeat the same extraction and formatting work many times.

To avoid that, an intermediate table is used to temporarily hold the extraction and formatting results.

CREATE TABLE `propertytmp` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
   `value` mediumtext NOT NULL COMMENT 'Field value',
  `nakedEyeLeft` varchar(255) DEFAULT NULL COMMENT 'Vision - naked eye - Left eye',
  `nakedEyeRight` varchar(255) DEFAULT NULL COMMENT 'Vision - naked eye - Right eye',
  `correctionLeft` varchar(255) DEFAULT NULL COMMENT 'Vision - Correction - Left Eye',
  `correctionRight` varchar(255) DEFAULT NULL COMMENT 'Vision - Correction - Right eye',
  `axialLengthLeft` varchar(255) DEFAULT NULL COMMENT 'Axial length - left eye',
  `axialLengthRight` varchar(255) DEFAULT NULL COMMENT 'Axial length - right eye',
  `korneaRadiusLeft` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - Left eye',
  `korneaRadiusRight` varchar(255) DEFAULT NULL COMMENT 'Corneal curvature - right eye',
  `updated` datetime NOT NULL COMMENT 'Update Time',
  `deleted` tinyint(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8mb4;

The data is first imported into the table, extracted from it, and then formatted.

Finally, the results are compared

Data import comparison

Results: importing into the intermediate table via a full scan of the property table took 40s; adding an index to the property table and then importing took 6s + 3s; the join-driven import took 1.4s.

Because other tables still had to be joined, it was not as fast as expected.

Intermediate table data extraction: 7.5s
UPDATE `propertytmp` 
SET nakedEyeLeft = REPLACE(json_extract(value, '$.sight.nakedEye.left'), '"', ''),
    nakedEyeRight = REPLACE(json_extract(value, '$.sight.nakedEye.right'), '"', ''),
    correctionLeft = REPLACE(json_extract(value, '$.sight.correction.left'), '"', ''),
    correctionRight = REPLACE(json_extract(value, '$.sight.correction.right'), '"', ''),
    axialLengthLeft = REPLACE(json_extract(value, '$.axialLength.left'), '"', ''),
    axialLengthRight = REPLACE(json_extract(value, '$.axialLength.right'), '"', ''),
    korneaRadiusLeft = REPLACE(json_extract(value, '$.korneaRadius.left'), '"', ''),
    korneaRadiusRight = REPLACE(json_extract(value, '$.korneaRadius.right'), '"', '');
Intermediate table data formatting: 2.3s

The regex checks were faster than I expected.

UPDATE propertytmp 
SET nakedEyeLeft = IF(nakedEyeLeft NOT REGEXP '[0-9]' AND nakedEyeLeft != '', '', nakedEyeLeft + 0), 
    nakedEyeRight = IF(nakedEyeRight NOT REGEXP '[0-9]' AND nakedEyeRight != '', '', nakedEyeRight + 0), 
    correctionLeft = IF(correctionLeft NOT REGEXP '[0-9]' AND correctionLeft != '', '', correctionLeft + 0),
    correctionRight = IF(correctionRight NOT REGEXP '[0-9]' AND correctionRight != '', '', correctionRight + 0),
    axialLengthLeft = IF(axialLengthLeft NOT REGEXP '[0-9]' AND axialLengthLeft != '', '', axialLengthLeft + 0),
    axialLengthRight = IF(axialLengthRight NOT REGEXP '[0-9]' AND axialLengthRight != '', '', axialLengthRight + 0),
    korneaRadiusLeft = IF(korneaRadiusLeft NOT REGEXP '[0-9]' AND korneaRadiusLeft != '', '', korneaRadiusLeft + 0),
    korneaRadiusRight = IF(korneaRadiusRight NOT REGEXP '[0-9]' AND korneaRadiusRight != '', '', korneaRadiusRight + 0)
WHERE (`nakedEyeLeft` REGEXP '[^0-9.]' = 1 AND `nakedEyeLeft` != '')
   OR (`nakedEyeRight` REGEXP '[^0-9.]' = 1 AND `nakedEyeRight` != '')
   OR (`correctionLeft` REGEXP '[^0-9.]' = 1 AND `correctionLeft` != '')
   OR (`correctionRight` REGEXP '[^0-9.]' = 1 AND `correctionRight` != '')
   OR (`axialLengthLeft` REGEXP '[^0-9.]' = 1 AND `axialLengthLeft` != '')
   OR (`axialLengthRight` REGEXP '[^0-9.]' = 1 AND `axialLengthRight` != '')
   OR (`korneaRadiusLeft` REGEXP '[^0-9.]' = 1 AND `korneaRadiusLeft` != '')
   OR (`korneaRadiusRight` REGEXP '[^0-9.]' = 1 AND `korneaRadiusRight` != '');
Intermediate table for the statistics table

At this point the empty data still has to be excluded and other tables joined to fill in the remaining fields before the actual import into the statistics table.

To reduce the impact on the statistics table, an intermediate table with the same structure is created for it, with its auto-increment starting at the statistics table's current value + 10000.

The slow work writes into this statistics intermediate table first; the final move into the statistics table is then just a quick INSERT ... SELECT.
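A sketch of that flow, with statistic standing in for the real statistics table name:

-- Same structure as the target table, auto-increment pushed past its current max + 10000
-- (the value below is a placeholder) so the ids never collide on the final copy:
CREATE TABLE statistictmp LIKE statistic;
ALTER TABLE statistictmp AUTO_INCREMENT = 100000;

-- ... the slow joins / filtering write into statistictmp here ...

-- The final move into the real table is a single quick INSERT ... SELECT:
INSERT INTO statistic SELECT * FROM statistictmp;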

Of course this is a bit over-cautious, but better safe than sorry when it comes to avoiding fluctuations in production.

Conclusion

This is a simple record of the data migration experience.

There is no index optimization or SQL optimization content here; the point is simply that we need to keep this kind of performance impact on users in mind.