Abstract: GausSDB (DWS) supports XML data types and rich XML parsing functions, which can realize the mapping management function of relational data and XML data.

Summary of XML

XML is the abbreviation of Extensible Markup Language, which can describe very complex data structures and is widely used for transferring and storing data. XML is a markup language similar to HTML, but XML does not use predefined tags, which can be defined according to application requirements. The basic format of XML is standardized and enables data sharing between heterogeneous systems across platforms, operating systems, and applications.

XML data types

GausSDB (DWS) supports storing XML documents in XML data type columns in a database. The advantages of saving data via XML data type over text mode are that it has the function of data structure checking to ensure the structure is correct, and supports XML data parsing and processing functions.

The criteria for determining the correctness of an XML document are:

  • The document must be a well-formed document.
  • The document follows all the syntax rules of XML and is valid.
  • Documents follow semantically specific rules, which are typically specified in XML or DTD specifications

XML can store well-formed documents defined by the XML standard, as well as fragments of content defined by the XML standard, which can have multiple top-level elements or character nodes.

Here is an example of a well-formed XML document:

<? The XML version = "1.0" encoding = "utf-8"? > <message> Hello GaussDB(DWS) </message>

A well-formed document can be inserted into an XML column using an INSERT SQL statement, and if the document can be parsed successfully, the document is properly formed. The XML document is validated against configuration parameters before an insert or update operation is performed.

XML data in an application is typically in its serialized string format, and when the data is inserted into an XML column, it must be converted to XML hierarchical format. The XMLPARCEL function can therefore be explicitly invoked when an insert operation is performed to convert data from its serialized string format to XML hierarchical format.

test=# SELECT XMLPARSE(document '<root>GAUSSDB(DWS)</root>');
         xmlparse          
---------------------------
 <root>GAUSSDB(DWS)</root>
(1 row)

To access the XML value

When accessing and processing XML data, because the XML data is not represented as a string within the database, the XML data type does not provide a comparison operator, so it cannot be compared directly with a string. The result is that the rows cannot be retrieved by comparing the XML values with the search values, so the XML data should be accompanied by an ID value for retrieving the data.

Using the XMLSerialize function, XML values can be transformed into serialized string values that represent XML documents.

test=# SELECT XMLSERIALIZE(document '<root>GAUSSDB(DWS)</root>' AS TEXT);
       xmlserialize        
---------------------------
 <root>GAUSSDB(DWS)</root>
(1 row)

XML parsing function

At present, GAUSSDB (DWS) has supported more than 30 XML parsing functions, including parsing XML data, generating XML content, XML predicates, XML parameter setting, mapping data to XML and other functions.

A function that processes XML data

  • XPath evaluates the result of an XPath expression against an XML value
  • XMLTABLE parses XML data as an XPath expression

A function that generates XML content

  • XMLPARSE converts character data to XML type values
  • XMLSerialize converts XML types to strings
  • XMLComment creates a value that contains the specific textual content of the XML annotation
  • XMLConcat concatenates a list of individual XML values to create a single value that contains a fragment of XML content
  • XMLELEMENT generates an XML element with the given name, attributes, and content.
  • XMLFOREST generates an XML forest (sequence) element with the specified name and content
  • XMLPI creates an XML processing instruction
  • XMLRoot changes the root node attribute of an XML value
  • XMLAGG aggregate function, concatenates the input value of the aggregate function call

XML predicate function

  • IS DOCUMENT determines whether the XML value IS a DOCUMENT
  • Is NOT DOCUMENT determines whether the XML value IS a DOCUMENT
  • XMLEXISTS determines whether an XPath expression returns any nodes
  • XPATH_EXISTS determines whether the XPath expression returns any nodes
  • XML_IS_WELL_FORMED checks if the string is well-formed XML
  • The xml_is_well_formed_document checks whether the string is a well-formed XML document
  • XML_IS_WELL_FORMED_CONTENT checks that the string is not well-formed XML content

XML parameter setting

  • Set XML OPTION sets the XML format
  • Set XMLBinary To sets the encoding format of the binary value in the XML

Functions that map tables, queries, cursors, databases to XML

  • Table_to_xml, query_to_xml, cursor_to_xml, database_to_xml, etc.

Detailed use of each function can refer to the GausSDB (DWS) user manual, the following mainly introduces the common XML data parsing XMLTABLE function in the application.

Overview of the XMLTABLE function

The XMLTABLE function parses the XML data as an XPath expression and returns the data in a table from the defined columns, which can contain any SQL data type (including XML).

The XMLTABLE function allows you to pass XML data from a table or from a SELECT query as a variable to a specified XPath expression. The results of parsing the XML data through an XPath expression are used to generate column values in the table. The structure of the generated table is defined by the COLUMNS clause of XMLTABLE, and you can specify column names, data types, and how column values are generated to define the characteristics of the COLUMNS.

Let’s demonstrate how to use the XMLTABLE function by first creating the CUSTOMER table and inserting the XML data that contains CUSTOMER information.

CREATE TABLE CUSTOMER AS SELECT 1 AS ID,
XML $$
<ROWS>
  <ROW ID="1">
    <CUSTOMER_NAME>Tony</CUSTOMER_NAME>
    <PHONENUM>123-456-666</PHONENUM>
  </ROW>
  <ROW ID="2">
    <CUSTOMER_NAME>Serena</CUSTOMER_NAME>
    <PHONENUM>123-456-888</PHONENUM>
  </ROW>
  <ROW ID="3">
    <CUSTOMER_NAME>Tina</CUSTOMER_NAME>
    <PHONENUM>123-456-999</PHONENUM>
  </ROW>
</ROWS>
$$ AS INFO;

Parse the INFO column of the CUSTOMER table in the XMLTABLE function with the following SELECT statement.

SELECT XMLTABLE.* FROM CUSTOMER,
XMLTABLE('//ROWS/ROW'
        PASSING INFO
        COLUMNS ID INT PATH '@ID',
        NAME VARCHAR(64) PATH 'CUSTOMER_NAME',
        PHONENUM TEXT PATH 'PHONENUM');

 id |  name  |  phonenum   
----+--------+-------------
  1 | Tony   | 123-456-666
  2 | Serena | 123-456-888
  3 | Tina   | 123-456-999
(3 rows)

On GausSDB (DWS), the XMLTABLE function supports a STREAM query plan that is pushed down to the data node DN to execute. XML data is parsed on the data node to generate the XMLTABLE result table, and the results are gathered on the coordination node CN through Gather Stream. A STREAM plan that can be pushed down to a DN has better query performance.

XMLTABLE application case

In business scenarios, it is often encountered that XML documents need to be parsed. The XMLTABLE function can be used to quickly and conveniently complete the parsing of XML data and return the required data in the form of a table, which is convenient for further query and analysis of the data.

Here is an example of a parsing process for message data:

  1. Create a table to store the message data and insert the data.

CREATE TABLE MSGS AS SELECT 1 AS ID, ‘A,BB,CCC,DDDD,EEEEE’ AS MSG;

  1. Convert the message text data to XML data using the XMLPARSE function.
test=# SELECT XMLPARSE(content '<r><c>' || REPLACE(MSG, ',', '</c><c>') ||'</c></r>') AS XML_MSG FROM MSGS;
                          xml_msg                          
-----------------------------------------------------------
 <r><c>A</c><c>BB</c><c>CCC</c><c>DDDD</c><c>EEEEE</c></r>
(1 row)
  1. The XML data is parsed using the XMLTABLE function and the message content is returned one by one.
test=# SELECT MSG_CONTENT FROM
test-# (SELECT XMLPARSE(content '<r><c>' || REPLACE(MSG, ',', '</c><c>') ||'</c></r>') AS XML_MSG FROM MSGS),
test-# XMLTABLE('/r/c/text()' PASSING XML_MSG COLUMNS MSG_CONTENT VARCHAR(4000) PATH '.');
 msg_content 
-------------
 A
 BB
 CCC
 DDDD
 EEEEE
(5 rows)

As can be seen from the above figure, the parsed data of XMLTABLE is returned in the form of a table, on which the data can be further sorted, filtered and other operations. At the same time, GausSDB (DWS) also supports the processing of XML data in functions or stored procedures, making the development of the application very convenient.

conclusion

GausSDB (DWS) supports XML data types and abundant XML parsing functions. At the same time, the distributed architecture based on Shared Nothing has good parallel processing and expansion ability. The parsing task of XML data can be pushed down to the data node for parallel processing. It can fully meet the requirements of XML data parsing in the application.

This article is shared from the Huawei cloud community “GausSDB (DWS) XML data processing practice”, the original author: the wind of dawn.

Click on the attention, the first time to understand Huawei cloud fresh technology ~