Welcome to my GitHub

Github.com/zq2599/blog…

Content: classified summaries of all my original articles, plus supporting source code, covering Java, Docker, Kubernetes, DevOps, and more;

Overview of this article

  1. This article is the eleventh in the Hive Learning Notes series. So far we have studied UDFs and UDAFs; the UDTF is the last one remaining, and it is the main content of this article.
  2. UDTF stands for user-defined table-generating function;
  3. In previous posts, we already used explode, a built-in UDTF:
hive> select explode(address) from t3;
OK
province	guangdong
city	shenzhen
province	jiangsu
city	nanjing
Time taken: 0.081 seconds, Fetched: 4 row(s)
  4. This article works through two UDTF examples: splitting one column into multiple columns, and splitting one column into multiple rows (with multiple columns per row);
  5. Now on to the real thing;

Download the source code

  1. If you don’t want to write the code yourself, you can download all of the source code from GitHub; the address and link information is as follows:
Name | Link | Note
Project home page | Github.com/zq2599/blog… | The project’s home page on GitHub
Git repository address (HTTPS) | Github.com/zq2599/blog… | Address of the project’s source repository, HTTPS protocol
Git repository address (SSH) | [email protected]:zq2599/blog_demos.git | Address of the project’s source repository, SSH protocol
  2. This git project contains multiple folders; the code for this article is in the hiveudf folder, as shown in the red box below:

The preparatory work

To verify the function of UDTF, we need to prepare the table and data:

  1. Create table t16;
create table t16(
person_name  string,
string_field string
)
row format delimited 
fields terminated by '|'
stored as textfile;
  2. Create a local text file 016.txt with the following contents:
tom|1:province:guangdong
jerry|2:city:shenzhen
john|3
  3. Import the data:
load data 
local inpath '/home/hadoop/temp/202010/25/016.txt' 
overwrite into table t16;
  4. The data is ready; let’s start coding;

Key points of UDTF development

  1. You need to inherit the GenericUDTF class;
  2. Override the initialize method, whose input parameter is of type StructObjectInspector and describes the UDTF’s input arguments;
  3. The return value of initialize is also of type StructObjectInspector, and the name and type of each column the UDTF generates are set in this return value;
  4. Override the process method; the extraction logic goes there. Once the data for every output column is ready, put it into an array and call the forward method once; each forward call emits one row of the result;
  5. Overriding the close method is not required; if there are resources to release once the business logic is done, that code goes there;
  6. Next, we develop the UDTFs in accordance with the above key points; a minimal skeleton illustrating this lifecycle is shown below;
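
To make these points concrete, here is a minimal GenericUDTF skeleton. It is only a sketch: the class name MyUdtfSkeleton and its two output columns are invented for illustration; the two real UDTFs developed below follow this same shape:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.Arrays;

// A minimal skeleton showing the GenericUDTF lifecycle (hypothetical example class)
public class MyUdtfSkeleton extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Validate argOIs (argument count and types) here, then declare the
        // names and types of the columns this UDTF generates
        return ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("col1", "col2"),
                Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector,
                        PrimitiveObjectInspectorFactory.javaStringObjectInspector));
    }

    @Override
    public void process(Object[] args) throws HiveException {
        // Business logic goes here; each forward() call emits one output row,
        // so call it as many times as the input should expand to
        forward(new String[]{"value1", "value2"});
    }

    @Override
    public void close() throws HiveException {
        // Release resources here if any were acquired; often left empty
    }
}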

Splitting one column into multiple columns

  • The UDTF developed next is named WordSplitSingleRow; it splits its input argument into multiple columns;
  • The red box in the figure below marks the string_field value of one original record in t16, which WordSplitSingleRow will process:

  • After the field in the red box above is processed by the UDTF, one column becomes three columns. The name of each column is shown in the yellow box below, and the value of each column in the red box below:

  • So that’s what we’re going to do;
  • Open the hiveudf project and create a new WordSplitSingleRow.java:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.List;

/**
 * @Description: Splits the specified field into multiple columns
 * @author: willzhao E-mail: [email protected]
 * @date: 2020/11/5
 */
public class WordSplitSingleRow extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;

    private final static String[] EMPTY_ARRAY = {"NULL", "NULL", "NULL"};

    /**
     * The logic for splitting one column into multiple columns lives here
     * @param args
     * @throws HiveException
     */
    @Override
    public void process(Object[] args) throws HiveException {

        String input = stringOI.getPrimitiveJavaObject(args[0]).toString();

        // Invalid string
        if (StringUtils.isBlank(input)) {
            forward(EMPTY_ARRAY);
        } else {

            // Split the string
            String[] array = input.split(":");

            // If the resulting array is invalid, return the original string plus an error message
            if (null == array || array.length < 3) {
                String[] errRlt = new String[3];
                errRlt[0] = input;
                errRlt[1] = "can not split to valid array";
                errRlt[2] = "-";

                forward(errRlt);
            } else {
                forward(array);
            }
        }
    }

    /**
     * Resource cleanup would go here; this example has no resources to free
     * @throws HiveException
     */
    @Override
    public void close() throws HiveException {
    }

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {

        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();

        // This UDTF accepts exactly one parameter
        if (1 != inputFields.size()) {
            throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
        }

        // This UDTF only handles primitive (string) types
        if (!Category.PRIMITIVE.equals(inputFields.get(0).getFieldObjectInspector().getCategory())) {
            throw new UDFArgumentException("ExplodeMap takes string as a parameter");
        }

        stringOI = (PrimitiveObjectInspector) inputFields.get(0).getFieldObjectInspector();

        // The column names
        ArrayList<String> fieldNames = new ArrayList<String>();

        // The object inspector of each column
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        // The first column's name
        fieldNames.add("id");
        // The first column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The second column's name
        fieldNames.add("key");
        // The second column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The third column's name
        fieldNames.add("value");
        // The third column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
}
  • The focus of this code is the process method: it takes the input parameter, splits the string on colons to get an array, and then calls the forward method to emit a row with three columns. A sketch of how to exercise this class locally follows.
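
Before deploying to Hive, the class can also be driven from a plain Java main method: GenericUDTF exposes a public setCollector method, and forward hands every output row to that collector. The driver class below is hypothetical (it is not part of the article’s project) and assumes hive-exec is on the classpath:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.Arrays;

// Hypothetical local driver for WordSplitSingleRow, for quick experiments only
public class WordSplitSingleRowDemo {

    public static void main(String[] args) throws Exception {
        WordSplitSingleRow udtf = new WordSplitSingleRow();

        // Describe the single string argument, mirroring what Hive passes at query time
        udtf.initialize(ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("string_field"),
                Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector)));

        // Capture each forward() call instead of feeding rows into a Hive query plan
        udtf.setCollector(row -> System.out.println(Arrays.toString((Object[]) row)));

        udtf.process(new Object[]{"1:province:guangdong"}); // prints [1, province, guangdong]
        udtf.process(new Object[]{"3"});                    // prints [3, can not split to valid array, -]
        udtf.close();
    }
}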

Verify UDTF

Next, deploy WordSplitSingleRow as a temporary function and verify it;

  1. After coding is complete, run the mvn clean package -U command in the directory containing pom.xml;
  2. In the target directory you get the file hiveudf-1.0-SNAPSHOT.jar;
  3. Copy the jar to the Hive server. I put it in this directory: /home/hadoop/udf/
  4. In a Hive session, run the following command to add the local jar:
add jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  5. Deploy the temporary function:
create temporary function udf_wordsplitsinglerow as 'com.bolingcavalry.hiveudf.udtf.WordSplitSingleRow';
  6. Run the following SQL to verify:
select udf_wordsplitsinglerow(string_field) from t16;
  7. The string_field value of each row is split into id, key, and value fields:
hive> select udf_wordsplitsinglerow(string_field) from t16;
OK
id	key	value
1	province	guangdong
2	city	shenzhen
3	can not split to valid array	-
Time taken: 0.066 seconds, Fetched: 3 row(s)

Key points to note

  • It is important to note that a UDTF cannot appear in a SELECT clause together with other fields. For example, the following SQL will fail:
select person_name,udf_wordsplitsinglerow(string_field) from t16;
  • The error message is as follows:
hive> select person_name,udf_wordsplitsinglerow(string_field) from t16;
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
  • To query other fields together with the UDTF output, use the LATERAL VIEW syntax; the SQL is as follows:
select t.person_name, udtf_id, udtf_key, udtf_value
from (
    select person_name, string_field 
    from  t16
) t LATERAL VIEW udf_wordsplitsinglerow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
  • Now the specified field and the UDTF output are displayed together:
hive> select t.person_name, udtf_id, udtf_key, udtf_value
    > from (
    >     select person_name, string_field 
    >     from  t16
    > ) t LATERAL VIEW udf_wordsplitsinglerow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
OK
t.person_name	udtf_id	udtf_key	udtf_value
tom	1	province	guangdong
jerry	2	city	shenzhen
john	3	can not split to valid array	-
Time taken: 0.122 seconds, Fetched: 3 row(s)

Split a column into multiple rows (multiple columns per row)

  • WordSplitSingleRow, tried above, splits string_field into id, key, and value fields, but the total number of rows stays unchanged. The next UDTF, WordSplitMultiRow, splits one string_field value into multiple records, each with three fields;
  • To import new data into the t16 table, first create a text file named 016_multi.txt with the following contents:
tom|1:province:guangdong,4:city:yangjiang
jerry|2:city:shenzhen
john|3
  • Run the following command in the Hive session window to overwrite the existing contents of the t16 table with 016_multi.txt:
load data 
local inpath '/home/hadoop/temp/202010/25/016_multi.txt' 
overwrite into table t16;
  • The red box below marks the string_field value of one record. The UDTF we are about to develop first splits it on commas, yielding the two strings 1:province:guangdong and 4:city:yangjiang; each of these is then split on colons, producing two records of id, key, and value, i.e. multiple rows with multiple columns:

  • The expected UDTF result looks like this, with red and yellow boxes both coming from the string_field value of a single record:

  • WordSplitMultiRow differs from WordSplitSingleRow only in its process method, which may execute the forward operation multiple times, so multiple records are produced (a local driver sketch follows the listing):
package com.bolingcavalry.hiveudf.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector.Category;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import java.util.ArrayList;
import java.util.List;

/**
 * @Description: Splits the specified field into multiple rows, each row having multiple columns
 * @author: willzhao E-mail: [email protected]
 * @date: 2020/11/5
 */
public class WordSplitMultiRow extends GenericUDTF {

    private PrimitiveObjectInspector stringOI = null;

    private final static String[] EMPTY_ARRAY = {"NULL", "NULL", "NULL"};

    /**
     * The logic for splitting one column into multiple rows lives here
     * @param args
     * @throws HiveException
     */
    @Override
    public void process(Object[] args) throws HiveException {
        String input = stringOI.getPrimitiveJavaObject(args[0]).toString();

        // Invalid string
        if (StringUtils.isBlank(input)) {
            forward(EMPTY_ARRAY);
        } else {

            // Separate on commas first
            String[] rowArray = input.split(",");

            // Handle the failure case
            if (null == rowArray || rowArray.length < 1) {
                String[] errRlt = new String[3];
                errRlt[0] = input;
                errRlt[1] = "can not split to valid row array";
                errRlt[2] = "-";

                forward(errRlt);
            } else {
                // Each element of rowArray is a string like "id:key:value"
                for (String singleRow : rowArray) {

                    // Make sure the string is valid
                    if (StringUtils.isBlank(singleRow)) {
                        forward(EMPTY_ARRAY);
                    } else {
                        // Split the string
                        String[] array = singleRow.split(":");

                        // If the resulting array is invalid, return the original string plus an error message
                        if (null == array || array.length < 3) {
                            String[] errRlt = new String[3];
                            errRlt[0] = input;
                            errRlt[1] = "can not split to valid array";
                            errRlt[2] = "-";

                            forward(errRlt);
                        } else {
                            forward(array);
                        }
                    }
                }
            }
        }
    }

    /**
     * Resource cleanup would go here; this example has no resources to free
     * @throws HiveException
     */
    @Override
    public void close() throws HiveException {
    }

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {

        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();

        // This UDTF accepts exactly one parameter
        if (1 != inputFields.size()) {
            throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
        }

        // This UDTF only handles primitive (string) types
        if (!Category.PRIMITIVE.equals(inputFields.get(0).getFieldObjectInspector().getCategory())) {
            throw new UDFArgumentException("ExplodeMap takes string as a parameter");
        }

        stringOI = (PrimitiveObjectInspector) inputFields.get(0).getFieldObjectInspector();

        // The column names
        ArrayList<String> fieldNames = new ArrayList<String>();

        // The object inspector of each column
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        // The first column's name
        fieldNames.add("id");
        // The first column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The second column's name
        fieldNames.add("key");
        // The second column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        // The third column's name
        fieldNames.add("value");
        // The third column's inspector is of type string
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }
}
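
The multiple forward calls can be observed directly with the same local-driver trick used for WordSplitSingleRow. Again, the driver class below is hypothetical and assumes hive-exec is on the classpath; it collects every emitted row in a list:
package com.bolingcavalry.hiveudf.udtf;

import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical local driver for WordSplitMultiRow, for quick experiments only
public class WordSplitMultiRowDemo {

    public static void main(String[] args) throws Exception {
        List<Object[]> rows = new ArrayList<>();

        WordSplitMultiRow udtf = new WordSplitMultiRow();
        udtf.initialize(ObjectInspectorFactory.getStandardStructObjectInspector(
                Arrays.asList("string_field"),
                Arrays.asList((ObjectInspector) PrimitiveObjectInspectorFactory.javaStringObjectInspector)));

        // Every forward() call lands here as one output row
        udtf.setCollector(emitted -> rows.add((Object[]) emitted));

        // One input value, two forward() calls, two output rows
        udtf.process(new Object[]{"1:province:guangdong,4:city:yangjiang"});
        udtf.close();

        for (Object[] row : rows) {
            System.out.println(Arrays.toString(row)); // [1, province, guangdong] then [4, city, yangjiang]
        }
    }
}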

Verify UDTF

Next, deploy WordSplitMultiRow as a temporary function and validate it;

  1. After coding is complete, run the mvn clean package -U command in the directory containing pom.xml;
  2. In the target directory you get the file hiveudf-1.0-SNAPSHOT.jar;
  3. Copy the jar to the Hive server. I put it in this directory: /home/hadoop/udf/
  4. If you are still in the same Hive session as before, you need to remove the earlier jar and function first:
drop temporary function if exists udf_wordsplitsinglerow;
delete jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  5. In the Hive session, run the following command to add the local jar:
add jar /home/hadoop/udf/hiveudf-1.0-SNAPSHOT.jar;
  6. Deploy the temporary function:
create temporary function udf_wordsplitmultirow as 'com.bolingcavalry.hiveudf.udtf.WordSplitMultiRow';
  7. Run the following SQL to verify:
select udf_wordsplitmultirow(string_field) from t16;
  8. The string_field value of each row is split into one or more records, each with id, key, and value fields:
hive> select udf_wordsplitmultirow(string_field) from t16;
OK
id	key	value
1	province	guangdong
4	city	yangjiang
2	city	shenzhen
3	can not split to valid array	-
Time taken: 0.041 seconds, Fetched: 4 row(s)
  9. Use the LATERAL VIEW syntax to query other fields along with the UDTF output; the SQL is as follows:
select t.person_name, udtf_id, udtf_key, udtf_value
from (
    select person_name, string_field 
    from  t16
) t LATERAL VIEW udf_wordsplitmultirow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
  10. The results are as follows, in line with expectations:
hive> select t.person_name, udtf_id, udtf_key, udtf_value
    > from (
    >     select person_name, string_field 
    >     from  t16
    > ) t LATERAL VIEW udf_wordsplitmultirow(t.string_field) v as  udtf_id, udtf_key, udtf_value;
OK
t.person_name	udtf_id	udtf_key	udtf_value
tom	1	province	guangdong
tom	4	city	yangjiang
jerry	2	city	shenzhen
john	3	can not split to valid array	-
Time taken: 0.056 seconds, Fetched: 4 row(s)
  • At this point, we have learned and practiced all three kinds of Hive user-defined functions. I hope this content gives you a useful reference for your own practice;

You are not alone: Xinchen’s originals accompany you all the way

  1. Java series
  2. Spring series
  3. Docker series
  4. Kubernetes series
  5. Database + middleware series
  6. DevOps series

Welcome to follow my WeChat official account: Programmer Xinchen

Search WeChat for “programmer Xinchen”. I am Xinchen, looking forward to exploring the Java world with you…

Github.com/zq2599/blog…