A UDF takes one row of input and returns one result
This covers most day-to-day needs
UDF implementation method
Hive provides two methods to implement UDF:
First: inherit the UDF class
- Advantages:
- Simple to implement
- Supports Hive basic types, arrays, and maps
- Support for function overloading
- Disadvantages:
- Only suitable for implementing simple functions with simple logic
This approach needs little code, keeps the logic clear, and lets you implement a simple UDF quickly
Second: inherit the GenericUDF class
- Advantages:
- Supports parameters of any length and type
- You can implement different logic depending on the number and type of arguments
- Lets you add logic to initialize and close resources (initialize, close)
- Disadvantages:
- The implementation is a little more complicated than inheriting the UDF class
GenericUDF is more flexible and can implement more complex functions than a class that inherits UDF
About the choice between the two
Prefer inheriting the UDF class if the function has the following characteristics:
- Simple logic, such as a function that converts English text to lowercase
- The parameter and return value types are simple: Hive basic types, arrays, or maps
- There is no need to initialize or close resources
Otherwise, consider inheriting the GenericUDF class
The steps for both implementations are described below
Inherit the UDF class
The first approach is the simplest, creating a new class that inherits the UDF and then writing evaluate()
import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Inherits org.apache.hadoop.hive.ql.exec.UDF
 */
public class SimpleUDF extends UDF {

    /**
     * Write a function that meets the following requirements:
     * 1. The function name must be evaluate
     * 2. The parameter and return value types can be: Java primitive types, Java wrapper classes,
     *    org.apache.hadoop.io.Writable types, List, Map, etc.
     * 3. The function must return a value; it cannot be void
     */
    public int evaluate(int a, int b) {
        return a + b;
    }

    /**
     * Overloads are supported
     */
    public Integer evaluate(Integer a, Integer b, Integer c) {
        if (a == null || b == null || c == null)
            return 0;
        return a + b + c;
    }
}
Inheriting the UDF class is quite simple, but there are a few caveats:
- The evaluate() method is not inherited from the UDF class, so it is not an override; Hive looks it up by name through reflection.
- evaluate() cannot have a void return type.
Supported parameter and return value types
Hive basic types, arrays, and maps are supported
Hive Basic Types
On the Java side you can use Java primitive types, Java wrapper classes, or the corresponding Writable classes
PS: For Hive basic types, it is best not to use Java primitive types. The UDF reports an error when null is passed to a Java primitive parameter, whereas a Java wrapper class lets you check for null
Hive type | Java primitive types | The Java wrapper class | hadoop.io.Writable |
---|---|---|---|
tinyint | byte | Byte | ByteWritable |
smallint | short | Short | ShortWritable |
int | int | Integer | IntWritable |
bigint | long | Long | LongWritable |
string | String | – | Text |
boolean | boolean | Boolean | BooleanWritable |
float | float | Float | FloatWritable |
double | double | Double | DoubleWritable |
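The null caveat above can be reproduced without Hive. The following is a minimal sketch, assuming that evaluate() is invoked reflectively with boxed arguments; the `NullArgumentDemo` and `Adders` names are hypothetical, and `Adders` deliberately does not extend UDF so that the code compiles with only the JDK.

```java
import java.lang.reflect.Method;

class NullArgumentDemo {

    /** Hypothetical evaluate() overloads; a real Hive UDF would also declare "extends UDF". */
    public static class Adders {
        // Primitive parameters have no way to represent a SQL NULL.
        public int evaluate(int a, int b) {
            return a + b;
        }

        // Wrapper parameters may be null, so the method can test for it.
        public Integer evaluate(Integer a, Integer b) {
            if (a == null || b == null) {
                return null;
            }
            return a + b;
        }
    }

    /** Reflectively calls the evaluate(paramType, paramType) overload. */
    public static Object call(Class<?> paramType, Object a, Object b) {
        try {
            Method m = Adders.class.getMethod("evaluate", paramType, paramType);
            return m.invoke(new Adders(), a, b);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true when an argument cannot be converted to the parameter type. */
    public static boolean rejects(Class<?> paramType, Object a, Object b) {
        try {
            call(paramType, a, b);
            return false;
        } catch (IllegalArgumentException | NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(call(int.class, 1, 2));        // both arguments present
        System.out.println(call(Integer.class, null, 2)); // wrapper overload sees the null
        System.out.println(rejects(int.class, null, 2));  // primitive overload cannot take null
    }
}
```

The wrapper overload gets a chance to decide what a null input means, while the primitive overload fails before its body runs.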
Array and Map
Hive type | Java type |
---|---|
array | List |
Map<K, V> | Map<K, V> |
Inherit the GenericUDF class
This approach is the most flexible, but also a bit more complex to implement than the previous one
After GenericUDF is inherited, three methods must be implemented:
- initialize()
- evaluate()
- getDisplayString()
initialize()
/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once.
 *
 * @param arguments
 *          ObjectInspector instances for the custom UDF's arguments
 * @throws UDFArgumentException
 *           Thrown if the number or type of arguments is incorrect
 * @return The function's return value type
 */
public abstract ObjectInspector initialize(ObjectInspector[] arguments)
    throws UDFArgumentException;
initialize() is called once when the GenericUDF is initialized. It performs the setup work, including:
- Determine the number of function arguments
- Determine the function parameter type
- Determine the return value type of the function
In addition, users can perform some customized initialization operations, such as initializing the HDFS client
First: determine the number of function parameters
The number of function arguments can be determined by the length of the arguments array
Example to determine the number of function parameters:
// 1. Check the number of arguments; exactly one is expected
if (arguments.length != 1)
    // Throw an exception when the custom UDF arguments do not match expectations
    throw new UDFArgumentException("The function requires one argument");
Second: determine the type of function parameters
An ObjectInspector can be used to check the data type of a parameter. It has an enum, Category, that represents the type the current ObjectInspector describes
public interface ObjectInspector extends Cloneable {
    public static enum Category {
        PRIMITIVE, // Hive primitive types
        LIST,      // Hive array
        MAP,       // Hive map
        STRUCT,    // struct
        UNION      // union
    };
}
Hive primitive types are subdivided into multiple subtypes. PrimitiveObjectInspector extends ObjectInspector to represent the corresponding Hive primitive type more specifically
public interface PrimitiveObjectInspector extends ObjectInspector {
/**
* The primitive types supported by Hive.
*/
public static enum PrimitiveCategory {
VOID, BOOLEAN, BYTE, SHORT, INT, LONG, FLOAT, DOUBLE, STRING,
DATE, TIMESTAMP, BINARY, DECIMAL, VARCHAR, CHAR, INTERVAL_YEAR_MONTH, INTERVAL_DAY_TIME,
UNKNOWN
};
}
The PrimitiveCategory enum values are not explained one by one here
Example of checking the parameter type:
// 2. Check the argument type; a string is expected
if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
        || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
                ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument must be a Hive string
    // Throw an exception when the custom UDF arguments do not match expectations
    throw new UDFArgumentException("The first argument of the function must be a string");
Third: determine the return value type of the function
initialize() must return an ObjectInspector instance, which represents the return value type of the custom UDF. The return value of initialize() determines the return type of evaluate()
A comment in the ObjectInspector source code says that ObjectInspector instances should be obtained from the corresponding factory class, so that each one is a singleton and equals() and hashCode() behave as expected
/**
* An efficient implementation of ObjectInspector should rely on factory, so
* that we can make sure the same ObjectInspector only has one instance. That
* also makes sure hashCode() and equals() methods of java.lang.Object directly
* works for ObjectInspector as well.
*/
public interface ObjectInspector extends Cloneable { }
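The idea in that comment can be illustrated without Hive. The following is a minimal sketch of a caching factory, assuming one shared instance per type name; `InspectorFactoryDemo` and `TypeInspector` are hypothetical stand-ins, not Hive classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InspectorFactoryDemo {

    /** Hypothetical stand-in for an inspector; it does not override equals() or hashCode(). */
    static final class TypeInspector {
        private final String typeName;

        private TypeInspector(String typeName) {
            this.typeName = typeName;
        }

        String typeName() {
            return typeName;
        }
    }

    // One cached instance per type name.
    private static final Map<String, TypeInspector> CACHE = new ConcurrentHashMap<>();

    /** Returns the single shared inspector for the given type name. */
    static TypeInspector get(String typeName) {
        return CACHE.computeIfAbsent(typeName, TypeInspector::new);
    }

    public static void main(String[] args) {
        TypeInspector a = get("string");
        TypeInspector b = get("string");
        // Both lookups yield the same instance, so identity-based equals() is enough.
        System.out.println(a == b);
        System.out.println(a.equals(get("int")));
    }
}
```

Because callers never construct an inspector directly, two inspectors for the same type are always the same object, and the default java.lang.Object implementations of equals() and hashCode() give the right answer.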
For basic types (byte, short, int, long, float, double, boolean, string), the ObjectInspector can be obtained directly from the static fields of PrimitiveObjectInspectorFactory
Hive type | Writable type | Java wrapper types |
---|---|---|
tinyint | writableByteObjectInspector | javaByteObjectInspector |
smallint | writableShortObjectInspector | javaShortObjectInspector |
int | writableIntObjectInspector | javaIntObjectInspector |
bigint | writableLongObjectInspector | javaLongObjectInspector |
string | writableStringObjectInspector | javaStringObjectInspector |
boolean | writableBooleanObjectInspector | javaBooleanObjectInspector |
float | writableFloatObjectInspector | javaFloatObjectInspector |
double | writableDoubleObjectInspector | javaDoubleObjectInspector |
Note: a basic-type return value comes in two forms, Writable and Java wrapper:
- When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
- When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class
For complex types such as Array and Map<K, V>, the ObjectInspector can be obtained from the static methods of ObjectInspectorFactory
Hive type | ObjectInspectorFactory static method | evaluate() return value type |
---|---|---|
Array<T> | getStandardListObjectInspector(T t) | List<T> |
Map<K, V> | getStandardMapObjectInspector(K k, V v) | Map<K, V> |
Example where the return type is Map<String, int>:
// 3. The custom UDF's return type is Map<String, int>
return ObjectInspectorFactory.getStandardMapObjectInspector(
        PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
        PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
);
The complete initialize() function looks like this:
/**
 * Initialize the GenericUDF. Each GenericUDF instance calls this method only once.
 *
 * @param arguments
 *          ObjectInspector instances for the custom UDF's arguments
 * @throws UDFArgumentException
 *           Thrown if the number or type of arguments is incorrect
 * @return The function's return value type
 */
@Override
public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // 1. Check the number of arguments; exactly one is expected
    if (arguments.length != 1)
        // Throw an exception when the custom UDF arguments do not match expectations
        throw new UDFArgumentException("The function requires one argument");

    // 2. Check the argument type; a string is expected
    if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
            || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
                    ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument must be a Hive string
        // Throw an exception when the custom UDF arguments do not match expectations
        throw new UDFArgumentException("The first argument of the function must be a string");

    // 3. The custom UDF's return type is Map<String, int>
    return ObjectInspectorFactory.getStandardMapObjectInspector(
            PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
            PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
    );
}
evaluate()
The core method, which holds the implementation logic of the custom UDF
The code implementation steps can be divided into three parts:
- Receive the parameters
- Custom UDF core logic
- Return processing result
Step 1: Receive the parameters
The arguments of evaluate() are the arguments passed to the custom UDF
/**
* Evaluate the GenericUDF with the arguments.
*
* @param arguments
* The arguments as DeferedObject, use DeferedObject.get() to get the
* actual argument Object. The Objects can be inspected by the
* ObjectInspectors passed in the initialize call.
* @return The
*/
public abstract Object evaluate(DeferredObject[] arguments)
throws HiveException;
As the source code comment says, DeferredObject.get() returns the value of an argument
/**
* A Defered Object allows us to do lazy-evaluation and short-circuiting.
* GenericUDF use DeferedObject to pass arguments.
*/
public static interface DeferredObject {
void prepare(int version) throws HiveException;
Object get() throws HiveException;
};
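The lazy evaluation and short-circuiting mentioned in that comment can be sketched with the JDK alone. In the following minimal sketch, `java.util.function.Supplier` is only a stand-in for DeferredObject, and `firstNonNull` and `counted` are hypothetical helpers, not Hive API; the shape is that of a COALESCE-style function.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class DeferredArgumentDemo {

    /** Returns the first non-null argument; later suppliers are never invoked once one is found. */
    @SafeVarargs
    static <T> T firstNonNull(Supplier<T>... deferredArgs) {
        for (Supplier<T> arg : deferredArgs) {
            T value = arg.get(); // the argument is computed only here
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /** Wraps a value in a supplier that counts how many times it is evaluated. */
    static <T> Supplier<T> counted(T value, AtomicInteger evaluations) {
        return () -> {
            evaluations.incrementAndGet();
            return value;
        };
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        String result = firstNonNull(
                DeferredArgumentDemo.<String>counted(null, evaluations),
                DeferredArgumentDemo.<String>counted("b", evaluations),
                DeferredArgumentDemo.<String>counted("c", evaluations));
        // prints "b after 2 evaluations": the third argument is never computed
        System.out.println(result + " after " + evaluations.get() + " evaluations");
    }
}
```

Passing arguments as deferred objects means a function only pays for the arguments it actually reads.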
DeferredObject.get() returns an Object, which has to be cast to the type that was actually passed in
For Hive basic types, the Writable type is passed in
Hive type | Java type |
---|---|
tinyint | ByteWritable |
smallint | ShortWritable |
int | IntWritable |
bigint | LongWritable |
string | Text |
boolean | BooleanWritable |
float | FloatWritable |
double | DoubleWritable |
Array | ArrayList |
Map<K, V> | HashMap<K, V> |
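Example of receiving parameters:
// There is only one argument: Map<String, int>
// 1. If the argument is null, return (the value is wrapped, so get() is checked as well)
if (arguments[0] == null || arguments[0].get() == null) return...
// 2. Receive the argument
Map<Text, IntWritable> map = (Map<Text, IntWritable>) arguments[0].get();
Step 2: Custom UDF core logic
Once you have the parameters, you are free to implement whatever logic you need here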
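As an illustration of this step, the complete example at the end of this article counts how often each character occurs in a string. The following is a minimal JDK-only sketch of that core logic, pulled out into a plain static method; the `CharCountDemo` class and `countChars` method are hypothetical names, and a LinkedHashMap is used here only to keep the output order predictable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CharCountDemo {

    /** Counts how often each character occurs in the input string. */
    static Map<String, Integer> countChars(String str) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (char ch : str.toCharArray()) {
            // Start at 1 for a new character, otherwise add 1 to the existing count.
            counts.merge(String.valueOf(ch), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countChars("hello")); // prints {h=1, e=1, l=2, o=1}
    }
}
```

Keeping the core logic in a method that takes and returns plain Java types makes it easy to unit test without a Hive runtime.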
Step 3: Return the processing result
This step must match the return value of initialize()
A basic-type return value comes in two forms, Writable and Java wrapper:
- When initialize() specifies a Writable return type, evaluate() should return an instance of the corresponding Writable class
- When initialize() specifies a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class
For Hive arrays and maps, return values of the following types:
Hive type | Java type |
---|---|
Array<T> | List<T> |
Map<K, V> | Map<K, V> |
getDisplayString()
getDisplayString() returns the text shown for the function in EXPLAIN output
/**
* Get the String to be displayed in explain.
*/
public abstract String getDisplayString(String[] children);
Note: do not return null here. Otherwise a NullPointerException may be thrown at runtime, and the cause is hard to track down
ERROR [b1c82c24-bfea-4580-9a0c-ff47d7ef4dbe main] ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
    at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
    ...
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
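One way to avoid the null return is to always build the string from the function name and the child expressions. The following is a minimal sketch of such a helper, assuming a `name(child1, child2)` layout; the `DisplayStringDemo` class, the `displayString` method, and the `char_count` function name are hypothetical.

```java
class DisplayStringDemo {

    /** Builds "name(child1, child2, ...)" and never returns null. */
    static String displayString(String name, String[] children) {
        if (children == null) {
            return name + "()";
        }
        return name + "(" + String.join(", ", children) + ")";
    }

    public static void main(String[] args) {
        // prints char_count(col)
        System.out.println(displayString("char_count", new String[] {"col"}));
    }
}
```

Inside a GenericUDF, getDisplayString(String[] children) could delegate to a helper like this and pass its own function name.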
close()
Callback for closing resources
It is not an abstract method, so implementing it is optional
/**
* Close GenericUDF.
* This is only called in runtime of MapRedTask.
*/
@Override
public void close() throws IOException { }
Custom GenericUDF complete example
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class SimpleGenericUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // 1. Check the number of arguments; exactly one is expected
        if (arguments.length != 1)
            // Throw an exception when the custom UDF arguments do not match expectations
            throw new UDFArgumentException("The function requires one argument");

        // 2. Check the argument type; a string is expected
        if (arguments[0].getCategory() != ObjectInspector.Category.PRIMITIVE // the argument must be a Hive primitive type
                || !PrimitiveObjectInspector.PrimitiveCategory.STRING.equals(
                        ((PrimitiveObjectInspector) arguments[0]).getPrimitiveCategory())) // the argument must be a Hive string
            // Throw an exception when the custom UDF arguments do not match expectations
            throw new UDFArgumentException("The first argument of the function must be a string");

        // 3. The custom UDF's return type is Map<String, int>
        return ObjectInspectorFactory.getStandardMapObjectInspector(
                PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
                PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an int
        );
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        // 1. Receive the argument; a null input yields an empty map
        //    (the value is wrapped, so get() is checked as well)
        if (arguments[0] == null || arguments[0].get() == null)
            return new HashMap<>();
        String str = ((Text) arguments[0].get()).toString();

        // 2. Custom UDF core logic:
        //    count the number of occurrences of each character in the string and record them in a Map
        Map<String, Integer> map = new HashMap<>();
        for (char ch : str.toCharArray()) {
            String key = String.valueOf(ch);
            Integer count = map.getOrDefault(key, 0);
            map.put(key, count + 1);
        }

        // 3. Return the processing result
        return map;
    }

    @Override
    public String getDisplayString(String[] children) {
        return "This is a simple test for custom UDF~";
    }
}