Hive SerDe

SerDe stands for Serializer/Deserializer. Serialization is the process of converting an object into a sequence of bytes. Deserialization is the process of converting a sequence of bytes back into an object.

Object serialization has two main uses:

  • Object persistence, that is, converting an object into a sequence of bytes and saving it to a file.
  • Network transfer of object data.

Hive uses the SerDe interface to perform I/O operations, that is, to read and write data. Hive does not store data itself; it operates on files stored in HDFS, so serialization and deserialization are required whenever data is read from or written to HDFS.

Hive SerDe handles serialization and deserialization, and sits between the data store and the execution engine to decouple them. The package org.apache.hadoop.hive.serde has been deprecated; the current implementation lives in org.apache.hadoop.hive.serde2. A SerDe allows Hive to read data from a table and write it back to HDFS in any custom format, and anyone can write their own SerDe for their own data format.

Serialization converts a row of data into Hadoop's Writable format for output; deserialization reads data from HDFS into memory and turns it into a Row object.

Hive read and write process

SerDe is Hive's serialization and deserialization component, used to read and write data.

Read:  HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object

Write: Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files

Note that the key is ignored when reading and is a constant when writing; normally, the Row object is stored in the value.

Hive does not store data itself, so it does not enforce a data format. Users can use any tool to read the HDFS files behind Hive tables directly. They can also write files to HDFS directly and use that data, either by creating an EXTERNAL TABLE over it (CREATE EXTERNAL TABLE) or by using LOAD DATA INPATH, which moves the data into the Hive table's folder.
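As a rough sketch of both approaches (the paths and table names here are invented for illustration):

-- Point an external table at data that already exists in HDFS
CREATE EXTERNAL TABLE ext_logs (line STRING)
LOCATION '/data/raw/logs';

-- Or move a staged file into the folder of an existing table
LOAD DATA INPATH '/data/staging/logs.txt' INTO TABLE managed_logs;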

Using SerDe

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path]

In the CREATE TABLE statement, use the ROW FORMAT clause to specify the SerDe type.
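For example, a minimal sketch contrasting the two common forms of ROW FORMAT (table and column names are illustrative):

-- Delimited text, handled by the default LazySimpleSerDe
CREATE TABLE t_csv (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- An explicit SerDe class, configured through SERDEPROPERTIES
CREATE TABLE t_lines (line STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*)");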

The Hive SerDe

Hive built-in SerDe types

  1. TextInputFormat/HiveIgnoreKeyTextOutputFormat: these two classes are used to read and write plain text files

  2. SequenceFileInputFormat/SequenceFileOutputFormat: these two classes are used to read and write Hadoop SequenceFile files

  3. MetadataTypedColumnsetSerDe: this class is used to read and write files split by a specific delimiter, such as CSV files or files separated by tabs or Ctrl-A (quoting is not supported yet)

  4. LazySimpleSerDe

    This is the default SerDe type. If you read the same data formats as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol, you can use this SerDe. It creates objects lazily and therefore has better performance. Since Hive 0.14.0 it supports specifying a character encoding when reading and writing data. For example:

    ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');

    If the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later), LazySimpleSerDe treats 't', 'T', 'f', 'F', '1', and '0' as valid boolean literals. This configuration defaults to false, so only 'true' and 'false' are treated as valid boolean literals.
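    A sketch of turning this on for the current session, using the property named above:

    SET hive.lazysimple.extended_boolean_literal=true;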

  5. Thrift SerDe: reads and writes Thrift-serialized objects. Note that for Thrift objects, the class file must be loaded first.

Other SerDe

  • JsonSerDe can read JSON files (0.12.0)
  • Avro SerDe can read and write Avro files (STORED AS AVRO was added in version 0.14.0)
  • ORC SerDe can read and write ORC files (0.11.0)
  • Parquet SerDe can read and write Parquet files (0.13.0)
  • CSV SerDe can read CSV files (0.14.0)
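Most of these built-ins can be used through the STORED AS shortcuts instead of spelling out the SerDe class. A minimal sketch (table names are illustrative):

CREATE TABLE t_orc     (id INT, name STRING) STORED AS ORC;      -- Hive 0.11.0+
CREATE TABLE t_parquet (id INT, name STRING) STORED AS PARQUET;  -- Hive 0.13.0+
CREATE TABLE t_avro    (id INT, name STRING) STORED AS AVRO;     -- Hive 0.14.0+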

Custom SerDe

In fact, most of the time we only want to write our own Deserializer rather than a full SerDe, because we only need to read data in a specific format, not write data in that format. RegexDeserializer is such a Deserializer: it has no Serializer and deserializes each row according to the regex configured in its parameters.

RegexDeserializer

Suppose we have the following Apache access log:

192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /asf-logo.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-button.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-middle.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET / HTTP/1.1" 200 11217
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.css HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /tomcat.png HTTP/1.1" 304 -
192.168.57.4 - - [29/Feb/2016:18:14:36 +0800] "GET /bg-nav.png HTTP/1.1" 304 -

Table creation statement

CREATE TABLE ods_regex_log (
    host STRING,
    identity STRING,
    t_user STRING,
    `time` STRING,
    request STRING,
    referer STRING,
    agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" ="([^]*) ([^]*) ([^]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE
;
load data local inpath '/Users/liuwenqiang/workspace/hive/regexseserializer.txt' overwrite into table ods_regex_log;
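With the table created and the file loaded, a quick sanity check might look like this (a sketch; the rows returned depend on the log file actually loaded):

SELECT host, `time`, request FROM ods_regex_log LIMIT 3;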

Custom implementation

We have already introduced many SerDes, so the best way to write a custom implementation is to imitate existing ones: clone the Hive source and see how others have written theirs. Hive SerDe lives in a separate module of the source tree, hive-serde.

As mentioned earlier, org.apache.hadoop.hive.serde has been deprecated in favor of org.apache.hadoop.hive.serde2; the old serde package now contains only a constants class.

Next let's look at the implementation of JsonSerDe, which is defined as a class that extends AbstractSerDe.

So let's take a look at the AbstractSerDe abstract class and its comments:

public abstract class AbstractSerDe implements Deserializer, Serializer {

  protected String configErrors;

  /**
   * Initialize the SerDe. By default, this will use one set of properties, either the
   * table properties or the partition properties. If a SerDe needs access to both sets,
   * it should override this method. Eventually, once all SerDes have implemented this
   * method, we should convert it to an abstract method.
   */
  public void initialize(Configuration configuration, Properties tableProperties, Properties partitionProperties) throws SerDeException {
    initialize(configuration,
               SerDeUtils.createOverlayedProperties(tableProperties, partitionProperties));
  }

  /**
   * Initialize the HiveSerializer.
   *
   * @param conf
   *          System properties. Can be null in compile time
   * @param tbl
   *          table properties
   * @throws SerDeException
   */
  @Deprecated
  public abstract void initialize(@Nullable Configuration conf, Properties tbl)
      throws SerDeException;

  /**
   * Returns the Writable class that would be returned by the serialize method.
   * This is used to initialize SequenceFile header.
   */
  public abstract Class<? extends Writable> getSerializedClass();

  /**
   * Serialize an object by navigating inside the Object with the
   * ObjectInspector. In most cases, the return value of this function will be
   * constant since the function will reuse the Writable object. If the client
   * wants to keep a copy of the Writable, the client needs to clone the
   * returned value.
   */
  public abstract Writable serialize(Object obj, ObjectInspector objInspector)
      throws SerDeException;

  /**
   * Returns statistics collected when serializing.
   *
   * @return A SerDeStats object or {@code null} if stats are not supported by
   *         this SerDe.
   */
  public SerDeStats getSerDeStats() {
    return null;
  }

  /**
   * Deserialize an object out of a Writable blob. In most cases, the return
   * value of this function will be constant since the function will reuse the
   * returned object. If the client wants to keep a copy of the object, the
   * client needs to clone the returned value by calling
   * ObjectInspectorUtils.getStandardObject().
   *
   * @param blob
   *          The Writable object containing a serialized object
   * @return A Java object representing the contents in the blob.
   */
  public abstract Object deserialize(Writable blob) throws SerDeException;

  /**
   * Get the object inspector that can be used to navigate through the internal
   * structure of the Object returned from deserialize(...).
   */
  public abstract ObjectInspector getObjectInspector() throws SerDeException;

  /**
   * Get the error messages during the Serde configuration
   *
   * @return The error messages in the configuration which are empty if no error occurred
   */
  public String getConfigurationErrors() {
    return configErrors == null ? "" : configErrors;
  }

  /**
   * @return Whether the SerDe that can store schema both inside and outside of metastore
   *         does, in fact, store it inside metastore, based on table parameters.
   */
  public boolean shouldStoreFieldsInMetastore(Map<String, String> tableParams) {
    return false; // The default, unless SerDe overrides it.
  }
}

It looks like we need to implement four methods: initialize, serialize, deserialize, and getObjectInspector. As we saw earlier, Hive's SerDe code lives in the hive-serde module, so to implement a custom SerDe you need to add this dependency:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-serde</artifactId>
    <version>3.1.0</version>
</dependency>

Let’s write a simple Serde to parse specific data formats

id=1,name="jack",age=20
id=2,name="john",age=30

Here is the code implementation; it lives in the same project where we wrote our UDFs earlier.

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import javax.annotation.Nullable;
import java.util.*;

/**
 * @description A custom SerDe extending AbstractSerDe: implements initialize, serialize, and deserialize
 */
public class KingcallSerde extends AbstractSerDe {

    private static final Logger logger = LoggerFactory.getLogger(KingcallSerde.class);

    // Used to store field names
    private List<String> columnNames;

    // Used to store field types
    private List<TypeInfo> columnTypes;
    private ObjectInspector objectInspector;

    // initialize is called before serialize and deserialize
    @Override
    public void initialize(Configuration configuration, Properties tableProperties, Properties partitionProperties) throws SerDeException {
        String columnNameString = tableProperties.getProperty(serdeConstants.LIST_COLUMNS);
        String columnTypeString = tableProperties.getProperty(serdeConstants.LIST_COLUMN_TYPES);
        columnNames = Arrays.asList(columnNameString.split(","));
        columnTypes = TypeInfoUtils.getTypeInfosFromTypeString(columnTypeString);

        List<ObjectInspector> columnOIs = new ArrayList<>();
        ObjectInspector oi;
        for(int i = 0; i < columnNames.size(); i++) {
            oi = TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(columnTypes.get(i));
            columnOIs.add(oi);
        }
        objectInspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnOIs);
    }

    // Override the method to call the above implementation directly
    @Override
    public void initialize(@Nullable Configuration configuration, Properties properties) throws SerDeException {
        this.initialize(configuration, properties, null);
    }

    @Override
    public Class<? extends Writable> getSerializedClass() {
        // serialize() returns a Text object, so report Text.class here
        return Text.class;
    }

    // o is an array holding a single row of data; objInspector describes its fields in order.
    // Format the row as a key=value,key1=value1 string and return it as a Writable.
    @Override
    public Writable serialize(Object o, ObjectInspector objInspector) throws SerDeException {
        Object[] arr = (Object[]) o;
        List<String> tt = new ArrayList<>();
        for (int i = 0; i < arr.length; i++) {
            tt.add(String.format("%s=%s", columnNames.get(i), arr[i].toString()));
        }
        return new Text(StringUtils.join(tt, ","));
    }

    @Override
    public SerDeStats getSerDeStats() {
        return null;
    }

    // Convert the Writable into a string holding one line, such as key=value,key1=value1.
    // Split it into a map, then fill the row object in field order.
    // Type handling is needed along the way; here we only handle string and int.
    @Override
    public Object deserialize(Writable writable) throws SerDeException {
        Text text = (Text) writable;
        Map<String, String> map = new HashMap<>();
        String[] cols = text.toString().split(",");
        for(String col: cols) {
            String[] item = col.split("=");
            map.put(item[0], item[1]);
        }
        ArrayList<Object> row = new ArrayList<>();
        Object obj = null;
        for(int i = 0; i < columnNames.size(); i++){
            TypeInfo typeInfo = columnTypes.get(i);
            PrimitiveTypeInfo pTypeInfo = (PrimitiveTypeInfo)typeInfo;
            if(typeInfo.getCategory() == ObjectInspector.Category.PRIMITIVE) {
                if(pTypeInfo.getPrimitiveCategory() == PrimitiveObjectInspector.PrimitiveCategory.STRING){
                    obj = StringUtils.defaultString(map.get(columnNames.get(i)));
                }
                if(pTypeInfo.getPrimitiveCategory() == PrimitiveObjectInspector.PrimitiveCategory.INT) {
                    obj = Integer.parseInt(map.get(columnNames.get(i)));
                }
            }
            row.add(obj);
        }
        return row;
    }

    @Override
    public ObjectInspector getObjectInspector() throws SerDeException {
        return objectInspector;
    }

    @Override
    public String getConfigurationErrors() {
        return super.getConfigurationErrors();
    }

    @Override
    public boolean shouldStoreFieldsInMetastore(Map<String, String> tableParams) {
        return super.shouldStoreFieldsInMetastore(tableParams);
    }
}

Using the custom SerDe

  1. Add the jar:

    add jar /Users/liuwenqiang/workspace/code/idea/HiveUDF/target/original-HiveUDF-0.0.4.jar;

  2. The ROW FORMAT clause when creating the table specifies the custom SerDe class:

    CREATE EXTERNAL TABLE `ods_test_serde`(
        `id` int,
        `name` string,
        `age` int
    )
    ROW FORMAT SERDE 'com.kingcall.bigdata.HiveSerde.KingcallSerde'
    STORED AS TEXTFILE;
  3. Load or insert data

    LOAD DATA LOCAL INPATH '/Users/liuwenqiang/workspace/hive/serde.txt' OVERWRITE INTO TABLE ods_test_serde;

  4. View the data

  5. Insert a row and view it

    insert into table ods_test_serde values(3, "test", 10);

  6. View the data on HDFS

In newer versions of Hadoop, part of a file's data can be viewed directly on the HDFS web page, without pulling it to the local machine or using the HDFS command line.
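For step 4, a simple query is enough to confirm that deserialize parses the loaded file and that, after the INSERT above, serialize wrote a key=value line the SerDe can read back:

SELECT * FROM ods_test_serde;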

ObjectInspector

Hive uses ObjectInspector to analyze the internal structure of row objects and the structure of individual columns.

Specifically, ObjectInspector provides a unified way to access complex objects. Objects may be stored in memory in a variety of formats:

  • An instance of a Java class (Thrift or native Java objects)
  • A standard Java object (for example, java.util.List for Struct and Array, and java.util.Map for Map)
  • A lazily initialized object

In addition, a complex object can be represented by the pair (ObjectInspector, Java object). This gives us a way to access the internal fields of an object without depending on information about its physical layout. For serialization purposes, Hive recommends creating a custom ObjectInspector for a custom SerDe, with two constructors: a no-argument constructor and a regular constructor.

Conclusion

  1. Hive itself does not store data; it interacts with data through SerDe, so SerDe can be regarded as a decoupling design between Hive and HDFS
  2. Hive provides a large number of built-in SerDes that cover most daily development needs, and when they don't, we can develop our own