Abstract:

With the release of MaxCompute (ODPS) 2.0, the new unstructured data processing framework comes with a series of introductory articles, including:

1. How to access OSS data in MaxCompute, which introduces the basics of accessing OSS data from MaxCompute.

2. Best practices for unstructured data processing in MaxCompute, which summarizes best practices based on the implementation principles of the unstructured framework.

3. Accessing TableStore (OTS) data from MaxCompute, which introduces how to access KV (TableStore/OTS) data through the unstructured framework.

4. Outputting data from MaxCompute to OSS through the unstructured framework.

5. How to process open source data stored on OSS in MaxCompute, which introduces how to process common open source data formats stored on OSS (ORC, Parquet, Avro, etc.) through the unstructured framework.

This article is the second in this series.

Preface

With the introduction of the MaxCompute (formerly ODPS) unstructured data processing framework, we can connect MaxCompute with OSS data directly in SQL, allowing all kinds of data such as video, image, audio, gene, and meteorological data to be integrated seamlessly with traditional structured data on the MaxCompute platform. We previously provided an overview of how the MaxCompute unstructured framework works with OSS data. After the basic functionality was released, we received many questions from users about optimization and how best to use the unstructured functionality. This article summarizes best practices for processing OSS data in MaxCompute, based on the underlying implementation principles of the unstructured framework and some usage scenarios we have seen.

1. Storage of data on OSS

1.1 Selection of OSS LOCATION

MaxCompute specifies the OSS data address to process through the LOCATION clause of an EXTERNAL TABLE. This article assumes that readers already understand the basic concepts of the unstructured framework, including EXTERNAL TABLE and StorageHandler, which are not detailed here; if you have any questions, please refer to the earlier introduction of the basic functionality. The LOCATION points to a directory on OSS (or, more precisely, an address ending with a '/'), in standard URI format:

LOCATION 'oss://${endpoint}/${bucket}/${userPath}/'

For data-security-sensitive scenarios, such as multi-user scenarios or the public cloud, this is the recommended approach: the AK (AccessKey) does not appear in the LOCATION, and authorization is instead granted in advance through the STS/RAM system (see the basic functionality introduction).

The LOCATION selection has several points to note:

  • It is not allowed to use the OSS root bucket directory as LOCATION; that is, ${userPath} must not be empty. This requirement stems from OSS restrictions on what can be stored directly under the root bucket.
  • LOCATION cannot point to a single file; that is, a LOCATION like oss://oss-cn-hangzhou.aliyuncs.com/mybucket/directory/data.csv is not valid. If you have only one file to process, provide the parent directory of that file instead (see the examples below).
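
For illustration (using a hypothetical bucket named mybucket in the oss-cn-hangzhou region), the first LOCATION below is valid, while the other two are not:

LOCATION 'oss://oss-cn-hangzhou.aliyuncs.com/mybucket/directory/'           -- valid: a directory ending with '/'
LOCATION 'oss://oss-cn-hangzhou.aliyuncs.com/mybucket/'                     -- invalid: bucket root, ${userPath} is empty
LOCATION 'oss://oss-cn-hangzhou.aliyuncs.com/mybucket/directory/data.csv'   -- invalid: points to a single file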

1.2 Storage and processing of data files: small files and large files

In a distributed computing system, file size strongly affects the efficiency and performance of the whole system. In this section, we introduce the OSS file storage mechanisms relevant to MaxCompute unstructured data processing, analyze several typical scenarios (e.g., small files and large files), and summarize some suggestions for storing files on OSS in MaxCompute scenarios.

  • Small files: Small files usually come in large numbers, which presents two problems for a distributed computing system:

    1. A large number of files causes a large overhead during file sharding, making planning and sharding time-consuming. For example, for an OSS LOCATION containing 1 million files, the planning phase alone may take several minutes or more.
    2. Opening each OSS file has an overhead, and fragmented small files bring additional read overhead. For example, reading 1000 10KB files from OSS may take more than 10 times as long as reading a single 10MB file. Accessing a large number of small files also brings more network overhead to the entire distributed system, reducing the effective I/O throughput.

    So it is generally not recommended to keep too many files in one OSS directory. Instead, consider making the EXTERNAL TABLE a partitioned table and processing data at partition granularity. In addition, tar files can be used where applicable, for example packing multiple image files into a tar file and saving it to OSS. For text files, the MaxCompute built-in StorageHandlers (such as com.aliyun.odps.CsvStorageHandler or com.aliyun.odps.TsvStorageHandler) automatically read data from tar files. If you define your own StorageHandler/Extractor, you can also handle tar archives in user code, for example by reading them directly with Apache Commons Compress's TarArchiveInputStream (see the sketch after this list).

  • Large files: At the opposite extreme from small files are super-large files. The essence of a distributed system is divide and conquer: shard the data and speed up processing of massive amounts of data by handling multiple shards concurrently. In the extreme case, if a large amount of data sits in a single file that cannot be split for processing, the concurrency degenerates to 1 and the "distributed system" loses its meaning. Even in less extreme cases, multiple large files (say, tens of gigabytes each) are unfriendly to a distributed system: processing a large file may require a large amount of system resources on a single node, making scheduling difficult, and is also prone to long tails and costly re-runs on failure. Therefore, it is not recommended to store data in very large files on OSS for MaxCompute processing.
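
As a rough illustration of the tar approach mentioned above, the following is a minimal sketch of iterating over the entries of a tar archive with Apache Commons Compress. The processEntry method and the way the input stream is obtained are placeholders; in a real custom Extractor the stream would be handed over by the unstructured framework rather than opened from a local file.

import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public class TarReaderSketch {

    // Iterate over all regular files packed in a tar archive and hand each
    // one to processEntry(); assumes each individual entry fits in memory.
    public static void readTar(InputStream rawStream) throws IOException {
        try (TarArchiveInputStream tarIn = new TarArchiveInputStream(rawStream)) {
            TarArchiveEntry entry;
            while ((entry = tarIn.getNextTarEntry()) != null) {
                if (!entry.isFile()) {
                    continue; // skip directories and special entries
                }
                byte[] content = new byte[(int) entry.getSize()];
                int offset = 0;
                int read;
                // The tar stream is positioned at the current entry's data.
                while (offset < content.length
                        && (read = tarIn.read(content, offset, content.length - offset)) != -1) {
                    offset += read;
                }
                processEntry(entry.getName(), content);
            }
        }
    }

    // Placeholder for per-file processing logic (e.g. decoding one image).
    private static void processEntry(String name, byte[] content) {
        // application-specific processing goes here
    }
}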

To summarize, as an overall guideline, the MaxCompute unstructured framework recommends the following ideal OSS data storage layout:

  1. Data files can be organized into folders based on application characteristics. It is not recommended to store more than 100,000 files in a single folder. Consider using tar to package multiple files as a way to reduce the number of physical files.

  2. Files of moderate size and evenly distributed data make more reasonable use of system resources and thus improve the efficiency of distributed processing. A single file size of 1MB-2GB is ideal for the MaxCompute unstructured framework.

1.3 Network connectivity and access speed between MaxCompute and OSS

MaxCompute and OSS are independent distributed computing and storage services, and network connectivity between their respective deployment clusters may affect whether MaxCompute can access OSS data. Overall, network connectivity follows the principle of network isolation between different clusters; specifically, there are a few points to note:

  1. Compute jobs in MaxCompute public clusters access OSS as an external service over the network. It is recommended that the OSS cluster being accessed be as close to the MaxCompute cluster as possible. For details about the access domain names and the corresponding data centers of OSS on the public cloud, see the OSS documentation.

When MaxCompute accesses OSS with high concurrency, one important thing to note is that OSS has a traffic throttling mechanism: by default, the traffic of an OSS account is limited to 5Gb/s (about 600MB/s). With the high concurrency of MaxCompute (e.g. 1000+ compute nodes), the OSS data download speed may be limited not by the network bandwidth of a single node but by OSS's overall traffic throttling: 600MB/s spread across 1000 concurrent nodes leaves only about 0.6MB/s per node, so it is entirely possible for the download speed of a single compute node to fall below 1MB/s. OSS throttling can be specially configured, however; if you have large-scale data computing requirements, you can contact the OSS team to raise the throttling limit for the corresponding account.

2. Processing of input data in user-defined StorageHandler/Extractor

In addition to providing several built-in StorageHandlers for CSV, TSV, and Apache ORC files, MaxCompute also provides an unstructured Java SDK for users to implement their own data parsing and processing. This opens up the unstructured data processing ecosystem and enables processing of video, image, audio, gene, weather and other kinds of data. In simple terms, MaxCompute encapsulates the details of the distributed system and hands input data to user code as an enhanced subclass of Java's InputStream. Compared with the layered encapsulation of Hive's SerDe, RowFormatter and so on, this interface design provides a more natural entry point for truly unstructured data: users obtain the raw data stream and can process it with logic similar to a stand-alone application. Of course, there are still some best practices that users are recommended to follow in a distributed system.
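
To make the shape of the user code concrete, here is a minimal skeleton of a custom Extractor. The class and package names (Extractor, InputStreamSet, DataAttributes, etc.) are based on the MaxCompute unstructured Java SDK examples referenced in the basic-functionality article; treat them as assumptions to be checked against the SDK version you actually use, and note that the parsing logic is left as a placeholder.

import java.io.IOException;
import java.io.InputStream;
import com.aliyun.odps.data.Record;
import com.aliyun.odps.io.InputStreamSet;
import com.aliyun.odps.udf.DataAttributes;
import com.aliyun.odps.udf.ExecutionContext;
import com.aliyun.odps.udf.Extractor;

// Sketch only: names and signatures follow the SDK examples and may differ
// slightly between SDK versions.
public class MyDataExtractor extends Extractor {

    private InputStreamSet inputs;
    private DataAttributes attributes;

    @Override
    public void setup(ExecutionContext ctx, InputStreamSet inputs, DataAttributes attributes) {
        // The framework hands us a set of input streams (one per OSS file
        // assigned to this worker) instead of local file paths.
        this.inputs = inputs;
        this.attributes = attributes;
    }

    @Override
    public Record extract() throws IOException {
        // Pull the next stream, decode it with single-machine style logic and
        // return records one by one; return null when all inputs are consumed.
        InputStream stream = inputs.next();
        if (stream == null) {
            return null;
        }
        // ... parse 'stream' here and fill a Record ...
        return null; // placeholder
    }

    @Override
    public void close() {
        // release any resources held by the extractor
    }
}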

2.1 Processing mode of input data stream

For input data streams (InputStream), it is recommended to fetch the bytes and process them directly in memory. Ideally, the input data can be processed in a streaming "read as you compute" fashion. Of course, for some data formats this is hard to do completely: for example, some image/audio formats require the whole file to be read before the correct encoding information can be obtained. In such cases, if the file itself is not very large, it can be read entirely into memory and processed there (a sketch of the in-memory streaming style is shown after the list below). A less efficient way is to download the data file to local disk and then read the local file with a FileStream for processing. This processing mode has several problems:

  1. In a distributed system, writing files (especially large files) to local disk is not recommended, both for resource isolation and to protect the health of compute nodes. On the MaxCompute computing system, user Java code must apply for extra permissions to read and write local files directly, or the isolation option must be turned on (which degrades overall performance).
  2. Writing data to local disk and then reading it back incurs additional performance loss.
  3. For large data (for example, files of 10GB or larger), the disk space of the compute node cannot be guaranteed and the disk may be filled up.
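
As a minimal sketch of the recommended approach, the following reads an input stream in fixed-size chunks and processes the bytes in memory, without ever touching local disk. The processChunk method is a placeholder for application logic.

import java.io.IOException;
import java.io.InputStream;

public class StreamingReadSketch {

    // Process the input in a streaming fashion: read a chunk, handle it,
    // then move on, so memory usage stays bounded and nothing is written
    // to the local disk of the compute node.
    public static void process(InputStream in) throws IOException {
        byte[] buffer = new byte[4 * 1024 * 1024]; // 4MB read buffer
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            processChunk(buffer, bytesRead);
        }
    }

    // Placeholder: decode/aggregate the bytes of one chunk here.
    private static void processChunk(byte[] chunk, int length) {
        // application-specific logic
    }
}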

2.2 Use of third-party libraries

In unstructured data processing pipelines, a common requirement is to migrate an existing single-machine data processing workflow onto a distributed system through the MaxCompute unstructured data framework. For example, you may want to read video data directly with FFmpeg, or handle weather data in NetCDF/GRIB format through NetCDF-Java. These third-party libraries tend to share some characteristics and limitations, such as:

  • They are often based on C/C++, so native code must be run through JNI
  • They are often single-machine implementations, so the entry point for data is usually a local file path

In each of these cases, the unstructured framework has ways to support it, such as allowing JNI with isolation turned on, or allowing data to be downloaded to local temporary files subject to permission approval. In the long run, the MaxCompute framework recognizes that using native C/C++ code libraries to process specific data formats is inevitable, and the framework's isolation mechanisms are used to address the security side of this. However, downloading data to local files and then processing them still carries a large extra cost, so it is still recommended to process the input data stream directly whenever possible, for example by changing a NetCDF-Java based implementation to consume an InputStream instead of going through FilePath -> FileStream.
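
For instance, instead of downloading a .nc file to local disk and opening it by path, one option (depending on the NetCDF-Java version; openInMemory has been available in the 4.x line) is to read the bytes from the input stream and let the library parse them in memory. A rough sketch, assuming the individual file is small enough to fit in memory:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import ucar.nc2.NetcdfFile;

public class NetcdfInMemorySketch {

    // Read the whole object from the input stream into memory and let
    // NetCDF-Java parse it from the byte array, avoiding any local file.
    public static NetcdfFile open(String name, InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024 * 1024];
        int read;
        while ((read = in.read(buffer)) != -1) {
            bos.write(buffer, 0, read);
        }
        return NetcdfFile.openInMemory(name, bos.toByteArray());
    }
}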

3. Conclusion

The MaxCompute unstructured framework is a new feature introduced with MaxCompute 2.0. In addition to processing unstructured data on OSS, it has recently also opened up a data channel to TableStore (OTS). The framework itself is still evolving, including closer integration with the MaxCompute optimizer and the overall UDF framework, as well as further extensions. Here we have summarized some best practices for dealing with unstructured data, based on the implementation of the existing system and the feedback we have received, and we hope to get more feedback to make the framework better. In the future we will provide more specific usage examples for concrete scenarios, such as offline video and image processing for the City Brain.
