What is StreamSets Data Collector?

StreamSets Data Collector is a lightweight, powerful design and execution engine that processes streaming data in real time. Use Data Collector to route and process the data in your data streams.

To define the flow of data, you design a pipeline in Data Collector. A pipeline consists of stages that represent the origin and destination of the pipeline, plus any additional processing you want to perform. After you design the pipeline, click Start and Data Collector goes to work. Data Collector processes data as it arrives at the origin and waits quietly when it is not needed. You can view real-time statistics about the data, inspect the data as it passes through the pipeline, or take a snapshot of the data.
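To make the idea concrete, here is a tiny Python sketch of the concept, not of Data Collector itself: records flow from an origin, through any number of processors, and into a destination. Every name and record in it is made up for illustration.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]

def run_pipeline(origin: Iterable[Record],
                 processors: List[Callable[[Record], Record]],
                 destination: Callable[[Record], None]) -> None:
    """Send each record from the origin through every processor, then to the destination."""
    for record in origin:
        for process in processors:
            record = process(record)
        destination(record)

# Example wiring: a two-record "origin", one pass-through processor, a print() "destination".
run_pipeline([{"text": "hello"}, {"text": "world"}], [lambda r: r], print)
```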

How do you use StreamSets Data Collector?

Use StreamSets Data Collector as a conduit for data streams. Throughout your enterprise data topology, there are streams of data that need to be moved, collected, and processed on the way to their destinations. Data Collector provides the critical connections between the stages of your data flow.

Depending on your data access needs, you can use a single Data Collector to run one or more pipelines, or install a series of Data Collectors to move data across your enterprise data topology.

How exactly does it work?

Let’s take a look…

After you install and start Data Collector, log in and create your first pipeline in the Data Collector UI.

What do you want it to do? Suppose you want to read XML files from a directory and remove line feeds before moving the data to HDFS. To do this, you start with a Directory origin stage and configure it to point to the source file directory. (You can also have the stage archive processed files and write files that could not be fully processed to a separate directory for review.)
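Conceptually, that origin does something like the following sketch: pick up files from a source directory, hand their contents to the pipeline, and archive what it has processed. This is plain Python for illustration, not a Data Collector configuration, and the paths and file pattern are placeholders.

```python
import shutil
from pathlib import Path

SOURCE_DIR = Path("/data/incoming")   # hypothetical source file directory
ARCHIVE_DIR = Path("/data/archive")   # hypothetical archive directory

def read_and_archive() -> list:
    """Read each XML file in the source directory, then move it to the archive."""
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    contents = []
    for xml_file in sorted(SOURCE_DIR.glob("*.xml")):
        contents.append(xml_file.read_text(encoding="utf-8"))
        shutil.move(str(xml_file), str(ARCHIVE_DIR / xml_file.name))  # archive the processed file
    return contents
```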

To remove the line feeds, connect the Directory origin to an Expression Evaluator processor and configure it to strip line feeds from the last field of the record.
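The transformation itself is simple. The sketch below shows the equivalent logic in Python; the field name /text and the sample expression in the comment are assumptions for illustration, so check the Expression Language documentation for the exact functions available in your Data Collector version.

```python
# In the Expression Evaluator you would configure something along these lines
# (expression syntax is an assumption; verify against your version's EL docs):
#   Output Field: /text
#   Expression:   ${str:replaceAll(record:value('/text'), '\\n', '')}

def remove_line_feeds(record: dict, field: str = "text") -> dict:
    """Strip line feed characters from one string field of a record."""
    value = record.get(field)
    if isinstance(value, str):
        record[field] = value.replace("\n", "")
    return record

print(remove_line_feeds({"text": "line one\nline two\n"}))
# -> {'text': 'line oneline two'}
```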

To write the data to HDFS, connect the Expression Evaluator to a Hadoop FS destination stage. You can configure the destination to write the data as JSON objects (other data formats are also available).
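Writing records as JSON objects simply means serializing each record into its own JSON document. The sketch below writes newline-delimited JSON to a local file rather than HDFS; the output path is a placeholder.

```python
import json

def write_json_records(records, path="output.json"):
    """Serialize each record as one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")

write_json_records([{"text": "line oneline two"}, {"text": "more data"}])
```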

Preview the data to see how the source data moves through the pipeline, and you notice that some fields have missing data. To handle this, add a Field Replacer processor to replace the null values in those fields.
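The Field Replacer's job in this step amounts to the following: wherever a listed field is null, substitute a default value. The field names and the default below are made-up examples.

```python
def replace_nulls(record: dict, fields: list, default: str = "unknown") -> dict:
    """Replace null (None) values in the listed fields with a default value."""
    for field in fields:
        if record.get(field) is None:
            record[field] = default
    return record

print(replace_nulls({"name": None, "city": "Paris"}, ["name", "city"]))
# -> {'name': 'unknown', 'city': 'Paris'}
```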

The data flow is now complete. You configure pipeline error record handling to write error records to a file, create a data drift alert to let you know when field names change, and set up an email alert to notify you when the pipeline generates more than 100 error records. Then you start the pipeline, and Data Collector goes to work.
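If you prefer scripting to clicking, Data Collector also exposes a REST API that can start a pipeline. The sketch below assumes a default installation: the port, credentials, endpoint path, and pipeline ID are assumptions, so verify them against the REST API documentation for your Data Collector version.

```python
import requests

SDC_URL = "http://localhost:18630"     # default Data Collector address (assumed)
PIPELINE_ID = "myfirstpipeline"        # hypothetical pipeline ID

response = requests.post(
    f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/start",
    auth=("admin", "admin"),           # default credentials (assumed); change in production
    headers={"X-Requested-By": "sdc"}, # Data Collector expects this header on POST requests
)
response.raise_for_status()
print(response.status_code, response.json())
```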

Data Collector enters monitor mode and immediately displays summary and error statistics. To take a closer look at the activity, you can take a snapshot of the pipeline to examine how a set of data passes through it. You notice some abnormal data in the pipeline, so you create a data rule on the link between two stages to gather information about similar records, and set up an alert to notify you when the count gets too high.
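A data rule boils down to a condition evaluated against the records moving across a link, a running count of matches, and an alert threshold. This toy version uses a made-up condition and threshold.

```python
def check_data_rule(records: list, threshold: int = 10) -> bool:
    """Count records matching a condition; alert when the count exceeds the threshold."""
    abnormal = sum(1 for r in records if not r.get("text"))  # made-up condition: empty text field
    if abnormal > threshold:
        print(f"ALERT: {abnormal} abnormal records exceeded the threshold of {threshold}")
        return True
    return False
```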

What about the error records written to file? They are saved along with the error details, so you can create an error pipeline to reprocess that data. Problem solved.

StreamSets Data Collector is a powerful tool, but we made it as simple as possible to use. So give it a try, and click the Help icon whenever you need more information.