JStorm: Concept and programming model

1. Cluster architecture

JStorm is a typical scheduling system from the perspective of design. The architecture of a simple cluster is shown in the figure below. Nimbus can add a standby node, and multiple Supervisor nodes form a task execution cluster.

Simple cluster diagram

1.1, Nimbus

Nimbus acts as a scheduler for the whole cluster, distributing topology code, assigning tasks, and monitoring cluster running status. Nimbus interacts with supervisor through ZK. It can run on the same physical machine as the Supervisor. In JStorm, Nimbus can use master/slave backup and support hot.

1.2, the Supervisor

A Supervisor is the executor of a task in a cluster, responsible for running specific tasks and shutting them down. It listens to nimbus instructions from the ZK, receives and distributes code and tasks, executes them, and monitors the feedback task execution.

1.3, the Zookeeper

ZK is the coordinator of the entire system. The task scheduling of Nimbus is delivered to Supervisor through ZK.

2. Topology programming model

Topology is an abstract representation of a task that can be run in JStorm. In JStorm’s Topology, there are two types of components: SPout and Bolt. Here is a classic Topology diagram. Each topology can have either multiple SpOuts that receive messages from multiple data sources at the same time, or multiple Bolts that execute different business logic. A topology runs until you kill it manually, and JStorm automatically reassigns tasks that failed to execute. In JStorm there is an abstraction for a stream, which is an unbroken continuous tuple with no boundaries. Note that JStorm abstracts the events in the stream into tuples, or tuples, when modeling the event stream. We can think of spout as one faucet after another, and the water flowing from each faucet is a different tuple. We turn on the faucet when we want to get the water tuple, and then use pipes to guide the water tuple from the faucet to a water processor (bolt). The bolt is then piped to another processor or stored in a container.

The Topology map

JStorm abstracts the figure above as Topology. The Topology is
Directed acyclicTopology is an abstraction at the highest level in Jstorm, which can be submitted to a Jstorm cluster for execution. A topology is a data flow transformation graph, where each node is a Spout or bolt, and the edges indicate which streams bolt subscribes to. When spout or Bolt sends tuples to streams, It sends a tuple to every bolt that subscribes to the stream.

2.1, spout

JStorm believes that every stream has a stream source, the source of the original tuple, so it abstracts this source as spout, which may be connected to messaging middleware (e.g., MetaQ, Kafka, TBNotify, etc.) and constantly sends messages. It can also be read from a queue over and over again and assembled as a tuple. The JStorm framework defines one main method for spout components: nextTuple, which, as the name implies, gets the next message. When executed, it can be understood that the JStorm framework is constantly tuning this interface to pull data from the data source and send data to Bolt. A Tuple is a basic unit of message passing. Each field in a Tuple has a name and the corresponding field type of each Tuple must be the same. The tuple field types can be INTEGER, long, short, byte, string, double, float, Boolean, and byte Array. You can also customize types as long as you implement the corresponding serializer. Interfaces related to SPout in JStorm are ISpout, IRichSpout and IBatchSpout. The latter two interfaces implement the upper-layer encapsulation of ISpout interfaces.

The main method of ISpout interface is: open: it is called when the ISpout is initialized in the worker. It is generally used to set some properties, such as obtaining the corresponding Bean from the Spring container. Close: Corresponds to open (called when to close). Activate: activates when the inactive state changes to active. Deactivate: corresponds to activate (called when the active state changes to inactive). NextTuple: JStorm expects that every time this method is called, it will emit a tuple via Collector.emit. Ack: Jstorm calls this method when it finds that the tuple corresponding to the msgId has been successfully consumed in its entirety. Fail: Corresponding to ACK (jStorm found that a tuple failed in a segment). Work with ack to ensure that tuples are processed.

2.2, bolt

JStorm abstracts the intermediate processing of a tuple as a Bolt. A Bolt can consume any number of input streams as long as the flow direction is directed to that Bolt, and it can also send new streams to other Bolts. Just open a specific SPout and direct the tuple flowing out of the SPout to a specific bolt, which then processes the incoming stream and directs it to another bolt or destination. Bolt stands for processing logic. After receiving the message, Bolt processes the message (that is, execute the user’s business logic). After processing the message, bolt can send the processed message to the downstream BOLT, which forms a processing pipeline (but the more complex case should be a directed graph). Or you can just end it. Bolt component main method: Execute, this interface is used by users to process business logic. Usually, the last bolt in an assembly line stores data, such as writing the data calculated in real time into DB or HBase for foreground services to query and display. Bolts can launch multiple message flow, use OutputFieldsDeclarer declareStream definition stream, using OutputCollector, emit to choose to launch the stream. In scenarios where messages are not lost, the bolts must call the OutputCollector ack method for each tuple it processes, to notify JStorm that the tuple has been processed, and to notify the transmitter of the tuple of spouts. The bolts will handle an input tuple, fire zero or more tuples, and call ACK to tell JStorm that it has processed the tuple. JStorm provides an IBasicBolt that will automatically call ack. Bolt related interfaces in JStorm are mainly IBolt, IRichBolt, IBasicBolt and IBatchBolt. The latter interfaces implement the upper-layer encapsulation of IBolt interfaces.

The main method of the IBolt interface is as follows: prepare: called when the IBolt is initialized in the worker. It is usually used to set some properties, such as obtaining the corresponding Bean from the Spring container. Cleanup: corresponding to prepare (called when the topology is disabled) execute: handles the tuple sent by JStorm.

2.3, the Tuple

JStorm abstracts the data in the stream as a tuple. A tuple is a value list, and each value in the list has a name. Tuples can be composed of any type because Storm is distributed. So it needs to know how to serialize and deserialize data between tasks. Storm uses Kryo, a fast and flexible sequencer in Java development, for serialization. By default, Storm can serialize basic types such as string, byte, array, ArrayList, HashMap, HashSet, and Clojure collection types, and custom sequencers are required if other types are needed. Each node in the topology specifies the name of the field of the tuple it emits, and the other nodes simply subscribe to this name to receive processing. In spout and Bolt components, use the declareOutputFields method to define the field name of the emitted tuple.

3, summary

This paper mainly describes the conceptual knowledge of cluster architecture and Topology programming model in JStorm, and will write more in-depth articles on practice, operation and maintenance, principle and other aspects in the future.