1. Storm Core Concepts

1.1 Topologies

A complete Storm stream-processing application is called a Storm topology: a directed acyclic graph (DAG) of Spouts and Bolts connected by Streams. Once submitted to the cluster, Storm keeps the topology running, processing an unbounded Stream of data, until you kill it.
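
As a minimal sketch of how a topology is assembled and submitted (RandomWordSpout and WordCountBolt are hypothetical placeholder components):

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // The spout is the source of the stream; the bolt subscribes to it.
        builder.setSpout("word-spout", new RandomWordSpout(), 2);
        builder.setBolt("count-bolt", new WordCountBolt(), 4)
               .shuffleGrouping("word-spout");

        // The topology runs until explicitly killed (e.g. `storm kill word-count`).
        StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```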

1.2 Streams and Tuples

The Stream is the core abstraction in Storm. A Stream is an unbounded sequence of Tuples that is created and processed in parallel in a distributed fashion. A Tuple can carry data of most primitive types as well as custom types. Simply put, a Tuple is the actual carrier of a Stream's data, and a Stream is a sequence of Tuples.

1.3 Spouts

Spouts are the sources of Stream data. A Spout can emit data to more than one Stream. Spouts are generally classified as reliable or unreliable: a reliable Spout can resend a Tuple if its processing fails, while an unreliable Spout forgets each Tuple as soon as it has been emitted.
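
A minimal sketch of a reliable Spout against the Storm 2.x API, assuming a hypothetical in-memory pending map for replay (a real reliable Spout would typically read from a replayable source such as Kafka):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Tuples emitted but not yet acked, keyed by message id.
    private Map<Object, Values> pending;

    @Override
    public void open(Map<String, Object> conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        this.collector = collector;
        this.pending = new ConcurrentHashMap<>();
    }

    @Override
    public void nextTuple() {
        Values tuple = new Values("the quick brown fox");
        Object msgId = UUID.randomUUID();
        pending.put(msgId, tuple);
        // Emitting with a message id makes the spout "reliable":
        // Storm will call ack() or fail() for this id later.
        collector.emit(tuple, msgId);
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);                      // fully processed, forget it
    }

    @Override
    public void fail(Object msgId) {
        collector.emit(pending.get(msgId), msgId);  // resend on failure
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```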

1.4 Bolts

Bolts are the processing units of streaming data. A Bolt can consume data from one or more Streams and, after processing, emit the results into new Streams. Bolts can perform operations such as filtering, aggregations, and joins, and can interact with file systems or databases.
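
For example, a minimal filtering Bolt (a sketch; the "sentence" field matches the hypothetical Spout above). It extends BaseBasicBolt, which acks each input Tuple automatically:

```java
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class FilterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        // Filter: only pass tuples containing the keyword downstream.
        if (sentence.contains("fox")) {
            collector.emit(new Values(sentence));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Schema of the new stream this bolt emits.
        declarer.declare(new Fields("sentence"));
    }
}
```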

1.5 Stream Groupings

Spouts and Bolts, when run on a cluster, are each executed in parallel by multiple Tasks. So when a Tuple needs to travel from Bolt A to Bolt B, which of Bolt B's Tasks should receive it?

This is determined by the Stream Grouping. Storm has eight built-in Stream Groupings, described below; you can also define your own policy by implementing the CustomStreamGrouping interface. A sketch showing how each grouping is declared follows the list.

  1. Shuffle grouping

    Tuples are randomly distributed across the Bolt's Tasks, so each Task receives roughly the same number of Tuples.

  2. Fields grouping

    The Stream is partitioned by the field(s) specified in the grouping. For example, if the Stream is grouped by a user-id field, all Tuples with the same user-id are routed to the same Task.

  3. Partial Key grouping

    The Stream is partitioned by a specified field, as in Fields Grouping, but each key is load-balanced between two downstream Tasks, which gives better resource utilization when the input data is skewed.

  4. All grouping

    The Stream is replicated to every Task of the Bolt. Use this with caution, since each Tuple is processed by all Tasks.

  5. Global grouping

    The entire Stream is routed to a single one of the Bolt's Tasks, usually the Task with the lowest ID.

  6. None grouping

    Currently, None grouping is equivalent to Shuffle grouping.

  7. Direct grouping

    Direct Grouping can only be used on Streams that have been declared as direct streams. With this grouping, the producer of a Tuple decides exactly which Task of the consumer will receive it, using one of the emitDirect methods.

  8. Local or shuffle grouping

    If the target Bolt has one or more Tasks in the same Worker process as the producing Task, Tuples are shuffled only among those in-process Tasks, minimizing network transfer. Otherwise, it behaves like an ordinary Shuffle Grouping.
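
Here is a sketch of how each of the eight groupings is declared when wiring a topology, reusing the hypothetical SentenceSpout and FilterBolt from the sketches above:

```java
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class GroupingExamples {
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 2);

        builder.setBolt("b1", new FilterBolt(), 4).shuffleGrouping("spout");      // 1. Shuffle
        builder.setBolt("b2", new FilterBolt(), 4)
               .fieldsGrouping("spout", new Fields("sentence"));                  // 2. Fields
        builder.setBolt("b3", new FilterBolt(), 4)
               .partialKeyGrouping("spout", new Fields("sentence"));              // 3. Partial Key
        builder.setBolt("b4", new FilterBolt(), 4).allGrouping("spout");          // 4. All
        builder.setBolt("b5", new FilterBolt(), 4).globalGrouping("spout");       // 5. Global
        builder.setBolt("b6", new FilterBolt(), 4).noneGrouping("spout");         // 6. None
        // 7. Direct: the producer must declare a direct stream and emit via emitDirect.
        builder.setBolt("b7", new FilterBolt(), 4).directGrouping("spout");
        builder.setBolt("b8", new FilterBolt(), 4).localOrShuffleGrouping("spout"); // 8. Local or shuffle
        return builder.createTopology();
    }
}
```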

2. Storm Architecture

2.1 The Nimbus Process

Also known as the Master Node, Nimbus is the global coordinator of the Storm cluster. Its main functions are as follows:

  1. Listens on a Thrift interface for Topologies submitted by clients.
  2. Allocates tasks for the submitted Topology according to the Worker resources available in the cluster, and writes the assignment result to Zookeeper.
  3. Listens on the Thrift interface for Supervisors' requests to download Topology code, and serves the downloads.
  4. Listens on the Thrift interface for the UI's statistics requests, reads the statistics from Zookeeper, and returns them to the UI.
  5. If the process exits and is restarted immediately on the same host, cluster operation is unaffected.

2.2 The Supervisor Process

Also known as a Worker Node, the Supervisor is the resource manager of the Storm cluster, starting Worker processes on demand. Its main functions are as follows:

  1. Periodically checks Zookeeper for newly assigned Topology code that has not yet been downloaded to the local machine, and periodically deletes the code of old Topologies.
  2. Starts one or more Worker processes on the machine according to Nimbus's task assignment, and monitors all of the Worker processes.
  3. If the process exits and is restarted immediately on the same host, the cluster keeps running properly.

2.3 Functions of ZooKeeper

Both the Nimbus and Supervisor processes are designed to fail fast (the process self-destructs on any unexpected situation) and to be stateless (all state is kept in Zookeeper or on disk). The advantage of this design is that if a process dies unexpectedly, it only needs to read its previous state back from Zookeeper after restarting; no data is lost.

2.4 Worker Processes

Worker processes are the Task containers of the Storm cluster: a Worker constructs the Task instances of Spouts or Bolts and starts Executor threads. The main functions are as follows:

  1. Starts one or more Executor threads in the process according to the Task assignment on Zookeeper, and hands the constructed Task instances to the Executors to run.
  2. Writes heartbeats to Zookeeper.
  3. Maintains transfer queues and sends Tuples to other Worker processes.
  4. If the process exits and is restarted immediately on the same host, the cluster keeps running properly.

2.5 Executor Threads

Executor threads are the Task executors of the Storm cluster; each Executor runs its Task code in a loop. The main functions are as follows:

  1. Executes one or more Tasks.
  2. Implements the Acker mechanism, sending Task processing status to the Worker that hosts the corresponding Spout (see the sketch below).
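
As a sketch of a Bolt's side of this tracking: it anchors the Tuples it emits to its input and acks the input when done, and the acker logic uses these signals to notify the originating Spout:

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AckingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context,
                        OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Anchoring: the emitted tuple joins the input's tuple tree, so a
        // downstream failure propagates back to the Spout's fail().
        collector.emit(input, new Values(input.getStringByField("sentence")));
        // Ack tells the acker that this input has been fully handled here.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}
```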

2.6 Parallelism

A Worker process executes a subset of one Topology; a Worker never serves multiple Topologies. A running Topology therefore consists of multiple Worker processes spread across multiple physical machines in the cluster. Each Worker process starts one or more Executor threads to run the Components (i.e., the Spouts and Bolts) of one Topology.

An Executor is a single thread started by a Worker process. Each Executor runs one or more Tasks of the same Component.

Tasks are the units of work that make up a Component. The number of Tasks per Component is fixed once the Topology starts, but the number of Executor threads used by the Component can be adjusted dynamically. Since one Executor thread can run one or more Task instances of the same Component, #threads <= #tasks (the number of threads is less than or equal to the number of tasks) always holds for a Component. By default, the number of Tasks equals the number of Executor threads, i.e., each Executor runs exactly one Task.
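
A sketch of how the three levels are set (the numbers are illustrative): the parallelism hint gives the initial number of Executors, while setNumTasks fixes the number of Tasks:

```java
import org.apache.storm.Config;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismExample {
    public static StormTopology build(Config conf) {
        // Two Worker processes (JVMs) for this topology.
        conf.setNumWorkers(2);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new SentenceSpout(), 2); // 2 Executors, 2 Tasks (default 1:1)
        builder.setBolt("filter", new FilterBolt(), 2)     // 2 Executors...
               .setNumTasks(4)                             // ...but 4 Tasks: 2 Tasks per Executor
               .shuffleGrouping("spout");
        return builder.createTopology();
    }
}
```

On a live topology, the number of Workers and Executors can later be changed with the storm rebalance command, but the number of Tasks stays fixed.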

In summary:

  • A running Topology consists of multiple Worker processes in the cluster;
  • Each Worker process starts one or more Executor threads;
  • By default, each Executor runs exactly one Task;
  • Tasks are the units of work that make up a Component.

References

  1. Storm Documentation: Concepts
  2. Internal Working of Apache Storm
  3. Understanding the Parallelism of a Storm Topology
  4. Handling of Storm Nimbus Single Node Failure

For more articles in the big data series, see the GitHub open source project: Getting Started with Big Data.