Translation: StreamNative-Sijia

How quickly data can be processed, understood, and responded to is critical for many event-driven applications. In the analysis and data processing behind these scenarios, calculating exact values can be too time-consuming or unreasonably resource-intensive, so it often makes more sense to get an approximate result within a bounded amount of time than to wait for an exact one. For example, to calculate the exact number of unique visitors to a web page or site, you need to keep every previous unique-visitor record for comparison. A count of unique identifiers is not additive across partitions, so parallelism does not help at all.

If a use case does not require exact results and can tolerate approximations, a variety of techniques and algorithms can compute good approximations more quickly and with far less memory. Better yet, several open source libraries implement each of the patterns covered in this article, making it relatively easy to use them inside an Apache Pulsar Function.

Approximation design patterns

These patterns refer to techniques that provide approximations, estimates, and random data samples for statistical analysis when the stream of events is too large to store, or when the data is moving too fast to process exactly.

We can take advantage of algorithms that work with small data structures and do not retain large amounts of data. These data structures are usually only kilobytes in size and are known as sketches. Sketch algorithms are also streaming algorithms, since each incoming item only needs to be examined once. These two properties make them ideal for deployment on edge devices.

Pattern 1: Set membership

Sometimes we need to confirm, with reasonable certainty and without querying an external data store, whether we have already seen a given stream element. Since we cannot keep the entire stream history in memory, we rely on an approximation technique built on the Bloom filter: a space-efficient probabilistic data structure that can be used to test whether an element is a member of a set.

Figure 1: The Bloom filter algorithm

As shown in Figure 1, every Bloom filter has two key elements:

  1. An n-bit array, initialized to all zeros
  2. A set of k independent hash functions h(x), each of which takes a value as input and generates a number less than n

When a new element is added to the filter, all of the hash functions are applied to that element. The resulting hashes are treated as indexes into the bit array, and the corresponding array elements are set to 1.

When checking whether an element already exists in the filter, we apply the hash functions again, but this time we perform an array lookup at each hash index. If at least one of them is zero, the element is definitely not in the filter. A key feature of the Bloom filter is the guarantee that it never returns a false negative: it can tell you that an element is definitely not in the set, or that it may be in the set, though additional logic is required to determine that conclusively. The following example uses the Bloom filter implementation from Twitter's stream-lib library, whose simplest application is filtering. When the Pulsar Function processes a new event, it first checks whether we have seen the event before. If not, it routes the event to a "not seen" topic for further processing.

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.membership.BloomFilter;
 
public class BloomFilterFunction implements Function<String, Void> {
    BloomFilter filter = new BloomFilter(20, 20);

    @Override
    public Void process(String input, Context context) throws Exception {
        if (!filter.isPresent(input)) {
            filter.add(input);
            // Route to the "not seen" topic for further processing
            context.publish("notSeenTopic", input);
        }
        return null;
    }
}
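Note that the two constructor arguments to stream-lib's BloomFilter are the expected number of elements and the number of buckets per element, which together determine the filter's size and false positive rate. The values (20, 20) above are simply the ones from the original example; a production function would size the filter for its expected stream.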

Pattern 2: Event frequency

Another common approximate statistic is the frequency with which a particular element appears in an unbounded data stream containing repeating elements. It answers questions such as "How many times has element X appeared in the data stream?", which is especially useful in network monitoring and analysis.

So, if computing an element's frequency is easy, why settle for an approximation? Why not just count the observations of each element and divide by the total number of observations?

In many high-frequency event streams, there is simply not enough time or memory for such calculations. Imagine analyzing and sampling network traffic on just one 40 Gbps connection, which can carry roughly 78 million 64-byte packets per second (40 × 10⁹ bits per second ÷ 512 bits per packet). For a single 40 Gbps connection there is not enough time to perform the calculations, nor enough space to store the data, let alone for a network made up of many such segments.

Figure 2: The Count-Min Sketch algorithm

In this case, because precise results cannot be computed in time, you can choose an estimate with acceptable accuracy instead. The most commonly used algorithm for frequency estimation is the Count-Min Sketch, which provides approximate counts without storing the data itself. As shown in Figure 2, the Count-Min Sketch algorithm uses two elements:

  1. An M × K matrix of counters, each initialized to 0, where each row corresponds to one hash function
  2. A set of K independent hash functions h(x)

When a new element is added to the sketch, each hash function is applied to the element. Each hash is treated as an index into that function's row of the counter matrix, and the corresponding counter is incremented by 1.

Approximate counts for every element are now stored in the M × K matrix, so you can quickly determine how many times element X has appeared in the stream: apply each hash function to the element and retrieve all of the corresponding counters, exactly as you did when inserting. This time, however, instead of incrementing the counters, take the minimum value among them as the estimate of the event count.

The event frequency estimated by this algorithm can only be too high, never too low, because counters are only ever incremented: hash collisions can inflate a count but never reduce it, so the returned value is an upper bound on the true count. The accuracy of the Count-Min Sketch is determined by the number of hash functions k. To achieve an error probability δ, choose k ≥ ln(1/δ); a moderately large value such as k = 5 gives an error probability of about 1%, since ln(1/0.01) ≈ 4.6.

As soon as an input is added to the sketch, you can obtain an estimated count for it and use that count in any form of logic: simple functions such as filtering events or raising an alarm when a count exceeds a threshold, or more complex ones such as publishing updated counts to an external data store for display on a dashboard.
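The original article provides no code listing for this pattern, but a minimal sketch of such a Pulsar Function, using the CountMinSketch class from the same stream-lib library used elsewhere in this article, might look like the following. The depth/width/seed parameters, the threshold of 1,000, and the "alertTopic" topic name are all illustrative assumptions:

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.frequency.CountMinSketch;

public class EventFrequencyFunction implements Function<String, Void> {
    // depth = 5 hash functions, width = 2048 counters per row, fixed seed
    // (all assumed values, chosen only for illustration)
    CountMinSketch sketch = new CountMinSketch(5, 2048, 42);

    @Override
    public Void process(String input, Context context) throws Exception {
        // Increment the counters for this element
        sketch.add(input, 1);
        // Retrieve the estimated count (an upper bound on the true count)
        long estimate = sketch.estimateCount(input);
        // Example logic: raise an alert once an element crosses a threshold
        if (estimate > 1000) {
            context.publish("alertTopic", input);
        }
        return null;
    }
}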

Pattern 3: Most frequent K items (Top-K)

Another common use of the Count-Min Sketch algorithm is maintaining a list of frequent items, often referred to as "heavy hitters." This design pattern tracks items that occur more frequently than some predefined threshold, typically as a list of the top K items: the K most frequent items in a data stream, such as the top 10 most-viewed tweets on Twitter or the top 20 most popular products on Amazon.

This design pattern has several other practical applications, such as detecting IP addresses that are abnormally sending large amounts of data (for example, during a denial-of-service attack, in which the attacker makes a large number of requests to the target service, consuming service resources and making the service unavailable to normal users), and identifying stocks that are heavily traded.

The most-frequent-K-items problem can also be solved with the Count-Min Sketch algorithm. The logic for updating counts is exactly the same as in the event frequency use case, but there is an additional list of length K that holds the K most frequent elements seen so far, as shown in Figure 3. To add an element to the Top-K sketch, perform the following steps:

  • Apply all of the hash functions to the element. Each hash is treated as an index into the corresponding row of the counter matrix, and that counter is incremented by 1.
  • As in the event frequency use case, compute the element's estimated frequency by applying all of the hash functions and retrieving the corresponding counters, exactly as when inserting. This time, however, instead of incrementing the counters, take the minimum value among them as the approximate event count.
  • Compare the element's estimated frequency with the smallest count among the current top-K elements. If the new frequency is larger, remove the element with the smallest count and replace it with the new element.

Figure 3: Data flow of the Top-K algorithm

Figure 3 shows the sequence of events when an element is added to the Top-K sketch. First, the k independent hash functions are applied and each corresponding counter is incremented by one (for example, 98 + 1). Next, the minimum of the updated counters (99, 108, 312, 681, 282) is computed, which in this example is 99, and that value is compared with the smallest count currently in the top-K list. If the new estimate is larger, the element with the smallest count is replaced with the new element.

In the Pulsar Function code example below, the StreamSummary class performs all of the work of updating the top-K list, so we simply call its offer method to add each element to the sketch. If the element is in the top-K list, it can then be routed to a priority topic for further processing.

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;

import java.util.List;

public class CountMinFunction implements Function<String, Void> {
    StreamSummary<String> summary = new StreamSummary<String>(256);

    @Override
    public Void process(String input, Context context) throws Exception {
        // Add the element to the sketch
        summary.offer(input, 1);
        // Grab the updated top 10 list
        List<Counter<String>> topK = summary.topK(10);
        return null;
    }
}
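The example above stops after retrieving the list. A hypothetical extension implementing the priority routing described before the listing (the "priorityTopic" name is an assumption) might modify the process method as follows:

    @Override
    public Void process(String input, Context context) throws Exception {
        summary.offer(input, 1);
        List<Counter<String>> topK = summary.topK(10);
        // Hypothetical routing: if the element is currently in the top-K
        // list, forward it to a priority topic for further processing.
        boolean inTopK = topK.stream()
                .anyMatch(counter -> counter.getItem().equals(input));
        if (inTopK) {
            context.publish("priorityTopic", input);
        }
        return null;
    }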

The top-K list is updated immediately after an input is added to the sketch, and it can drive any form of logic, such as publishing the updated top-K list to an external data store for display on a dashboard.

Pattern 4: Counting distinct elements

In some use cases, the data stream contains repeating elements, but we want to count the number of distinct elements (for example: IP addresses, unique visitors, ad impressions, and so on). In computer science, this is a well-known problem called the count-distinct problem. In resource-constrained environments, where we cannot store the entire stream in memory, we must rely on probabilistic algorithms, which provide very efficient approximations. Two families of algorithms are used to solve the count-distinct problem:

  • Bit-pattern based algorithms: estimate the number of distinct elements in the data stream from patterns in the binary representations of the hashed values. Such algorithms include LogLog, HyperLogLog, HyperLogLog++, and so on.

  • Order-statistics based algorithms: estimate cardinality from order statistics, for example the minimum value seen in the stream. Such algorithms include MinCount, Bar-Yossef, and so on.

HyperLogLog

The basic flow of the HyperLogLog algorithm consists of four steps, as shown in Figure 4. First, the element to be counted is passed through a hash function, and the resulting hash value is converted to a binary string.

The p least significant bits of the binary string determine which register will be updated. The value of p determines the accuracy of the estimate and must be greater than zero. The larger p is, the more accurate the estimate, but the trade-off is space, which grows exponentially with p: the sketch requires 2^p registers.

Once the register location is known, the algorithm exploits the "bit-pattern observable": it counts the consecutive zeros at the right-hand end of the remaining bit string, increasing the count by one for each 0 until the first 1 is reached. That count, plus one, is then stored in the previously determined register (keeping the larger of the old and new values).

Figure 4: Data flow of the HyperLogLog algorithm

Fortunately, there is an Apache-licensed implementation of the HyperLogLog algorithm that we can use inside an Apache Pulsar Function:

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;
import io.airlift.stats.cardinality.HyperLogLog;
 
public class HyperLogLogFunction implements Function<Integer, Void> {
    HyperLogLog hll = HyperLogLog.newInstance(2048);

    @Override
    public Void process(Integer value, Context context) throws Exception {
        hll.add(value);
        long numDistinctElements = hll.cardinality();
        // Do something with the distinct element count
        return null;
    }
}

To deepen your understanding, let's walk through the process of adding a data element to HyperLogLog. As shown in Figure 5, the original value to be added, 10,529,222, is on the left, and the hash value generated from it, 2,386,714,787, is on the right.

The binary string representation of the hash value is shown below, where the bits in blue are the six least significant bits and the bits in green are the remainder. The six least significant bits represent the decimal value 35, which is used as the register index.

Next, we count the consecutive zeros at the right-hand end of the remaining bits before the first 1 appears, which gives 1. Adding one to that count gives 2, and this value is placed in register 35, marked in red. This process is repeated for each newly added element.
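The bit manipulation in this walkthrough is easy to reproduce in a few lines of Java. The sketch below is purely illustrative of the steps in Figure 5; it is not the airlift library's internal code, and every name in it is hypothetical:

public class HllRegisterUpdateExample {
    public static void main(String[] args) {
        int p = 6;                      // 6 index bits -> 2^6 = 64 registers
        long hash = 2_386_714_787L;     // hash of the original value 10,529,222

        // The six least significant bits select the register.
        int registerIndex = (int) (hash & ((1L << p) - 1));     // = 35

        // Count the consecutive zeros at the right-hand end of the
        // remaining bits, then add one.
        long remaining = hash >>> p;
        int rank = Long.numberOfTrailingZeros(remaining) + 1;   // 1 zero + 1 = 2

        System.out.println(registerIndex + " -> " + rank);      // prints "35 -> 2"
    }
}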

Figure 5: Element processing in the HyperLogLog algorithm

Conclusion

This article introduced several approximation algorithms that, thanks to their space efficiency and constant-time performance, are an excellent choice for analyzing streaming datasets.

We delved into real-time analysis built on probabilistic algorithms and gave examples of Pulsar Functions that take advantage of existing open source implementations of these algorithms, making it easier to get started.

This article has shown that it is simple and practical to incorporate existing implementations of these sophisticated algorithms into Apache Pulsar Functions and deploy them in resource-constrained environments. You can take advantage of these approximation algorithms without having to understand their internals or implement them yourself.

Want to keep up to date with Pulsar's development, user stories, and hot topics? Follow the Apache Pulsar and StreamNative WeChat accounts to be the first to hear about everything Pulsar.

Original link: www.splunk.com/en_us/blog/…