Moment For Technology

How to build a good big data platform architecture

Posted on Aug. 9, 2023, 3:47 a.m. by Eva Barr
Category: Back-end Tag: Back-end

1. Lambda architecture requirements

The Lambda architecture grew out of the latency problems of the MapReduce (MR) architecture. MR delivers a distributed, scalable data processing system, but it processes data with serious delay. In principle, with enough memory and CPU, MR could achieve near-real-time computing, but real business environments rarely have that capacity. We therefore have to decide, based on data volume and available resources, which workloads go to real-time processing and which go to batch processing.

The Lambda architecture is a data processing framework proposed by Nathan Marz, the author of Storm, in 2012. It aims to provide the key properties of a real-time big data system: high fault tolerance, low latency, and scalability. It combines offline and real-time computing, builds on immutability, read/write separation, and complexity isolation, and can integrate big data components such as Hadoop, Kafka, Storm, Spark, and HBase.

2. Key properties of the Lambda architecture

  • Scalability

Scalability means the system can meet growing user demand either by adding memory or disk to existing machines (vertical scaling) or by adding machines to the cluster (horizontal scaling), without changes to the underlying architecture or code. Both the real-time path and the batch path should support horizontal scaling without interrupting service.

  • Fault tolerance

The system must handle failures gracefully so that the service as a whole stays available when some components fail. A component failure may take down some nodes in the cluster and affect the SLA, but the system must still respond, and it must have no single point of failure.

  • Low latency

Many applications have strict latency requirements for reads and writes and need low-latency responses to both updates and queries.

  • Extensible

The system must be flexible enough to accommodate new and changed requirements without refactoring the whole system. Because real-time processing is isolated from batch processing, each can be modified independently.

  • Easy to maintain

Development and deployment should not be too complex.

3. The layers of the Lambda architecture

When new data arrives in a Lambda architecture, it is dispatched to both the batch layer and the real-time (fast processing) layer. The batch layer recomputes the batch view from scratch on each run, at its normal batch interval. The real-time layer, in turn, updates a real-time view as soon as new data arrives. When a query reaches the service layer, the service layer merges the real-time view and the batch view to produce the result. Once a batch view has been generated, the real-time view covering the same data is discarded. Until new data arrives, queries only need the batch view, because at that point the batch layer already holds all the data.
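The flow described above can be sketched in a few lines of Python. This is a minimal, single-process illustration under stated assumptions, not how a production system is built: the `Event` record, the `LambdaPipeline` class, and the page-view counting metric are all hypothetical, and the batch run is treated as instantaneous so that no events arrive while it is in progress.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable input record (hypothetical shape: a page view)."""
    page: str


@dataclass
class LambdaPipeline:
    """Toy model of the batch layer, real-time layer, and service layer."""
    master: list = field(default_factory=list)            # append-only master data
    batch_view: Counter = field(default_factory=Counter)
    realtime_view: Counter = field(default_factory=Counter)

    def ingest(self, event: Event) -> None:
        # New data is dispatched to both layers.
        self.master.append(event)              # batch layer: append only
        self.realtime_view[event.page] += 1    # real-time layer: incremental update

    def run_batch(self) -> None:
        # Recompute the batch view from scratch over all master data,
        # then discard the real-time view that it now covers.
        self.batch_view = Counter(e.page for e in self.master)
        self.realtime_view = Counter()

    def query(self, page: str) -> int:
        # Service layer: merge the batch view and the real-time view.
        return self.batch_view[page] + self.realtime_view[page]


pipeline = LambdaPipeline()
pipeline.ingest(Event("home"))
pipeline.ingest(Event("home"))
before_batch = pipeline.query("home")   # answered from the real-time view only
pipeline.run_batch()
pipeline.ingest(Event("home"))
after_batch = pipeline.query("home")    # batch view plus newer real-time data
```

In this sketch a query always sees every ingested event, whether the answer comes from the real-time view alone, the batch view alone, or both merged.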

The Lambda architecture defines the major layers and how they integrate. The layers are:

  • Data source

Data sources are external databases, message queues, files, and so on. A data consumption layer can be built on top of them to hide the complexity of the different access methods and to define common data formats.

  • Data consumption layer

Encapsulates the complexity of acquiring data from the data sources and transforms it into a single format that batch and stream processing can both consume.

  • Batch layer

This is one of the core layers of the Lambda architecture. The batch layer accepts data, persists it in user-defined data structures, and maintains the master data. Existing data is generally not modified; new data is only appended. The batch layer is also responsible for creating and maintaining batch views. For example, a Hive ETL job that collects data and saves the result to a Hive table or a database belongs to the batch layer.

  • Real-time layer

This is the other core layer of the Lambda architecture. Batch processing is sufficient in many scenarios, but as business requirements grow more demanding, users want to see data promptly instead of waiting until the next day for metric changes and analysis results, so real-time processing was introduced. The real-time layer stores only the recent slice of data that must be immediately available to users, so it never has to process the full dataset and is much more efficient as a result. For example, a stream job may keep only the last 5 minutes of data, compute over it, and emit the result; this is the time window concept used in Spark Streaming.

  • Service layer

This is the final layer of the Lambda architecture. The service layer collects the results of batch and stream processing and exposes them to users as a unified query view.
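To make the data consumption layer and the 5-minute window of the real-time layer more concrete, here is a small plain-Python sketch. It does not use the Spark Streaming API; the `normalize` function, the `SlidingWindowSum` class, and the `ts` and `amount` field names are assumptions made only for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60  # the 5-minute window mentioned in the text


def normalize(raw):
    """Data consumption layer: map source-specific records to (timestamp, value)."""
    if isinstance(raw, dict):        # e.g. a message-queue payload (hypothetical fields)
        return float(raw["ts"]), float(raw["amount"])
    ts, amount = raw.split(",")      # e.g. one line of a CSV file
    return float(ts), float(amount)


class SlidingWindowSum:
    """Real-time layer: keep only the last window of events and sum their values."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.events = deque()        # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Assumes events arrive in timestamp order.
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Drop every event that has fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def result(self):
        return self.total


window = SlidingWindowSum()
for raw in ({"ts": 0, "amount": 10}, "100,5", "301,1"):
    window.add(*normalize(raw))
```

After the third event (timestamp 301), the first event (timestamp 0) is more than 5 minutes old and is evicted, so only the two most recent values contribute to the result.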

4. Summary of Lambda architecture

The Lambda architecture has become a standard choice for many companies' big data platforms because it covers both offline batch processing and real-time processing of a company's data.

Data originates in the underlying data sources and enters the big data platform in various formats. Inside the platform it is collected by components such as Kafka and Flume and then split into two paths. One path feeds a stream computing platform (such as Storm, Flink, or Spark Streaming) that computes metrics in real time. The other feeds an offline batch computing platform (such as MapReduce, Hive, or Spark SQL) that computes T+1 business metrics, which are only available the following day.

After years of development, the Lambda architecture is very stable. The cost of real-time computing is controllable, and batch jobs can run overnight, so the load peaks of real-time and offline computing do not coincide. It does, however, have several drawbacks:

  • The real-time and batch calculation results are inconsistent

Because batch and real-time computing run on two different frameworks and two different programs, their results often differ. It is common to see one value for a metric on a given day and then find, the next day, that the value for that same day has changed.

  • Reliability of batch processing

As data volumes grow, teams often find that the 4 or 5 hour overnight window is no longer enough to process the data accumulated over the other 20 or so hours of the day. Making sure the data is ready before work starts in the morning has become a headache for every big data team. Running many tasks in parallel to catch up also puts the stability of the cluster to a severe test: tasks often fail to start on schedule because of insufficient resources, or they report errors.

  • Complexity of development and maintenance

The same business logic is implemented twice in the Lambda architecture: once in the ETL system for batch computation and once in the streaming system for stream computation. The result is two code bases for the same business problem, each with its own defects.

  • Fast storage growth

A poorly designed data warehouse produces a large number of intermediate result tables, so data volume expands rapidly and storage pressure on the servers increases. For example, teams often struggle with how to layer the data warehouse: should data go directly from the ODS layer to the application, or from the ODS layer through layers such as DW and DWS before reaching the application?
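The inconsistency and duplicated-logic drawbacks listed above can be illustrated with a deliberately simple, hypothetical case: two independently written implementations of an "active users" metric. The sample events, the function names, and the 5-minute window are assumptions made for this sketch; real divergences usually come from subtler differences such as late data or deduplication rules.

```python
# (minute_of_day, user_id) pairs; hypothetical sample data
EVENTS = [(1, "alice"), (2, "bob"), (7, "alice"), (8, "carol")]


def batch_active_users(events):
    """Batch job: deduplicate users over the whole day."""
    return len({user for _, user in events})


def streaming_active_users(events, window_minutes=5):
    """Streaming job: deduplicate inside each window, then add the windows up."""
    windows = {}
    for minute, user in events:
        windows.setdefault(minute // window_minutes, set()).add(user)
    return sum(len(users) for users in windows.values())


batch_result = batch_active_users(EVENTS)
streaming_result = streaming_active_users(EVENTS)
```

Here the batch job counts three distinct users, while the streaming job counts four, because "alice" appears in two windows and is counted twice. Both programs look reasonable in isolation, which is why this kind of mismatch tends to surface only when the next day's batch numbers replace the real-time ones.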

The Lambda architecture has its drawbacks, but it still works for many companies. When business volume is modest and the need for real-time results is not pressing, the Lambda architecture remains a perfectly good choice. For businesses with large data volumes, or with real-time workloads of comparable size, improved variants of Lambda are worth exploring. The industry has also proposed the Kappa architecture, which interested readers can look up and study.
