If you're starting a new Hadoop project, odds are it will look a lot like one of these seven common projects.

Project 1: Data integration

Call it an “enterprise data hub” or a “data lake”: the idea is that you have disparate data sources and you want to perform analysis across them. Such projects involve getting feeds from all the sources (real-time or batch) and storing them in Hadoop. Sometimes this is the first step toward becoming a “data-driven company”; sometimes you simply want nice reports. An enterprise data hub usually consists of files in HDFS and tables in Hive or Impala. Increasingly, HBase and Phoenix are part of the picture too, opening a brave new world of big data integration.
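For a concrete flavor, here is a minimal PySpark sketch of the landing step: reading raw files out of HDFS and registering them as a Hive table that Hive, Impala, or BI tools can then query. The paths, database name, and CSV options are assumptions for illustration, not a prescribed layout.

    from pyspark.sql import SparkSession

    # Hive support lets Spark register tables in the shared metastore.
    spark = (SparkSession.builder
             .appName("data-lake-ingest")
             .enableHiveSupport()
             .getOrCreate())

    # Read raw landed files from HDFS (path and header option are assumptions).
    raw = spark.read.option("header", "true").csv("hdfs:///landing/sales/*.csv")

    # Persist as a Hive table so Hive, Impala, and BI front ends can query it.
    raw.write.mode("overwrite").saveAsTable("datalake.sales_raw")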

Salespeople like to talk about “schema on read,” but the truth is that to be successful you need a clear idea of what your use cases will be (the Hive schema won’t end up looking much different from what you’d do in an enterprise data warehouse). The real appeal of a data lake is far greater horizontal scalability at a much lower cost than Teradata or Netezza. For front-end analysis, a lot of people use Tableau and Excel; more sophisticated companies put “data scientists” in front of Zeppelin or IPython notebooks.

Project 2: Specialized analysis

Many data integration projects actually begin with a specialized analysis of one particular need against one particular dataset. These tend to be incredibly domain-specific, such as liquidity risk/Monte Carlo simulation in banking. In the past, such specialized analyses relied on antiquated, proprietary packages that couldn’t scale with the data and often suffered from a limited feature set (in large part because the software vendor couldn’t possibly know as much about the domain as the specialized institutions themselves).

In the Hadoop and Spark world, these systems look roughly like data integration systems, but they tend to have more HBase, more custom non-SQL code, and fewer data sources (if not just one). Increasingly, they are Spark-based.
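As a hedged illustration of why Spark fits this niche, here is a toy Monte Carlo value-at-risk sketch in PySpark. The trial count, volatility, and ten-day horizon are invented for the example; this is not a real risk model.

    import random
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("monte-carlo-var").getOrCreate()
    sc = spark.sparkContext

    NUM_TRIALS = 100000   # hypothetical number of simulated paths
    DAILY_VOL = 0.02      # assumed daily portfolio volatility

    def simulate(_):
        # One simulated 10-day return under a naive normal model.
        return sum(random.gauss(0.0, DAILY_VOL) for _ in range(10))

    # Spread the trials across the cluster, then pull back the worst 5%.
    returns = sc.parallelize(range(NUM_TRIALS), 100).map(simulate)
    worst = returns.takeOrdered(NUM_TRIALS // 20)
    print("approx. 95% VaR:", -worst[-1])

The point is less the model than the shape of the code: the domain logic is a small custom function, and Spark handles the scale-out that the old proprietary packages couldn’t.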

Project 3: Hadoop as a Service

Any large organization with a few “specialized analysis” projects (and, ironically, one or two “data integration” projects) will inevitably start to feel the “joy” (that is, the pain) of managing several differently configured Hadoop clusters, sometimes from different vendors. Next they’ll say, “Maybe we should consolidate these clusters,” rather than leave most of the nodes idle most of the time. They could move to the cloud, but many companies can’t or won’t, often for security reasons (read: internal politics and job protection). This usually means a lot of Docker containers.
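As a sketch of what the container side can look like, the snippet below uses the Docker SDK for Python (pip install docker) to spin up per-tenant worker containers; the image name, container names, and environment variable are all hypothetical.

    import docker

    client = docker.from_env()

    # Launch three isolated workers for one tenant from a hypothetical image.
    for i in range(3):
        client.containers.run(
            "example/hadoop-worker:latest",          # hypothetical image name
            name="tenant-a-worker-%d" % i,
            detach=True,
            environment={"CLUSTER_ID": "tenant-a"},  # hypothetical setting
        )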

I haven’t used it myself, but Bluedata appears to have a solution here, one that should also appeal to smaller enterprises that lack the resources to deploy Hadoop as a service on their own.

Project 4: Streaming analytics

A lot of people would call this “streaming,” but streaming analytics is different from streaming from devices. Usually, streaming analytics is a real-time version of something an organization already does in batch. Take anti-money-laundering or fraud detection: why not catch it on a per-transaction basis as it happens rather than at the end of a cycle? The same goes for inventory management or anything else.

In some cases, this is a new kind of transactional system that analyzes the data bit by bit as you shunt it, in parallel, into an analytics system. Such systems usually manifest as Spark or Storm in front of a common data store such as HBase. Note that streaming analytics does not replace the other forms of analysis; for questions you’ve never considered before, you’ll still want to analyze historical trends or look at past data.
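To make the fraud-detection idea concrete, here is a minimal Spark Structured Streaming sketch that flags large transactions as they arrive from Kafka. The broker address, topic name, message schema, and threshold are all assumptions, and running it requires the spark-sql-kafka connector package.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("txn-monitor").getOrCreate()

    # Assumed JSON layout of a transaction message.
    schema = StructType([
        StructField("account", StringType()),
        StructField("amount", DoubleType()),
    ])

    # Subscribe to a hypothetical 'transactions' topic.
    txns = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "transactions")
            .load())

    # Flag each oversized transaction the moment it arrives,
    # instead of finding it in an end-of-day batch job.
    flagged = (txns
               .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
               .where(F.col("t.amount") > 10000))   # hypothetical threshold

    flagged.writeStream.format("console").start().awaitTermination()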

Project 5: Complex event processing

Here we are talking about sub-second, real-time event processing. While it isn’t fast enough for ultra-low-latency (picosecond or nanosecond) applications such as high-end trading systems, you can expect millisecond response times. Examples include real-time rating of call data records at a telecom carrier or the processing of Internet of Things events. Sometimes you’ll see such systems built with Spark and HBase, but they generally fall on their faces and have to be converted to Storm, which is based on the Disruptor pattern developed by the LMAX Exchange.

In the past, such systems were based on custom messaging code or high-performance, off-the-shelf client-server messaging products, but today’s data volumes are too much for either. I haven’t used it yet, but the Apex project looks promising and claims to be faster than Storm.
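For readers unfamiliar with the Disruptor pattern mentioned above, here is a deliberately simplified, single-producer/single-consumer ring buffer in Python. It is a toy to show the core idea (pre-allocated slots plus monotonically increasing sequence numbers instead of locks), not the real LMAX implementation.

    SIZE = 1024              # power-of-two ring size, as in the Disruptor
    ring = [None] * SIZE
    write_seq = 0            # next slot the producer will fill
    read_seq = 0             # next slot the consumer will drain

    def publish(event):
        global write_seq
        while write_seq - read_seq >= SIZE:   # ring full: spin, don't lock
            pass
        ring[write_seq % SIZE] = event
        write_seq += 1

    def consume():
        global read_seq
        while read_seq >= write_seq:          # ring empty: spin
            pass
        event = ring[read_seq % SIZE]
        read_seq += 1
        return event

    publish({"call_id": 1, "duration_ms": 42})  # e.g., one call data record
    print(consume())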

Project 6: Streaming ETL

Sometimes you want to capture streaming data and warehouse it. These projects usually overlap with No. 1 or No. 2, but add their own scope and characteristics. (Some people think they’re doing No. 4 or No. 5, but in reality they’re dumping data to disk and analyzing it later.) These are almost always Kafka and Storm projects. Spark gets used too, but for no good reason, since you don’t actually need the in-memory analytics.
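A minimal sketch of that “dump it to disk” step, using the kafka-python client (pip install kafka-python); the topic, broker, and output path are assumptions, and a production pipeline would batch into HDFS (for instance via a Storm HDFS bolt) rather than append to a local file.

    from kafka import KafkaConsumer

    # Subscribe to a hypothetical 'events' topic from the beginning.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="broker:9092",
        auto_offset_reset="earliest",
    )

    # Persist each raw message; analysis happens later, from the stored copy.
    with open("/data/events.log", "ab") as sink:
        for message in consumer:
            sink.write(message.value + b"\n")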

Project 7: Replacing or augmenting SAS

SAS is fine; SAS is dandy. But SAS is also expensive, and you’re not buying more boxes just so your data scientists and analysts can “play” with the data. Besides, you can do things differently from what SAS does, and produce prettier graphical analyses. That’s what your data lake is for, with IPython notebooks (now) and Zeppelin (later) as the front end. You can still feed the results into SAS, and store results from SAS in the lake.
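The notebook workflow can be as simple as the sketch below: query the data lake with Spark SQL, pull the aggregate down, and chart it inline. The table and column names are the hypothetical ones from the ingestion sketch in Project 1.

    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("adhoc-analysis")
             .enableHiveSupport()
             .getOrCreate())

    # Aggregate straight out of the data lake (hypothetical table/columns).
    monthly = spark.sql("""
        SELECT month, SUM(amount) AS total
        FROM datalake.sales_raw
        GROUP BY month
        ORDER BY month
    """).toPandas()

    # The quick, pretty graphic the analysts actually wanted.
    monthly.plot(x="month", y="total", kind="bar")
    plt.show()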

I see other types of Hadoop, Spark, or Storm projects every day, but these seven are the normal ones. If you use Hadoop, you probably recognize them. I implemented some of these same projects years ago, using other technologies.

If you’re an old-timer who feels too “big” for “big data” or Hadoop, don’t worry. The more things change, the more they stay the same. You’ll find plenty of parallels between the things you used to deploy and the snazzy new technologies swirling around the Hadooposphere.


The original article was published on June 26, 2018

Author: Alukar

This article is from CDA Data Analyst, a partner of the cloud community. For more information, please follow CDA Data Analyst.