Hadoop and Big Data: 60 Top Open Source Tools

Hadoop-related tools

1. Hadoop

Apache’s Hadoop project has become almost synonymous with big data. It has grown into an entire ecosystem of open source tools for highly scalable distributed computing.

Supported operating systems: Windows, Linux, and OS X

2. Ambari

As part of the Hadoop ecosystem, this Apache project provides an intuitive Web-based interface for configuring, managing, and monitoring Hadoop clusters. For developers who want to incorporate Ambari’s capabilities into their own applications, Ambari also provides APIs built on REST (Representational State Transfer).

Supported operating systems: Windows, Linux, and OS X
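As a rough sketch of what calling such a REST API can look like, the Python snippet below builds (but does not send) an HTTP request for a cluster list. The host name, credentials, port 8080, and the `/api/v1/clusters` path are assumptions for illustration; check them against your own Ambari installation before relying on them.

```python
import base64
import urllib.request

def build_clusters_request(host, user, password, port=8080):
    """Build (but do not send) a GET request for the cluster list.

    The port and the /api/v1/clusters path are assumptions for this sketch.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    # HTTP basic authentication header built from the supplied credentials
    request.add_header("Authorization", f"Basic {token}")
    return request

# Hypothetical host and credentials, used only to show the request shape
req = build_clusters_request("ambari.example.com", "admin", "secret")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without a cluster.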

3. Avro

This Apache project provides a data serialization system with rich data structures and compact formats. Schemas are defined in JSON, which is easy to integrate with dynamic languages.

Supported operating systems: OS independent
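To illustrate what a JSON-defined schema looks like, the Python sketch below loads a hypothetical record schema using only the standard `json` module. The record name, namespace, and field names are invented for this example, and no Avro library is involved.

```python
import json

# Hypothetical Avro-style record schema; the record and field names are
# made up for illustration. The schema itself is an ordinary JSON document.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
"""

# Because the schema is plain JSON, a dynamic language can load and inspect
# it directly, with no code generation step.
schema = json.loads(USER_SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```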

4. Cascading

Cascading is a Hadoop-based application development platform. Commercial support and training services are available.

Supported operating systems: OS independent

Relevant link: www.cascading.org/projects/ca…

5. Chukwa

Chukwa is based on Hadoop and can collect data from large distributed systems for monitoring. It also contains tools for analyzing and displaying data.

Supported operating systems: Linux and OS X

6. Flume

Flume collects log data from other applications and feeds it into Hadoop. “It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms,” the website claims.

Supported operating systems: Linux and OS X

Relevant link: cwiki.apache.org/confluence/…

7. HBase

Designed for very large tables with billions of rows and millions of columns, HBase is a distributed database that provides random, real-time read/write access to big data. It is similar to Google’s Bigtable, but built on Hadoop and the Hadoop Distributed File System (HDFS).

Supported operating systems: OS independent

8. Hadoop Distributed File System (HDFS)

HDFS is the file system designed for Hadoop, but it can also be used as a standalone distributed file system. It is Java-based, fault tolerant, highly scalable, and highly configurable.

Supported operating systems: Windows, Linux, and OS X

Relevant link: hadoop.apache.org/docs/stable…

9. Hive

Apache Hive is a data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, an SQL-like language.

Supported operating systems: OS independent

10. Hivemall

Hivemall is a collection of machine learning algorithms for Hive. It includes a number of highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection, and feature hashing.

Supported operating systems: OS independent

Related links: github.com/myui/hivema…
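Feature hashing, one of the techniques named above, can be sketched in a few lines of Python. This is a generic illustration of the idea, not Hivemall's implementation: the function name, the bucket count, and the use of MD5 as the hash are all assumptions chosen to keep the example self-contained.

```python
import hashlib

def hash_features(features, num_buckets=16):
    """Map named features into a fixed-length vector by hashing their names."""
    vector = [0.0] * num_buckets
    for name, value in features.items():
        # MD5 is only a stand-in here; any stable hash function would do.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        vector[bucket] += value  # features that collide simply add up
    return vector

# Hypothetical feature names and values
vec = hash_features({"country=NZ": 1.0, "device=mobile": 1.0, "clicks": 3.0})
print(len(vec), sum(vec))
```

The vector length stays fixed no matter how many distinct feature names appear, which is the point of the technique; the cost is that unrelated features can share a bucket.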

11. Mahout

According to the official website, the Mahout project aims to “create an environment for rapidly building scalable, high-performance machine learning applications.” It includes a number of algorithms for data mining on Hadoop MapReduce, as well as novel algorithms for Scala and Spark environments.

Supported operating systems: OS independent

12. MapReduce

An integral part of Hadoop, MapReduce is a programming model that provides a way to process large distributed data sets. It was originally developed by Google, but is now used by several other big data tools covered in this article, including CouchDB, MongoDB, and Riak.

Supported operating systems: OS independent

Relevant link: hadoop.apache.org/docs/curren…
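The MapReduce programming model can be illustrated with a single-process word count in Python. This is a teaching sketch only: it does not use Hadoop's API, the function names are invented, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Everything runs sequentially here; the three phases are the same ones
    # a distributed framework would schedule across machines.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data tools", "big clusters process big data"])
print(counts)
```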

13. Oozie

This workflow scheduler is specifically designed to manage Hadoop jobs. It can trigger jobs by time or by data availability, and it integrates with MapReduce, Pig, Hive, Sqoop, and many other related tools.

Supported operating systems: Linux and OS X

14. Pig

Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which offers simplified parallel programming, optimization, and extensibility.

Supported operating systems: OS independent

15. Sqoop

Enterprises often need to transfer data between relational databases and Hadoop, and Sqoop is a tool to do this. It can import data to Hive or HBase and export data from Hadoop to a relational database management system (RDBMS).

Supported operating systems: OS independent

16. Spark

Spark is a data processing engine that serves as an alternative to MapReduce. It claims to be up to 100 times faster than MapReduce when running in memory and up to 10 times faster when running on disk. It can be used with Hadoop and Apache Mesos or on its own.

Supported operating systems: Windows, Linux, and OS X

17. Tez

Built on Apache Hadoop YARN, Tez is “an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.” It allows Hive and Pig to simplify complex jobs that would otherwise require multiple steps.

Supported operating systems: Windows, Linux, and OS X
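To show what a directed acyclic graph of tasks means in practice, the Python sketch below orders a small, hypothetical task graph so that every task runs after the tasks it depends on. It uses the standard-library `graphlib` module (Python 3.9 or later), not the Tez API, and the task names are invented.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dag = {
    "load_orders": set(),
    "load_customers": set(),
    "join": {"load_orders", "load_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}

def run_order(graph):
    """Return one valid execution order; raises CycleError if there is a cycle."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(dag)
print(order)
```

Expressing the whole job as one graph lets a scheduler see that the two load tasks are independent and could run at the same time, instead of chaining separate jobs one after another.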

18. ZooKeeper

This big data management tool describes itself as “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” It allows nodes in a Hadoop cluster to coordinate with each other.

Supported operating systems: Linux, Windows (development environment only), and OS X (development environment only).

2. Big data analysis platforms and tools

19. Disco

Originally developed by Nokia, Disco is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed file system and a database that supports billions of keys and values.

Supported operating systems: Linux and OS X

20. HPCC

As an alternative to Hadoop, HPCC is a big data platform that promises to be extremely fast and scalable. In addition to the free community edition, HPCC Systems also offers a paid enterprise edition, paid modules, training, consulting, and other services.

Supported operating systems: Linux

21. Lumify

Lumify, owned by Altamira Technologies (best known for its national security technology), is an open source big data integration, analysis and visualization platform. Try the demo at try.lumify.io to see it in action.

Supported operating systems: Linux

22. Pandas

The Pandas project includes data structures and data analysis tools based on the Python programming language. It allows enterprise organizations to use Python as an alternative to R for big data analysis projects.

Supported operating systems: Windows, Linux, and OS X

23. Storm

Now an Apache project, Storm provides real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, the Weather Channel, WebMD, Alibaba, Yelp, Yahoo Japan, Spotify, Groupon, Flipboard, and many others.

Supported operating systems: Linux

3. Databases and data warehouses

24. Blazegraph

Blazegraph, formerly known as “Bigdata,” is a highly scalable, high-performance database. It is available under both open source and commercial licenses.

Supported operating systems: OS independent

25. Cassandra

Originally developed by Facebook, this NoSQL database is now used by more than 1,500 organizations, including Apple, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, and others. It can support very large clusters; Apple’s Cassandra deployment, for example, includes more than 75,000 nodes and holds more than 10 petabytes of data.

Supported operating systems: OS independent

26. CouchDB

CouchDB bills itself as “a database that completely embraces the web,” storing data in JSON documents that can be queried through a Web browser and processed with JavaScript. It is easy to use, highly available, and scalable across distributed networks.

Supported operating systems: Windows, Linux, OS X and Android.

27. FlockDB

FlockDB, developed by Twitter, is a very fast and scalable graph database that specializes in storing social network data. Although it is still available for download, the open source version of the project has not been updated for some time.

Supported operating systems: OS independent

Related links: github.com/twitter/flo…

28. Hibari

This Erlang-based project describes itself as “a distributed ordered key-value storage system with a high degree of consistency.” It was originally developed by Gemini Mobile Technologies and is now used by several telecom operators in Europe and Asia.

Supported operating systems: OS independent

Relevant link: hibari.github.io/hibari-doc/

29. Hypertable

Hypertable is a Hadoop-compatible big data database that promises superior performance and is used by eBay, Baidu, Groupon, Yelp, and many other Internet companies. Commercial support services are available.

Supported operating systems: Linux and OS X

30. Impala

Cloudera claims that the SQL-based Impala database is “the leading open source analytics database for Apache Hadoop.” It can be downloaded as a standalone product and is also part of Cloudera’s commercial big data offering.

Supported operating systems: Linux and OS X

Relevant link: www.cloudera.com/content/clo…

31. Infobright Community Edition

Designed for data analysis, Infobright is a column-oriented database with a high compression ratio. Infobright.com offers a paid product based on the same code and provides support services.

Supported operating systems: Windows and Linux

32. MongoDB

MongoDB, an extremely popular NoSQL database, has been downloaded more than 10 million times. MongoDB.com offers an enterprise edition, support, training, and related products and services.

Supported operating systems: Windows, Linux, OS X, and Solaris

33. Neo4j

Neo4j calls itself “the fastest and most scalable native graph database” and promises massive scalability, fast Cypher query performance, and improved developer productivity. Users include eBay, Pitney Bowes, Walmart, Lufthansa and CrunchBase.

Supported operating systems: Windows and Linux

34. OrientDB

This multi-model database combines some features of a graph database with some features of a document database. Paid support, training and consulting services are available.

Supported operating systems: OS independent

Relevant link: www.orientdb.org/index.htm

35. Pivotal Greenplum Database

Pivotal claims Greenplum is “the best enterprise-level analytics database in its class,” capable of running powerful analytics on huge volumes of data very quickly. It is part of Pivotal’s big data suite.

Supported operating systems: Windows, Linux, and OS X

Relevant link: pivotal.io/big-data/pi…

36. Riak

Riak is “fully functional” and comes in two versions: KV is a distributed NoSQL database and S2 provides object storage for the cloud environment. It is available in both open source and commercial versions, as well as add-ons that support Spark, Redis and Solr.

Supported operating systems: Linux and OS X

37. Redis

Now sponsored by Pivotal, Redis is a key-value cache and store. Paid support is available. One caveat: while the project doesn’t officially support Windows, Microsoft maintains a Windows port on GitHub.

Supported operating systems: Linux

Business intelligence

38. Talend Open Studio

Talend’s open source data integration software has been downloaded more than 2 million times. The company also develops paid tools for big data, the cloud, data integration, application integration and master data management. Its customers include organizations such as American International Group (AIG), Comcast, eBay, General Electric, Samsung, Ticketmaster and Verizon.

Supported operating systems: Windows, Linux, and OS X

39. Jaspersoft

Jaspersoft provides flexible, embeddable business intelligence tools to organizations including Groupon, Champion Technologies, USDA, Ericsson, Time Warner Cable, Olympic Steel, the University of Nebraska, and General Dynamics. In addition to the open source Community edition, it offers paid Reporting, Amazon Web Services (AWS), Professional, and Enterprise editions.

Supported operating systems: OS independent

40. Pentaho

Pentaho, which is owned by Hitachi Data Systems, offers a range of data integration and business analytics tools. Three community editions are available on the official website. Visit Pentaho.com for information on premium support.

Supported operating systems: Windows, Linux, and OS X

41. SpagoBI

Spago, described by market analysts as an “open source leader,” provides business intelligence, middleware and quality assurance software, in addition to a Java EE application development framework. The software is 100% free and open source, and paid support, consulting, training and other services are available.

Supported operating systems: OS independent

Relevant link: www.spagoworld.org/xwiki/bin/v…

42. KNIME

KNIME, short for Konstanz Information Miner, is an open source analytics and reporting platform. Several commercial and open source extensions are available to enhance its functionality.

Supported operating systems: Windows, Linux, and OS X

43. BIRT

BIRT stands for “Business Intelligence and Reporting Tools.” It provides a platform for creating visual elements and reports that can be embedded in applications and websites. It is part of the Eclipse community and is supported by Actuate, IBM, and Innovent Solutions.

Supported operating systems: OS independent

Data mining

44. DataMelt

As a successor to jHepWork, DataMelt can handle tasks such as mathematical operations, data mining, statistical analysis, and data visualization. It supports Java and related programming languages, including Jython, Groovy, JRuby, and Beanshell.

Supported operating systems: OS independent

45. KEEL

KEEL, which stands for “Knowledge Extraction based on Evolutionary Learning,” is a Java-based machine learning tool that provides algorithms for a range of big data tasks. It also helps evaluate how well algorithms handle regression, classification, clustering, pattern mining, and similar tasks.

Supported operating systems: OS independent

46. Orange

Orange holds that data mining should be “fruitful and fun,” whether you have many years of experience or are just starting out in the field. It provides visual programming and Python scripting tools for data visualization and analysis.

Supported operating systems: Windows, Linux, and OS X

47. RapidMiner

RapidMiner claims more than 250,000 users, including PayPal, Deloitte, eBay, Cisco and Volkswagen. It is available in a range of open source and paid versions, but be warned: the free open source version only supports data in CSV or Excel format.

Supported operating systems: OS independent

Relevant link: rapidminer.com

48. Rattle

Rattle stands for “R Analytical Tool To Learn Easily.” It provides a graphical interface to the R programming language that simplifies building statistical or visual summaries of data, building models, and performing data transformations.

Supported operating systems: Windows, Linux, and OS X

49. SPMF

SPMF now includes 93 algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining, and clustering. It can be used on its own or integrated into other Java-based programs.

Supported operating systems: OS independent

Related links: www.philippe-fournier-viger.com/spmf/

50. Weka

The Waikato Environment for Knowledge Analysis (Weka) is a set of Java-based machine learning algorithms for data mining. It can perform data preprocessing, classification, regression, clustering, association rule mining, and visualization.

Supported operating systems: Windows, Linux, and OS X

Related links: www.cs.waikato.ac.nz/~ml/weka/

Query engine

51. Drill

This Apache project enables users to query Hadoop, NoSQL databases, and cloud storage services using SQL-based queries. It can be used for data mining and ad hoc queries, and it supports a wide range of data sources including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, and Swift.

Supported operating systems: Windows, Linux, and OS X

Programming language

52. R

R resembles the S language and environment and is designed to handle statistical calculations and graphics. It includes an integrated set of big data tools for data processing, calculation, and visualization.

Supported operating systems: Windows, Linux, and OS X

53. ECL

Enterprise Control Language (ECL) is the language developers use to build big data applications on the HPCC platform. The official HPCC Systems website offers an integrated development environment (IDE), tutorials, and many related tools for working with the language.

Supported operating systems: Linux

Relevant link: hpccsystems.com/download/do…

Big data search

54. Lucene

Java-based Lucene can perform full-text searches very quickly. According to the official website, it can index over 150GB of data per hour on modern hardware, and it contains powerful and efficient search algorithms. Development is supported by the Apache Software Foundation.

Supported operating systems: OS independent
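The basic idea behind fast full-text search, an inverted index that maps each term to the documents containing it, can be sketched in Python as follows. This is a toy illustration, not Lucene's implementation or API; the function names and sample documents are invented, and it ignores ranking, stemming, and punctuation.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical documents keyed by id
docs = {
    1: "Hadoop stores big data",
    2: "Lucene searches text quickly",
    3: "Solr builds on Lucene for big data search",
}
index = build_index(docs)
print(search(index, "lucene"))
print(search(index, "big data"))
```

Building the index is the expensive step; once it exists, a query only touches the entries for its own terms rather than scanning every document.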

55. Solr

Solr, based on Apache Lucene, is a highly reliable, highly scalable enterprise search platform. Prominent users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg and Travelocity.

Supported operating systems: OS independent

9. In-memory technology

56. Ignite

This Apache project describes itself as “a high-performance, integrated, distributed in-memory platform for performing real-time computations and processing on large data sets, orders of magnitude faster than traditional disk-based or flash technologies.” The platform includes a data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, a file system, messaging, events, and data structures.

Supported operating systems: OS independent

Relevant link: ignite.incubator.apache.org

57. Terracotta

Terracotta claims its BigMemory technology is “the number one in-memory data management platform in the world,” and says 2.1 million developers and 250 enterprise organizations have deployed its software. The company also offers commercial versions of the software, in addition to support, consulting and training.

Supported operating systems: OS independent

58. Pivotal GemFire/Geode

Earlier this year, Pivotal announced that it would open up the source code for key components of its big data suite, including GemFire’s in-memory NoSQL database. It has submitted a proposal to the Apache Software Foundation to manage the core engine of the GemFire database under the name “Geode”. A commercial version of the software is also available.

Supported operating systems: Windows and Linux

Relevant link: pivotal.io/big-data/pi…

59. GridGain

GridGain, powered by Apache Ignite, offers in-memory data structures for quickly processing big data, as well as Hadoop accelerators based on the same technology. It is available in both a paid enterprise version and a free community version, which includes free basic support.

Supported operating systems: Windows, Linux, and OS X

60. Infinispan

A Red Hat JBoss project, the Java-based Infinispan is a distributed in-memory data grid. It can be used as a cache, as a high-performance NoSQL database, or to add clustering capabilities to many frameworks.

Supported operating systems: OS independent

Relevant link: www.jboss.org/infinispan….
