Machine learning has become a popular field. After more than 20 years of development, it now has a very wide range of applications, such as data mining, computer vision, natural language processing, biometric recognition, search engines, medical diagnosis, DNA sequencing, speech and handwriting recognition, strategy games, and robotics.





1. TensorFlow





TensorFlow is the second-generation machine learning system released by Google. According to Google, in some benchmarks TensorFlow was up to twice as fast as its first-generation predecessor, DistBelief.


Specifically, TensorFlow is an open source software library for numerical computation using data flow graphs: nodes in the graph represent mathematical operations, while edges represent the multi-dimensional arrays (tensors) that flow between them. This flexible architecture lets users deploy computation to one or more CPUs or GPUs on desktops, servers, or mobile devices without rewriting code. Any gradient-based machine learning algorithm can also take advantage of TensorFlow's automatic differentiation, and the flexible Python interface makes it easy to express ideas in TensorFlow.
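
A minimal sketch of this dataflow model, written against the graph-style API TensorFlow shipped at the time (sessions and placeholders; newer releases default to eager execution):

import tensorflow as tf

# Nodes are operations; the edges between them carry tensors.
a = tf.placeholder(tf.float32, name="a")
b = tf.placeholder(tf.float32, name="b")
c = a * b  # a multiplication node in the graph

# A session executes the graph on whatever devices are available.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 3.0, b: 4.0}))  # 12.0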


TensorFlow was originally developed by researchers and engineers on the Google Brain team (part of Google's Machine Intelligence research organization) for machine learning and deep neural network research. The system is general enough, however, to be widely applicable in other computing fields.


Google already uses TensorFlow internally: it powers speech recognition in the Google app, automatic replies in Gmail, and image search in Google Photos.


Development language: C++

License: Apache License 2.0

GitHub project address: github.com/tensorflow/…


2. Scikit-Learn


Scikit-learn is a Python module for machine learning built on top of SciPy. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and many volunteers have contributed to it since.


Main features:


Simple and efficient tools for data mining and data analysis;

Accessible to everybody, and reusable in various contexts;

Built on NumPy, SciPy, and Matplotlib.


The basic functionality of scikit-learn is divided into six parts: classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing; see the documentation on the official website for details. Scikit-learn is tested on Python 2.6, Python 2.7, and Python 3.5, and should also run on Python 3.3 and Python 3.4.
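
As a quick illustration of the shared fit/predict workflow, here is a hedged sketch using the current module layout (older releases kept train_test_split under sklearn.cross_validation):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification is one of the six basic functions; every estimator
# exposes the same fit/predict/score interface.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out data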


Note: scikit-learn was formerly known as scikits.learn.


Development language: Python

License: 3-clause BSD

GitHub project address: github.com/scikit-lear…


3. Caffe


Caffe is a deep learning framework designed with expression, speed, and modularity in mind. Developed by the Berkeley Vision and Learning Center (BVLC) and community contributors, it has grown into a Berkeley-led, loosely organized community on GitHub and the caffe-users mailing list.


Caffe is a C++/CUDA framework that lets developers freely compose networks; it currently supports convolutional neural networks and fully connected (artificial) neural networks. On Linux it offers a C++ command-line interface, along with dedicated MATLAB and Python bindings, and computation can be switched directly between CPU and GPU.


Caffe’s characteristics:


Ease of use: models and the corresponding optimization settings are defined in plain text rather than code, and Caffe ships model definitions, solver settings, and pre-trained weights for easy, quick starts;

Fast: it can run state-of-the-art models on massive data; combined with cuDNN, Caffe can process an image in 1.17 ms on a K40 GPU when testing an AlexNet model;

Modularity: easy to extend to new tasks and settings; users can define their own models from Caffe's layer types.


In practice, working with Caffe mainly involves preparing data, designing the network structure, training, and running recognition directly on top of existing pre-trained models.
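
As a rough sketch of the Python interface (pycaffe), assuming a text-form model definition and pre-trained weights are already on disk; both file names below are placeholders:

import caffe

# Computation can be switched directly between CPU and GPU.
caffe.set_mode_cpu()  # or caffe.set_mode_gpu()

# Load a network from its text-form definition plus pre-trained weights.
net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)

# Run a forward pass over whatever is currently in the input blobs.
output = net.forward()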


Development language: C++

License: BSD 2-clause

GitHub project address: github.com/BVLC/caffe


4. PredictionIO


PredictionIO is an open source machine learning server for developers and data scientists. It supports event collection, algorithm scheduling, evaluation, and querying of prediction results via REST APIs. PredictionIO lets users build prediction features such as personalized recommendation and content discovery, and it ships with 20 preset algorithms that developers can run directly on their own data. Almost any application becomes "smarter" when integrated with PredictionIO. Its main characteristics are as follows:


Predicts user behavior based on existing data;

Lets users choose their own machine learning algorithms;

Scales well, so developers need not worry about scalability.


PredictionIO is built around a standard REST API (application programming interface), but it also provides SDKs (software development kits) for languages such as Ruby, Python, Scala, and Java. It is written in Scala, uses MongoDB as its database, and runs its computation on Hadoop.
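
A hedged sketch of pushing one event into the event server over the REST API, per the documentation of that era; the port, access key, and field values are all placeholders:

import requests

# One "user rates item" event for a local PredictionIO event server.
event = {
    "event": "rate",
    "entityType": "user",
    "entityId": "u0",
    "targetEntityType": "item",
    "targetEntityId": "i0",
    "properties": {"rating": 4},
}
resp = requests.post(
    "http://localhost:7070/events.json",
    params={"accessKey": "YOUR_ACCESS_KEY"},  # placeholder key
    json=event,
)
print(resp.status_code, resp.json())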


Development language: Scala

License: Apache License 2.0

GitHub project address: github.com/PredictionI…


5. Brain


Brain is a neural network library for JavaScript. The following example uses Brain to approximate the XOR function:

var net = new brain.NeuralNetwork();
net.train([{input: [0, 0], output: [0]},
           {input: [0, 1], output: [1]},
           {input: [1, 0], output: [1]},
           {input: [1, 1], output: [0]}]);
var output = net.run([1, 0]);  // [0.987]


To use Brain with Node.js, install it via npm:


npm install brain


To use Brain in a browser, download the latest brain.js file. Training is computationally expensive, so train the network offline (or in a Worker) and use the toFunction() or toJSON() options to plug the pre-trained network into your site.


Development language: JavaScript

GitHub project address: github.com/harthur/bra…


6. Keras


Keras is a minimalist, highly modular neural network library that runs on top of either TensorFlow or Theano and supports both GPU and CPU computation. Keras can be thought of as a Python counterpart to Torch7, making it very convenient to rapidly build CNN models; it also includes algorithms from recent literature, such as Batch Normalization. The documentation and tutorials are complete, with examples given directly on the official website that are easy to follow, and Keras supports saving trained parameters and loading them later to resume training.
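
A minimal sketch of that rapid-construction style, assuming a made-up 20-feature binary classification task (layer sizes and hyperparameters are illustrative only):

from keras.models import Sequential
from keras.layers import Dense

# Stack fully connected layers one line at a time.
model = Sequential()
model.add(Dense(64, activation="relu", input_dim=20))
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="sgd", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train) would then train on your own data.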


Keras focuses on rapid experimentation, going from idea to result with the least possible delay, which is key to doing good research.


Consider using Keras when you need a deep learning library that:


Allows simple and rapid prototyping (through modularity, minimalism, and extensibility);

Supports both convolutional networks and recurrent networks, as well as combinations of the two;

Supports arbitrary connection schemes (including multi-input and multi-output training);

Runs seamlessly on CPU and GPU.


Keras currently supports Python 2.7-3.5.


Development language: Python

GitHub project address: github.com/fchollet/ke…


7. CNTK


CNTK (Computational Network Toolkit) is a unified deep learning toolkit that describes a neural network as a series of computational steps in a directed graph. Leaf nodes represent input values or network parameters, while the other nodes represent matrix operations applied to their inputs.


CNTK makes it easy to implement and combine popular model types such as feed-forward deep neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs/LSTMs). It also implements automatic differentiation and parallelization of stochastic gradient descent (SGD) learning across multiple GPUs and servers.


The figure below compares CNTK's processing speed (frames per second) to that of four other well-known toolkits. The configuration is a fully connected four-layer neural network (see the benchmark scripts) with an effective mini-batch size of 8192. The results were obtained on the same hardware with the latest public version of each toolkit as of December 3, 2015:





CNTK has been open source since April 2015.


Development language: C++

GitHub project address: github.com/Microsoft/C…


8. Convnetjs


ConvNetJS is a neural network implementation in JavaScript with very good browser-based demos. Its most important use is to help deep learning beginners understand the algorithms faster and more intuitively.


It currently supports:


Common neural network modules (fully connected layers, nonlinearities);

Cost functions for classification (SVM/softmax) and regression (L2);

Specifying and training convolutional networks for image processing;

An experimental reinforcement learning module based on Deep Q-Learning.


Some online examples:


Convolutional Neural Network on MNIST digits

Convolutional Neural Network on CIFAR-10

Toy 2D data

Toy 1D regression

Training an Autoencoder on MNIST digits

Reinforcement Learning demo + Image Regression ("Painting") + comparison of SGD/AdaGrad/AdaDelta on MNIST


Other:


Development language: JavaScript

License: MIT

GitHub project address: github.com/karpathy/co…


9. Pattern





Pattern is a web mining module for Python, with the following tools:


Data mining: web services (Google, Twitter, Wikipedia), a web crawler, an HTML DOM parser;

Natural language processing: part-of-speech tagging, n-gram search, sentiment analysis, WordNet;

Machine learning: vector space model, clustering, classification (KNN, SVM, perceptron);

Network analysis: graph centrality and visualization.
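
A small sketch of the NLP side, using the pattern.en submodule (outputs abbreviated):

from pattern.en import parse, sentiment

# Part-of-speech tagging: returns a tagged, chunked string.
print(parse("The quick brown fox jumped over the lazy dog."))

# Sentiment analysis: returns a (polarity, subjectivity) pair.
print(sentiment("A remarkably useful little library."))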


It is well documented and currently includes 50+ examples and 350+ unit tests. Pattern only supports Python 2.5+ for now (not Python 3 yet). The module has no external dependencies, except for NumPy when using LSA in the pattern.vector module (installed by default only on Mac OS X).


Development language: Python

License: BSD

GitHub project address: github.com/clips/patte…


10. NuPIC





NuPIC is a machine intelligence platform implementing the HTM (Hierarchical Temporal Memory) learning algorithms. HTM is a detailed artificial intelligence algorithm modeled on the neocortex. At its core is a time-based continuous learning algorithm that can store and recall both temporal and spatial patterns. NuPIC is suited to a variety of problems, particularly anomaly detection and prediction over streaming data sources.


NuPIC binaries are currently available for:


Linux x86 64bit

OS X 10.9

OS X 10.10

Windows 64bit


NuPIC is unique in its own way: where many machine learning algorithms fail to adapt to new patterns, NuPIC works more like a human brain, forgetting old patterns and remembering new ones as the patterns change.


Development language: Python

GitHub project address: github.com/numenta/nup…


11. Theano


Theano is a Python library that lets users efficiently define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays, with support for GPUs and efficient symbolic differentiation. Theano has the following features:

Tight integration with NumPy – numpy.ndarray is used in Theano-compiled functions;

Transparent use of GPUs – data-intensive computations can run up to 140x faster than on a CPU (for float32);

Efficient symbolic differentiation – Theano computes derivatives for functions with one or many inputs (see the sketch after this list);

Speed and stability optimizations – getting the right answer for log(1+x) even when x is very small;

Dynamic C code generation – expressions are evaluated faster;

Extensive unit testing and self-verification – many types of errors are detected and diagnosed.
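
A minimal sketch of the symbolic workflow these features describe: define an expression, ask Theano for its derivative, then compile both into a callable function:

import theano
import theano.tensor as T

# Build a symbolic expression and differentiate it symbolically.
x = T.dscalar("x")
y = x ** 2
dy = T.grad(y, x)

# Compile the expressions into a function (optimized C code under the hood).
f = theano.function([x], [y, dy])
print(f(3.0))  # [array(9.0), array(6.0)]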


Since 2007, Theano has powered large-scale, computationally intensive scientific work, but it is also widely used in the classroom (such as the deep learning/machine learning courses at the University of Montreal).


Development language: Python

GitHub project address: github.com/Theano/Thea…


12. MXNet






MXNet is an efficient and flexible deep learning framework. It lets users mix symbolic programming with imperative programming to maximize both efficiency and productivity. At its core is a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly, while a graph-optimization layer on top makes symbolic execution faster and more memory-efficient. The library is lightweight and portable, and scales to multiple GPUs and multiple machines.
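
A small sketch of the two styles side by side (shapes are illustrative):

import mxnet as mx

# Imperative style: NDArray operations execute immediately, NumPy-like.
a = mx.nd.ones((2, 3))
b = a * 2 + 1
print(b.asnumpy())

# Symbolic style: declare a graph first, then bind data and execute it.
x = mx.sym.Variable("x")
y = x * 2 + 1
executor = y.bind(mx.cpu(), {"x": mx.nd.ones((2, 3))})
print(executor.forward()[0].asnumpy())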


Main features:

The design notes provide useful insights that can be reapplied to other DL projects;

Flexible configuration of arbitrary computation graphs;

A mix of programming styles that combines their advantages to maximize flexibility and efficiency;

Lightweight and memory-efficient, with support for portable smart devices;

Scales up to multi-GPU and distributed setups with automatic parallelization;

Supports Python, R, C++, and Julia;

Cloud-friendly, with direct compatibility with S3, HDFS, and Azure.


MXNet is more than just a deep learning project: it is a collection of blueprints and guidelines for building deep learning systems, along with the hackers' unique insights behind them.


Development language: C++/Python

License: Apache License 2.0

GitHub project address: github.com/dmlc/mxnet


13. Vowpal Wabbit


Vowpal Wabbit is a machine learning system that pushes the frontier of machine learning techniques such as online learning, hashing, allreduce, and Learning2Search. Its training speed is very fast: with 2 billion training samples, each having about 100 non-zero features, training takes 20 minutes when the total number of features is 10,000, and 2 hours when the total is 10 million. Vowpal Wabbit supports classification, regression, matrix factorization, and LDA.


When running Vowpal Wabbit on Hadoop, there are the following optimizations:


Lazy initialization: all data can be loaded into memory and cached before AllReduce runs. Even if one node fails, training can continue on another node using the failed node's data (retrieved from the cache);

Speculative execution: in a large cluster, one or two slow mappers can hold up an entire job. With speculative execution, once most nodes have finished their tasks, Hadoop copies the remaining tasks to idle nodes to finish them.


Other:


Development language: C++

GitHub project address: github.com/JohnLangfor…


14. Ruby Warrior


Ruby Warrior uses a game to make learning Ruby and artificial intelligence more fun and interactive.


The player is a warrior climbing a tall tower to reach the precious ruby at the top. On each floor, you write a Ruby script to direct the warrior to defeat enemies, rescue captives, and reach the stairs. You have some knowledge of each floor, but you never know exactly what will happen on it. You must give the warrior enough artificial intelligence to figure out how to respond on his own.


Warrior action API:


warrior.walk!: moves the warrior, forward by default;

warrior.feel: senses the square in front of the warrior, e.g. whether it is empty or holds an enemy;

warrior.attack!: makes the warrior attack an enemy;

warrior.health: returns the warrior's current health;

warrior.rest!: has the warrior rest for a turn, regaining 10% of maximum health.


Warrior awareness API:


space.empty?: senses whether the space ahead is empty;

space.stairs?: senses whether the space ahead is the stairs;

space.enemy?: senses whether there is an enemy ahead;

space.captive?: senses whether there is a captive ahead;

space.wall?: senses whether the space ahead is a wall.


Other:


Development language: Ruby

GitHub project address: github.com/ryanb/ruby-…


15. XGBoost


XGBoost is an optimized distributed gradient boosting library designed for high efficiency, flexibility, and portability. It implements machine learning algorithms under the gradient boosting framework.


XGBoost solves many data science problems quickly and accurately by providing parallel tree boosting (also known as GBDT or GBM). The same code runs on major distributed environments such as Hadoop, SGE, and MPI. It is similar in spirit to other gradient boosting frameworks but more efficient, combining a linear model solver with tree learning algorithms.


XGBoost is at least 10 times faster than existing gradient boosting implementations and supports multiple objective functions, including regression, classification, and ranking. Because of its predictive performance it is ideal for many competitions, with the added ability to do cross-validation and to discover the key variables.


It is worth noting that XGBoost only works with numeric vectors, so all other forms of data must be converted to numeric vectors first; there are also many parameters to tune when optimizing a model with this algorithm.
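
A minimal sketch with the Python package, using synthetic numeric data in place of a real feature matrix:

import numpy as np
import xgboost as xgb

# XGBoost consumes numeric matrices, so features must be encoded numerically.
X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50)

# Predictions come back as probabilities for the logistic objective.
preds = bst.predict(xgb.DMatrix(X))
print(preds[:5])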


Development language: C++

License: Apache License 2.0

GitHub project address: github.com/dmlc/xgboos…


16. GoLearn


GoLearn is a "full-featured" machine learning library for Go, with simplicity and customizability as its development goals.


Once GoLearn is installed, data is loaded as instances; you can then manipulate matrices on them and pass values to estimators. GoLearn implements scikit-learn's Fit/Predict interface, so users can easily swap estimators in and out by trial and error. GoLearn also includes helper functions for data, such as cross-validation and train/test splitting.


Development language: Go

GitHub project address: github.com/sjwhitworth…


17. ML_for_Hackers


ML_for_Hackers is the code repository for the book Machine Learning for Hackers (2012). It contains code examples for all of the book's case studies. The code may not exactly match what appears in the book, as additional comments and modifications may have been added since publication.


All the code is in R and relies on numerous R packages. Topics covered include classification, ranking, and regression, along with statistical methods such as principal component analysis (PCA) and multi-dimensional scaling.


Development language: R

License: Simplified BSD

GitHub project address: github.com/johnmyleswh…


18. H2O-2


H2O enables Hadoop to do math! It brings statistics, machine learning, and mathematics to big data. H2O is extensible, and users can build blocks using simple math legos in the core. It keeps familiar interfaces like R, Excel, and JSON, so that big data enthusiasts and experts alike can explore, transform, model, and score datasets using a range of algorithms from simple to advanced. Gathering data is easy; making judgments from it is hard, and H2O makes it faster and easier to derive insights from data through faster and better-optimized predictive models.


0xdata's H2O algorithms are oriented toward business processes, such as fraud or trend prediction. Hadoop specialists can use Java to interact with H2O, but the framework also provides bindings for Python, R, and Scala.


Development language: Java

GitHub project address: github.com/h2oai/h2o-2


19. neon


neon is Nervana's Python-based deep learning framework, delivering high performance on common deep neural networks such as AlexNet, VGG, and GoogLeNet. The developers designed neon with the following features in mind:


Support for commonly used models and examples such as ConvNets, MLPs, RNNs, LSTMs, and autoencoders; many pre-trained implementations can be found in the model zoo;

Tight integration with the nervanagpu kernels for fp16 and fp32 (benchmarks) on Maxwell GPUs;

3 s per macrobatch (3072 images) for AlexNet on a Titan X (a full run on 1 GPU takes about 32 hrs);

A fast image-captioning model (up to 200x faster than NeuralTalk on CPU);

Support for basic automatic differentiation;

Framework visualization;

Swappable hardware backends: write code once and deploy it to CPUs, GPUs, or Nervana hardware.


Within Nervana, neon has been used to solve customer problems across multiple domains.


Development language: Python

License: Apache License 2.0

GitHub project address: github.com/NervanaSyst…


20. Oryx 2


The open source project Oryx provides simple, real-time infrastructure for large-scale machine learning and predictive analytics. It implements some of the algorithms commonly used in business applications: collaborative filtering/recommendation, classification/regression, and clustering. In addition, Oryx can build models on large data streams using Apache Hadoop, serve real-time queries against those models through an HTTP REST API, and approximately update models automatically as new data streams in. This two-part design, comprising a computation layer and a serving layer, constitutes an implementation of the Lambda architecture. Models are exchanged in PMML format.





Oryx essentially does two things, building models and serving models, which are the responsibilities of two separate components: the computation layer and the serving layer. The computation layer is an offline, batch process that builds machine learning models from input data, working in successive generations. The serving layer is a long-running Java server process that exposes a REST API; users can access it from a browser, or from any language or tool that can send HTTP requests.
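
A hedged sketch of querying the serving layer over HTTP; the /recommend endpoint here is from the recommender use case, and the host, port, and user ID are all placeholders:

import requests

# Ask the serving layer for recommendations for one user.
resp = requests.get("http://localhost:8080/recommend/user123")
resp.raise_for_status()
print(resp.text)  # recommended items with scores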


Oryx is not positioned as a library of machine learning algorithms; its author Sean Owen focuses on four things: regression, classification, clustering, and collaborative filtering (a.k.a. recommendation). Recommender systems are especially popular, and Owen has been working with several Cloudera customers to help them deploy recommender systems using Oryx.


Development language: Java

GitHub project address: github.com/cloudera/or…


21. Shogun


Shogun is a machine learning toolkit created by Soeren Sonnenburg and Gunnar Raetsch that focuses on large-scale kernel methods, in particular Support Vector Machines (SVMs). It provides a generic interface to SVM objects across several different SVM implementations, including the state-of-the-art LIBSVM and SVMlight, each of which can be combined with a variety of kernels. The toolkit not only implements common kernels such as the linear, polynomial, Gaussian, and sigmoid kernel functions efficiently, but also ships with some recent string kernels such as Locality Improved, Fischer, TOP, Spectrum, and Weighted Degree (with shifts). Efficient LINADD-optimized kernel implementations were added later.


In addition, Shogun offers the freedom to work with custom pre-computed kernels. One important feature is the combined kernel, which can be constructed from a weighted linear combination of multiple sub-kernels, each of which need not operate on the same domain; the optimal sub-kernel weights are learned via multiple kernel learning.


At present Shogun can solve two-class SVM classification and regression problems. It also includes a number of linear methods, such as linear discriminant analysis (LDA), linear programming machines (LPM), (kernel) perceptrons, and algorithms for training hidden Markov models.


Development language: C/C++, Python

License: GPLv3

GitHub project address: github.com/shogun-tool…


22. HLearn


HLearn is a high-performance machine learning library written in Haskell. It currently has the fastest nearest-neighbor implementation for spaces of arbitrary dimension.


HLearn is also a research project whose goal is to discover the "best possible" interface for machine learning. This involves two conflicting requirements: the library should be as fast as low-level libraries developed in C/C++/Fortran/Assembly, yet as flexible as high-level libraries developed in Python/R/Matlab. Julia has made amazing progress in this direction, but HLearn is more ambitious: notably, it aims to be faster than the low-level languages and more flexible than the high-level ones.


To achieve this goal, HLearn uses a completely different interface from the standard learning library. In HLearn, H stands for three different concepts that are fundamental to HLearn design:


H is for Haskell. Machine learning is about predicting functions from data, so a functional programming language is a natural fit. Functional languages are not widely used in machine learning, however, because they have generally lacked the fast numerical computation that learning algorithms need; HLearn gets its numerical performance from Haskell's SubHask library.

H also stands for Homomorphisms. Homomorphisms are a fundamental concept in abstract algebra, and HLearn exploits them throughout its learning system.

H also stands for the History Monad. One of the hardest parts of developing a new learning algorithm is debugging the optimization process. There used to be no way to reduce that debugging workload; the History Monad attempts to solve the problem by letting you inspect the entire optimization trace without modifying the original code, and with no additional runtime overhead.


Other:


Development language: Haskell

GitHub project address: github.com/mikeizbicki…


23. MLPNeuralNet


MLPNeuralNet is a fast multilayer perceptron neural network library for iOS and Mac OS X that can run predictions on new examples using a trained neural network. It takes advantage of vectorized operations and hardware acceleration (where available), and is built on top of Apple's Accelerate framework.





Suppose you have designed a prediction model in Matlab (or Python or R) and want to use it in an iOS application: that is what MLPNeuralNet is for. Note that MLPNeuralNet can only load and run models in forward-propagation mode. MLPNeuralNet has several features:


Classification, multi-class classification, and regression outputs;

Vectorized implementation;

Double precision;

Multiple hidden layers, or none (in which case it is equivalent to logistic/linear regression).


Other:


Development language: Objective-C

License: BSD

GitHub project address: github.com/nikolaypavl…


24. Apache Mahout


Mahout is an open source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms to help developers create smart applications more easily and quickly. Mahout includes many implementations spanning clustering, classification, recommendation filtering, and frequent sub-item mining. It can also scale effectively to the cloud via the Apache Hadoop libraries. The goal of the Apache Mahout project is to create an environment for rapidly building scalable, high-performance machine learning applications.


Although relatively young as an open source project, Mahout already provides a great deal of functionality, especially for clustering and CF. Mahout's key features include:


Taste CF: Taste is an open source CF project started by Sean Owen on SourceForge and donated to Mahout in 2008;

Several MapReduce-enabled clustering implementations, including k-means, fuzzy k-means, Canopy, Dirichlet, and mean-shift;

Distributed Naive Bayes and Complementary Naive Bayes implementations;

Distributed fitness functions for evolutionary programming;

Matrix and vector libraries.


Mahout also lets you classify content. It currently supports two approaches to content classification based on Bayesian statistics: the first is a simple MapReduce-enabled Naive Bayes classifier; the second is Complementary Naive Bayes, which tries to correct some of the problems of the Naive Bayes approach while still maintaining its simplicity and speed.


Development language: Java

License: Apache License 2.0

GitHub project address: github.com/apache/maho…


25. Seldon Server


Seldon is an open prediction platform that provides content recommendation and general-purpose prediction. It runs inside a Kubernetes cluster, so it can be deployed anywhere Kubernetes runs: on-premises or in the cloud (e.g., AWS, Google Cloud Platform, Azure), and it can scale to the demands of large enterprise installations.


Development language: Java

GitHub project address: github.com/SeldonIO/se…


26. Datumbox – Framework


The Datumbox machine learning framework is an open source framework written in Java that covers a large number of machine learning algorithms and statistical methods and is capable of handling large data sets.


The Datumbox API provides a vast array of classifiers and natural language processing services usable in a wide range of applications, including sentiment analysis, topic classification, language detection, subjectivity analysis, spam detection, readability assessment, and keyword and text extraction. Currently all of Datumbox's machine learning services are available through the API, and the framework lets users quickly develop their own smart applications. The GPL 3.0-licensed Datumbox machine learning framework is open source and can be downloaded from GitHub.


Datumbox's machine learning platform can power most ordinary smart applications. It has the following significant advantages:


Powerful and open source: the Datumbox API uses the powerful open source Datumbox machine learning framework, so its highly accurate algorithms can be used to rapidly build innovative applications;

Easy to use: the platform API is very easy to use, exposing all classifiers over REST and JSON;

Quick to deploy: Datumbox removes the time-consuming complexity of training machine learning models, so users can work with classifiers straight from the platform.


Datumbox is mainly applicable in four areas. The first is social media monitoring: it can evaluate user opinions via machine learning, helping users build their own social media monitoring tools. The second is search engine optimization, where locating and optimizing the important terms in a document is a very effective method. The third is quality assessment: in online communication, evaluating the quality of user-generated content is essential for filtering out spam, and Datumbox can automatically score and moderate such content. The last is text analysis: natural language processing and text analysis tools drive a large number of web applications, and the platform API can easily support such analysis.


Development language: Java

License: Apache License 2.0

GitHub project address: github.com/datumbox/da…


27. Jubatus


The Jubatus library is an online machine learning framework that runs in a distributed environment; in other words, an open source framework for processing big data streams. It is broadly similar to Storm, but offers more features, including the following:


An online machine learning library, covering classification, clustering, and recommendation;

fv_converter: data preprocessing (for natural language);

A fault-tolerant online machine learning framework.


Jubatus holds that future data analysis platforms must evolve in three directions at once: handling larger data, deeper analysis, and real-time processing. Jubatus combines the strengths of online machine learning, distributed computing, and randomized algorithms, and supports basic tasks such as classification, regression, and recommendation. In line with this design, Jubatus has the following features:


Scalable: supports scalable machine learning processing, with throughput of up to 100,000 data items per second on ordinary hardware clusters;

Real-time computation: analyzes data and updates models in real time;

Deep data analysis: supports various kinds of analysis and computation, such as classification, regression, statistics, and recommendation.


If you need machine learning over streaming data, Jubatus is worth watching.


Development language: C/C++

License: LGPL

GitHub project address: github.com/jubatus/jub…


28. Decider


Decider is another Ruby machine learning library, flexible and extensible. Decider has built-in support for plain text and URIs, stemming, stop-word removal, n-grams, and more, all of which can be easily combined through options. Decider supports any storage mechanism available in Ruby; if you like, you can persist to a database for distributed classification.


Decider ships with several benchmarks that double as integration tests. They are run periodically and used to pinpoint CPU and RAM bottlenecks. Decider does a great deal of math and is very computationally intensive, so it needs to be fast; Ruby 1.9 and JRuby are often used to test its computational speed. Also, your dataset should fit entirely in memory, or you will run into trouble.


Development language: Ruby

GitHub project address: github.com/danielsdele…


End.