Tags: Bokeh, Data Science, Keras, Matplotlib, NLTK, numpy, Pandas, Plotly, Python, PyTorch, scikit-learn, SciPy, Seaborn, TensorFlow, XGBoost

Original author: ActiveWizards

Original article: Top 20 Python Libraries for Data Science in 2018

Welcome to my simple book: The Silly Bird’s Translation Py Nonsense

Python remains the leading language for data science this year. Last year, we posted a list of the top 15 Python data science libraries on our blog, and it was well received. This time, we’ll look at the top Python data science libraries that were updated this year, and we’ve also added some new ones to the list.

In fact, there are more than 20 libraries on this list, mainly because there are multiple libraries in some areas that solve the same problem, and it’s not clear which one will be the leader, so we’ve grouped them together for you to choose from.

Core libraries and statistics

1. NumPy (Commits: 17911, Contributors: 641)

As usual, this list starts with libraries for scientific applications, and NumPy is the first choice in this category. It provides tools for handling large multidimensional arrays and matrices, along with a large collection of high-level mathematical functions and methods for operating on them.

NumPy has made a lot of improvements this year. In addition to bug fixes and compatibility enhancements, the key improvement is new styling options for the print format of NumPy objects. In addition, some functions can now handle files in any encoding that Python supports.
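To make the array handling and print formatting described above concrete, here is a minimal sketch. The array values and variable names are made up for illustration, and `np.printoptions` assumes NumPy 1.15 or later.

```python
import numpy as np

# Build a 2-D array and apply vectorized math without explicit loops.
a = np.arange(6, dtype=float).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]
col_means = a.mean(axis=0)                   # mean of each column
scaled = a * 2.0 + 1.0                       # elementwise; scalars are broadcast

# Temporarily change how arrays are printed (illustrative settings).
with np.printoptions(precision=2, suppress=True):
    text = str(np.array([1.23456, 2.0]))

print(col_means)
print(scaled)
print(text)
```

The context manager restores the previous print settings on exit, so the formatting change stays local to the block.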

2. SciPy (Commits: 19150, Contributors: 608)

Another core library for scientific computing is SciPy. SciPy is built on NumPy and extends its functionality. Its core data structure is the multidimensional array implemented by NumPy. The library contains tools for tasks such as linear algebra, probability theory, integral calculus, and more.

Major improvements in SciPy include continuous integration across different operating systems, new functions and methods, and, notably, updated optimizers. In addition, a number of new BLAS[^1] and LAPACK[^2] functions have been wrapped.
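As a small illustration of the linear algebra and integration tools mentioned above, the sketch below solves a made-up 2x2 linear system and integrates the sine function numerically. The matrix and vector values are illustrative only.

```python
import numpy as np
from scipy import integrate, linalg

# Solve the linear system A @ x = b (3x + y = 9, x + 2y = 8).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Numerically integrate sin(t) from 0 to pi; the exact value is 2.
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

print(x)
print(area)
```

`linalg.solve` works on ordinary NumPy arrays, which reflects how SciPy reuses NumPy's array as its core data structure.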

[^1]: BLAS stands for Basic Linear Algebra Subprograms. It is a set of API specifications for basic operations such as vector and matrix multiplication.

[^2]: LAPACK stands for Linear Algebra Package. It is built on top of BLAS and defines many higher-level matrix and vector operations, such as matrix factorization, inversion, and singular value decomposition.

3. Pandas (Commits: 17144, Contributors: 1165)

Pandas provides high-level data structures and a wide range of analysis tools. The best thing about this library is that you can express complex data analysis in one or two commands. Pandas provides fast built-in methods for grouping, filtering, and merging data, as well as time-series functionality.

Pandas has made hundreds of changes this year, including new features, bug fixes, and API improvements. These changes focus on improving Pandas’ ability to group and sort data, producing more suitable output for the apply method, and supporting operations on custom types.
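The grouping, filtering, and merging operations mentioned above each take roughly one line. The following sketch uses two small made-up tables (`sales` and `prices` are hypothetical names and values) to show them together.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "item": ["pen", "ink", "pen", "ink"],
    "units": [10, 4, 7, 12],
})
prices = pd.DataFrame({"item": ["pen", "ink"], "price": [1.5, 3.0]})

# Merge: attach the price of each item to every sales row.
merged = sales.merge(prices, on="item")
merged["revenue"] = merged["units"] * merged["price"]

# Filter: keep only rows with more than 5 units sold.
big = merged[merged["units"] > 5]

# Group: total revenue per store.
revenue_by_store = merged.groupby("store")["revenue"].sum()

print(big)
print(revenue_by_store)
```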

4. StatsModels (Commits: 10067, Contributors: 153)

Statsmodels is a Python module for statistical analysis, such as estimating statistical models, performing statistical tests, and so on. With Statsmodels, you can implement many machine learning methods and explore different plotting possibilities.

Statsmodels is still under active development and continues to gain new features. This year, Statsmodels introduced time series improvements and new count models, such as generalized Poisson, zero-inflated models, and a negative binomial model. It also provides some new multivariate methods, such as factor analysis, multivariate analysis of variance (MANOVA), and repeated-measures ANOVA.

Visualization

5. Matplotlib (Commits: 25747, Contributors: 725)

Matplotlib is a low-level library for creating two-dimensional plots and graphs. With it, you can build histograms, scatter plots, and non-Cartesian coordinate graphs, and many popular plotting libraries are designed to work together with Matplotlib.

This year, Matplotlib improved its styling of colors, fonts, sizes, legends, and more. Appearance improvements include automatic alignment of axes legends and better colors, including a new colorblind-friendly color cycle.
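As a rough sketch of the histogram and scatter plot types mentioned above, the example below draws both side by side from synthetic data. The data, figure size, and the choice of the non-interactive `Agg` backend (so the script runs without a display) are assumptions made for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no display required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_hist.hist(x, bins=20)
ax_hist.set_title("Histogram")
ax_scatter.scatter(x, y, s=10, label="samples")
ax_scatter.set_title("Scatter plot")
ax_scatter.legend()

# Render to an in-memory PNG; pass a file name instead to write to disk.
buf = io.BytesIO()
fig.savefig(buf, format="png")
```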

6. Seaborn (Commits: 2044, Contributors: 83)

Seaborn is a high-level API built on Matplotlib that provides more suitable default options than Matplotlib, along with a rich gallery of visualizations, including complex types such as time series, joint plots, and violin plots.

In the first half of 2018, Seaborn’s updates were mostly bug fixes. They also improved the compatibility of FacetGrid and PairGrid with enhanced interactive Matplotlib backends, and added parameters and options to visualizations.

7. Plotly (Commits: 2906, Contributors: 48)

Plotly makes it easy to build sophisticated graphics. The library is designed for use in interactive web applications, and its visualizations include contour plots, ternary plots, and 3D charts.

This year, Plotly continued to add new graphics and features, introducing support for “multiple linked views” as well as animation and crosstalk integration.

8. Bokeh (Commits: 16983, Contributors: 294)

Bokeh uses JavaScript widgets to create interactive, scalable visualizations that can be viewed in a browser. Bokeh provides a versatile collection of graphs, styling options, and interaction capabilities such as linking plots, adding widgets, and defining callbacks.

Bokeh’s improved interactive features this year are notable, such as rotated categorical tick labels, a small zoom tool, and enhanced custom tooltip fields.

9. Pydot (Commits: 169, Contributors: 12)

Pydot is a library for generating complex directed and undirected graphs. It is an interface to Graphviz, written in pure Python. With Pydot, you can show the structure of graphs, which is often needed when building neural networks and decision-tree-based algorithms.

Machine learning

10. scikit-learn (Commits: 22753, Contributors: 1084)

scikit-learn, a Python module built on NumPy and SciPy, is one of the best libraries for working with data. It provides algorithms for many standard machine learning and data mining tasks, such as classification, clustering, regression, dimensionality reduction, and model selection.

In the first half of 2018, scikit-learn made many improvements. Cross-validation was updated so that it can use more than one metric, and several training methods, such as nearest neighbors and logistic regression, received minor improvements. Finally, a glossary of common terms and API elements is now available, which makes scikit-learn’s terminology and conventions easier to understand.
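Cross-validation with more than one metric, as described above, can be sketched with `cross_validate` and a list of scorer names. The choice of the iris data set, the logistic regression model, and the two metrics are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Evaluate two metrics in a single 5-fold cross-validation run.
scores = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1_macro"])

# Results are stored under "test_<metric name>", one value per fold.
mean_accuracy = scores["test_accuracy"].mean()
mean_f1 = scores["test_f1_macro"].mean()

print(mean_accuracy, mean_f1)
```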

11. XGBoost / LightGBM / CatBoost (Commits: 3277/1083/1509, Contributors: 280/79/61)

Gradient boosting is one of the most popular machine learning algorithms. It builds an ensemble of successively refined elementary models, typically decision trees. As a result, there are specialized libraries designed for a fast and convenient implementation of this method. We think XGBoost, LightGBM, and CatBoost deserve special attention. All three are competitors that solve the same problem and are used in much the same way. These libraries offer highly optimized, scalable, and fast implementations of gradient boosting, which has made them very popular in the data science community. Moreover, because these algorithms have contributed to many winning Kaggle solutions, many competitors now reach for these libraries.
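To show what "successively refined" models means, here is a deliberately simplified gradient boosting loop for squared-error regression. It is not how XGBoost, LightGBM, or CatBoost are implemented; it uses scikit-learn's `DecisionTreeRegressor` as the base model, and the synthetic data, learning rate, tree depth, and number of rounds are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(50):
    # For squared error, the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    # Each new tree nudges the ensemble toward the remaining error.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

The dedicated libraries add regularization, efficient split finding, and many other optimizations on top of this basic idea.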

12. Eli5 (Commits: 922, Contributors: 6)

Machine learning models often produce predictions that are not entirely clear or easy to interpret, and Eli5 was developed to address this pain point. Eli5 is a package for visualizing and debugging machine learning models and tracking the work of an algorithm step by step. It supports scikit-learn, XGBoost, LightGBM, lightning, and sklearn-crfsuite, and performs different tasks for each of these libraries.

Deep learning

13. TensorFlow (Commits: 33339, Contributors: 1469)

TensorFlow is a very popular deep learning and machine learning framework developed by Google Brain. It lets you work with artificial neural networks across multiple data sets. TensorFlow is widely used in applications such as object recognition and speech recognition. There are also many high-level helper libraries built on top of TensorFlow, such as TFLearn, TF-Slim, and skflow.

TensorFlow continues to iterate rapidly, with a number of updates and new features released this year. For example, the latest versions fix potential security vulnerabilities and improve TensorFlow’s GPU integration, so that an Estimator model can run on multiple GPUs on a single machine.

14. PyTorch (Commits: 11306, Contributors: 635)

PyTorch is a large framework that lets you run tensor computations with GPU acceleration, build dynamic computational graphs, and compute gradients automatically. PyTorch also provides a rich API for neural network applications.

PyTorch is based on Torch, an open source deep learning library, and its Python API was introduced in 2017. Since then, PyTorch has become very popular, attracting a growing number of data scientists.

15. Keras (Commits: 4539, Contributors: 671)

Keras is a high-level neural network library that runs on top of TensorFlow and Theano, and the latest versions can also use CNTK and MXNet as backends. Keras simplifies many tasks and greatly reduces the amount of code you have to write. However, it is not well suited to some complex tasks.

This year, Keras improved its performance, usability, documentation, and APIs, adding new features such as a Conv3DTranspose layer, a new MobileNet application, and self-normalizing networks.

Distributed deep learning

16. dist-keras / elephas / spark-deep-learning (Commits: 1125/170/67, Contributors: 5/13/11)

More and more use cases require enormous amounts of time and resources, so deep learning on large-scale data has become a pressing problem. However, as distributed computing systems such as Apache Spark expand into deep learning, processing such large amounts of data becomes easier. As a result, libraries such as dist-keras, elephas, and spark-deep-learning have developed quickly. These libraries all aim to solve the same problem, and it is not yet clear which one will stand out. They make it possible to train Keras neural networks directly on Apache Spark, and spark-deep-learning also provides tools for building pipelines with Python neural networks.

Natural language processing

17. NLTK (Commits: 13041, Contributors: 236)

NLTK is a platform for natural language processing. You can use NLTK to process and analyze text in many ways: tokenizing, tagging, extracting information, and so on. You can also use NLTK to build prototypes and research systems.

NLTK made only minor improvements this year, mainly to its APIs and compatibility, along with a new CoreNLP interface.

18. spaCy (Commits: 8623, Contributors: 215)

spaCy is a natural language processing library with excellent examples, API documentation, and demo applications. It is written in Cython, supports more than 30 languages, integrates easily with deep learning, and promises robustness and high accuracy.

Another very useful feature of spaCy is that its architecture is designed to process entire documents without breaking them into sections.

19. Gensim (Commits: 3603, Contributors: 273)

Gensim is a Python library built on NumPy and SciPy for semantic analysis, topic modeling, and vector space modeling. It provides implementations of popular NLP algorithms such as word2vec. Although Gensim has its own models.wrappers.fasttext implementation, the fastText library can also be used for efficient learning of word representations.

Data collection

20. Scrapy (Commits: 6625, Contributors: 281)

Scrapy is a Python library for building spiders, bots that crawl web pages and collect structured data. Scrapy can also extract data from an API. It is extensible and portable, which makes it very handy to use.

This year saw a number of Scrapy upgrades, including improvements to proxy server support, better error notification and problem identification, and new options for metadata settings using scrapy parse.

Conclusion

That’s our list of Python data science libraries for the first half of 2018. Compared with last year, the classic data science libraries are still being improved and optimized, and at the same time, many new data science libraries have emerged in this field.

Finally, I present the GitHub activity list for these libraries.

**ActiveWizards** is a team of data scientists and data engineers focused on big data, data science, machine learning, data visualization, and more. Our core areas of expertise include data science (data research, machine learning algorithms, and data development), data visualization (D3.js, Tableau, etc.), big data development (Hadoop, Spark, Kafka, Cassandra, HBase, MongoDB, etc.), and data-intensive web application development (RESTful API, Flask, Django, Meteor, etc.).