Very useful Python libraries, each hotter than the last

Original address: dwz.cn/FBj1Ktxv

Translation link: dwz.cn/moEU7xzr

Python is a great language. It is one of the fastest-growing programming languages in the world and has proven its usefulness time and again in developer and data science roles across industries. The whole ecosystem of Python and its libraries makes it a suitable choice for users around the world, beginners and advanced alike. One reason for its success and popularity is its powerful collection of third-party libraries, which keep it vibrant and efficient.

In this article, we’ll look at some Python libraries for data science tasks other than the commonly used ones such as pandas, scikit-learn, and Matplotlib. Although libraries like pandas and scikit-learn are the default choices for machine learning tasks, it’s always good to know about other Python offerings in the field.

Wget

Extracting data from the web is one of the most important tasks for data scientists. Wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can work in the background even when the user is not logged in. So the next time you want to download all the images on a website or a page, Wget can help you.

Installation:

$ pip install wget

Example:

import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'

filename = wget.download(url)
100% [...]3841532 / 3841532

filename
'razorback.mp3'
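
For comparison, the same kind of one-call download can be sketched with the standard library alone. This is an illustration, not the wget package's own code: the helper names `filename_from_url` and `download` are chosen here for the example, and taking the last URL path segment as the file name is an assumption.

```python
import posixpath
from urllib.parse import urlparse
from urllib.request import urlretrieve


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name (assumption)."""
    return posixpath.basename(urlparse(url).path)


def download(url):
    """Fetch url into the current directory and return the local file name."""
    filename = filename_from_url(url)
    urlretrieve(url, filename)  # performs the actual network request
    return filename


url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
# Only the file name is derived here; call download(url) to fetch the file.
print(filename_from_url(url))
```

Calling `download(url)` needs network access, which is why the sketch only prints the derived file name.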

Pendulum

For those of you who get frustrated dealing with dates and times in Python, Pendulum is for you. It is a Python package that simplifies datetime manipulation and a drop-in replacement for Python’s native datetime class. Please refer to the documentation for further details.

Installation:

$ pip install pendulum

Example:

import pendulum

dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')

print(dt_vancouver.diff(dt_toronto).in_hours())

3
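
The 3-hour result can be cross-checked with the standard library alone. The sketch below uses fixed UTC offsets rather than real time zone rules, which is an assumption: it relies on both cities being on standard time on 1 January (UTC-5 for Toronto, UTC-8 for Vancouver).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets for standard time in January, not full tz rules.
TORONTO = timezone(timedelta(hours=-5))
VANCOUVER = timezone(timedelta(hours=-8))

dt_toronto = datetime(2012, 1, 1, tzinfo=TORONTO)
dt_vancouver = datetime(2012, 1, 1, tzinfo=VANCOUVER)

# Midnight in Vancouver happens later than midnight in Toronto.
hours = (dt_vancouver - dt_toronto).total_seconds() / 3600
print(hours)
```

Fixed offsets ignore daylight saving time, which is the kind of detail Pendulum takes care of when given a named time zone.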

imbalanced-learn

Most classification algorithms work best when each class has roughly the same number of samples, that is, when the data is balanced. Real-world data sets, however, are often imbalanced, which can strongly affect both the training phase and the subsequent predictions of machine learning algorithms. Fortunately, this library was created to address the problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Try it the next time you encounter an imbalanced data set.

Installation:

$ pip install -U imbalanced-learn

# or, with conda
$ conda install -c conda-forge imbalanced-learn

Example:

Please refer to the documentation for usage methods and examples.
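
To make the idea of rebalancing concrete, here is a minimal pure-Python sketch of random oversampling, one of the simplest rebalancing strategies. It is not imbalanced-learn's API: the function `random_oversample` and the toy data are hypothetical and only illustrate the principle of duplicating minority-class samples until the classes are the same size.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate random minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_res.append(rng.choice(pool))
            y_res.append(label)
    return X_res, y_res


# Toy data set (hypothetical): five samples of class 0, one of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, both classes have five samples. Plain duplication adds no new information, so it is only a starting point for handling imbalance.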

FlashText

In NLP tasks, cleaning text data often requires replacing or extracting keywords from sentences. Typically, this can be done with regular expressions, but that becomes cumbersome once the number of terms to search for runs into the thousands. Python’s FlashText module, based on the FlashText algorithm, provides a suitable alternative for this situation. The great thing about FlashText is that its running time stays nearly the same regardless of the number of search terms. You can read more here.

Installation:

$ pip install flashtext

Example:

Extracting keywords

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

# keyword_processor.add_keyword(<unclean name>, <standardised name>)

keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')

keywords_found
['New York', 'Bay Area']

Replacing keywords

keyword_processor.add_keyword('New Delhi', 'NCR region')

new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')

new_sentence
'I love New York and NCR region.'
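
Why the running time barely depends on the number of search terms can be seen from a small trie-based sketch. This is not FlashText's actual implementation: the functions `build_trie` and `extract_keywords` below are hypothetical, and the word-boundary handling is a simplifying assumption.

```python
END = "_end_"  # marker key; never collides with single-character keys


def build_trie(keywords):
    """Build a character trie; terminal nodes store the standardised name."""
    trie = {}
    for raw, clean in keywords.items():
        node = trie
        for ch in raw.lower():
            node = node.setdefault(ch, {})
        node[END] = clean
    return trie


def extract_keywords(sentence, trie):
    """Scan the sentence left to right, keeping the longest match per start."""
    text = sentence.lower()
    found, i, n = [], 0, len(text)
    while i < n:
        # Only start matching at the beginning of a word.
        if i > 0 and text[i - 1].isalnum():
            i += 1
            continue
        node, j, match, match_end = trie, i, None, i
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            # Accept a keyword only if it ends at a word boundary.
            if END in node and (j == n or not text[j].isalnum()):
                match, match_end = node[END], j
        if match is not None:
            found.append(match)
            i = match_end
        else:
            i += 1
    return found


trie = build_trie({'Big Apple': 'New York', 'Bay Area': 'Bay Area'})
print(extract_keywords('I love Big Apple and Bay Area.', trie))
```

Each starting position walks at most as many characters as the longest keyword, so adding more keywords makes the trie larger but does not add more passes over the text.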

Fuzzywuzzy

The name of this library sounds strange, but Fuzzywuzzy is a very useful library for string matching. It makes it easy to compute string similarity ratios, token ratios, and similar measures, and to match records stored in different databases.

Installation:

$ pip install fuzzywuzzy

Example:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple ratio
fuzz.ratio("this is a test", "this is a test!")
97

# Partial ratio
fuzz.partial_ratio("this is a test", "this is a test!")
100

More interesting examples can be found at the GitHub repository.
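
The two scores above can be approximated with the standard library's difflib, which helps show what they measure. This is a stand-in sketch, not Fuzzywuzzy's code: the functions `ratio` and `partial_ratio` are written here for illustration, and the brute-force sliding window in `partial_ratio` is a simplifying assumption.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two whole strings on a 0-100 scale."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally
    long slice of the longer string (brute-force sliding window)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    width = len(shorter)
    best = 0
    for start in range(len(longer) - width + 1):
        best = max(best, ratio(shorter, longer[start:start + width]))
    return best


print(ratio("this is a test", "this is a test!"))
print(partial_ratio("this is a test", "this is a test!"))
```

The whole-string score is lowered by the extra exclamation mark, while the partial score ignores it because one slice of the longer string matches the shorter one exactly.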

PyFlux

Time series analysis is one of the most common problems in machine learning. PyFlux is an open source library in Python that was built to handle time series problems. The library has an excellent set of modern time series models, including but not limited to ARIMA, GARCH and VAR models. In short, PyFlux provides a probabilistic approach to time series modeling. It’s worth a try.

Installation:

$ pip install pyflux

Example:

Please refer to the official documentation for detailed usage and examples.

Ipyvolume

Presenting results is also an important aspect of data science, and being able to visualize them is a big advantage. IPyvolume is a Python library for visualizing 3D objects and graphics (such as 3D scatter plots) in the Jupyter Notebook with minimal configuration. It is still pre-1.0, however. An apt analogy: IPyvolume’s volshow is to 3D arrays what Matplotlib’s imshow is to 2D arrays. You can read more here.

Using pip:

$ pip install ipyvolume

Using Conda/Anaconda:

$ conda install -c conda-forge ipyvolume

Examples (figures): animation and volume rendering.

Dash

Dash is a productive Python framework for building web applications. It is built on top of Flask, Plotly.js, and React.js, and ties modern UI elements such as dropdowns, sliders, and graphs directly to your analytical Python code, with no JavaScript required. Dash is ideal for building data visualization applications, which are then rendered in a web browser. The user guide is available here.

Installation:

$ pip install dash==0.29.0
$ pip install dash-html-components==0.13.2
$ pip install dash-core-components==0.36.0
$ pip install dash-table==3.1.3  # Interactive DataTable component (new!)

Example:

The following example shows a highly interactive chart with a dropdown. When the user selects a value from the dropdown, the application code dynamically loads data from Google Finance into a pandas DataFrame.

Gym

OpenAI’s Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The Gym library is a collection of test problems, also called environments, that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, which allows you to write general algorithms.

Installation:

$ pip install gym

Example:

This example runs an instance of the CartPole-v0 environment for 1000 time steps, rendering the environment at each step.
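
The shared interface mentioned above can also be sketched without installing Gym. The toy `CoinFlipEnv` class and the `run_random_agent` helper below are hypothetical and not part of Gym. They assume the reset/step convention of classic Gym releases, where `step` returns an observation, a reward, a done flag, and an info dictionary.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a coin flip."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps = 0
        return 0

    def step(self, action):
        """Apply an action; return (observation, reward, done, info)."""
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0
        self.steps += 1
        done = self.steps >= self.episode_length
        return coin, reward, done, {}


def run_random_agent(env, max_steps=1000, seed=1):
    """Works with any object exposing reset() and step(action)."""
    rng = random.Random(seed)
    env.reset()
    total = 0.0
    for t in range(max_steps):
        action = rng.randint(0, 1)  # random policy over two actions
        _, reward, done, _ = env.step(action)
        total += reward
        if done:
            return total, t + 1
    return total, max_steps


total_reward, steps_taken = run_random_agent(CoinFlipEnv())
print(steps_taken, total_reward)
```

Because `run_random_agent` only relies on `reset` and `step`, the same loop could drive any environment that follows this convention, which is the point of a shared interface.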

Conclusion

I hand-picked these useful Python libraries for data science, leaving out the usual ones such as NumPy and pandas. If you know of other libraries that belong on the list, please mention them in the comments below. And don’t forget to try them out first.
