This article was republished by KDnuggets as one of its top news items for the week of November 27 to December 3 (https://www.kdnuggets.com/2017/12/top-news-week-1127-1203.html).

We use for loops for most of the work that requires iterating over a long list of elements. I'm pretty sure that almost everyone reading this article wrote their first matrix or vector multiplication code in high school or college using a for loop. The for loop has provided long-term, stable service to the programming community.

However, for loops are typically slow when processing large data sets (e.g., millions of records in the era of big data). This is especially true for an interpreted language like Python: if your loop body is simple, the interpreter overhead of the loop itself accounts for a large share of the runtime.

Fortunately, most major programming languages offer an alternative to the explicit loop, and the same is true for Python.

NumPy, short for Numerical Python (http://numpy.org/), is the fundamental package required for high-performance scientific computing and data analysis in the Python ecosystem. It is the foundation on which nearly all of the higher-level tools, such as Pandas and scikit-learn, are built. TensorFlow uses NumPy arrays as its basic building blocks; on top of them it constructs the Tensor objects and computational graphs used for deep learning (which relies heavily on linear algebra operations on long lists/vectors/matrices of numbers).

The two most important features that Numpy provides are:

  • ndarray: a fast, space-efficient multidimensional array that provides vectorized arithmetic operations and sophisticated broadcasting capabilities (https://towardsdatascience.com/two-cool-features-of-python-numpy-mutating-by-slicing-and-broadcasting-3b0b86e8b4c7)
  • Standard mathematical functions that operate quickly on entire arrays of data without having to write loops (a short sketch of both features follows this list)
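As a minimal sketch of these two features (my own example, not code from the original article):

import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # a 2x3 ndarray

b = a * 10 + 1                          # vectorized arithmetic: elementwise, no explicit loop
c = a + np.array([0.0, 100.0, 200.0])   # broadcasting: the 1-D row stretches across both rows
d = np.log10(a)                         # a standard math function applied to the whole array
print(b, c, d, sep="\n")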

In the data science, machine learning, and Python communities, you will often come across the assertion that NumPy is much faster, thanks to its vectorized implementation and the fact that many of its core routines are written in C (based on the CPython framework: https://en.wikipedia.org/wiki/CPython).

And it is indeed true. This article (http://notes-on-cython.readthedocs.io/en/latest/std_dev.html) is a nice demonstration of the various options for working with NumPy, including writing bare-bones C routines with the NumPy API. NumPy arrays are densely packed arrays of a homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of them are of the same type. So NumPy arrays get the benefit of locality of reference (https://en.wikipedia.org/wiki/Locality_of_reference).
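A quick way to see this layout difference (my own sketch, assuming a 64-bit Python build):

import sys
import numpy as np

n = 1_000_000
py_list = [float(i) for i in range(n)]   # a list of pointers to separate float objects
np_arr = np.arange(n, dtype=np.float64)  # one contiguous block of float64 values

print(sys.getsizeof(py_list))  # size of the list's pointer array only, not the float objects
print(np_arr.nbytes)           # 8,000,000 bytes: n elements * 8 bytes each, stored contiguously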

Many NumPy operations are implemented in C, avoiding the general cost of loops in Python: pointer indirection and per-element dynamic type checking (https://www.sitepoint.com/typing-versus-dynamic-typing/). The speed boost depends on which operations you are performing, but it is a valuable advantage for data science and modern machine learning, where data sets often run to millions or even billions of records and you cannot afford to crunch through them with for loops.

But how do you verify this claim for yourself on a moderately large data set?

Here is the link to the Jupyter notebook on GitHub (https://github.com/tirthajyoti/PythonMachineLearning/blob/master/How%20fast%20are%20NumPy%20ops.ipynb). In a few simple lines of code, it compares the speed of NumPy operations against regular Python constructs such as the for loop, the map function (https://stackoverflow.com/questions/10973766/understanding-the-map-function), and the list comprehension (http://www.pythonforbeginners.com/basics/list-comprehensions-in-python).

Here I briefly summarize the basic process:

  • Create a list of floating-point numbers of moderate size, preferably drawn from a continuous statistical distribution such as a Gaussian or uniform random distribution. I chose one million records for the demonstration.
  • Create an ndarray object from that list, i.e., vectorize it. The setup sketch below is my own, with assumed variable names (l1, a1, l2, speed) that match the timing snippets that follow:
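import time
import numpy as np
from math import log10 as lg10  # assumed alias, since the loop code below calls lg10()

l1 = list(np.random.uniform(1, 100, size=1_000_000))  # one million random floats (positive, so log10 is defined)
a1 = np.array(l1)   # the vectorized ndarray used in the NumPy test
l2 = []             # output list for the loop-based versions
speed = []          # execution times collected for the final plot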
  • Write short blocks of code that iterate over the list and apply a mathematical function, such as the base-10 logarithm, to each element, using a for loop, the map function, and a list comprehension in turn. Use the time.time() function to measure how long each takes to process one million records. The for-loop version is shown below; sketches of the map and list-comprehension versions follow it.
t1 = time.time()
for item in l1:
    l2.append(lg10(item))
t2 = time.time()
print("With for loop and appending it took {} seconds".format(t2 - t1))
speed.append(t2 - t1)
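The map and list-comprehension versions would look roughly like this (my sketches, reusing the same l1, lg10, and speed; the exact code is in the linked notebook):

t1 = time.time()
l2 = list(map(lg10, l1))  # map applies lg10 to every element
t2 = time.time()
print("With map function it took {} seconds".format(t2 - t1))
speed.append(t2 - t1)

t1 = time.time()
l2 = [lg10(item) for item in l1]  # list comprehension
t2 = time.time()
print("With list comprehension it took {} seconds".format(t2 - t1))
speed.append(t2 - t1)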
  • Do the same operation with NumPy's built-in math function (np.log10) on the ndarray object, and time how long it takes.
t1 = time.time()
a2 = np.log10(a1)
t2 = time.time()
print("With direct Numpy log10 method it took {} seconds".format(t2 - t1))
speed.append(t2 - t1)
  • Store the execution times in a list and plot a bar chart showing the comparative differences (a sketch of this plotting step follows the list).
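The plotting step might look like this (my sketch; the notebook's exact plotting code may differ):

import matplotlib.pyplot as plt

labels = ["for loop", "map", "list comprehension", "numpy"]
plt.bar(labels, speed)   # 'speed' holds the four timings collected above
plt.ylabel("Execution time (seconds)")
plt.title("Time to compute log10 of one million numbers")
plt.show()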

Here is the result. You can repeat the whole process by running all the code cells in the Jupyter notebook. Each run generates a new set of random numbers, so the exact execution times will vary, but overall the trend will be the same. You can try various other mathematical functions or string operations, alone or in combination, to see whether the trend holds in general.

If you want to dig deeper, there is a complete open-source online book on this topic written by a French neuroscience researcher (https://www.labri.fr/perso/nrougier/from-python-to-numpy/#id7).



(Figure: bar chart comparing the speeds of simple mathematical operations.)


If you have any questions or ideas to share, please contact the author at [email protected]. You can also check the author's GitHub repositories (https://github.com/tirthajyoti) for other fun code snippets in Python, R, or MATLAB, as well as machine learning resources. You can also follow me on LinkedIn (https://www.linkedin.com/in/tirthajyoti-sarkar-2127aa7/).


The original post was published on December 21, 2017

Author: Tirthajyoti Sarkar

This article comes from the cloud community partner "Datapai THU". For related information, you can follow the "Datapai THU" WeChat public account.