How can Pypy make Python faster than C? We can dig a little deeper and get to the bottom of pYPY.

Before we get started, say a few more words. Our company has developed a number of track games, including a Top2 fish game project, SLG card game project that won Aladdin magic Lamp award and overseas match-3 game. For these different types of games, the back end is mostly pYPY. I have a little experience with how to use pYPY. No more words, to begin formally, this article includes the following parts:

  • Classification of language
  • Python’s interpreter implementation
  • Why is pypy fast
  • Performance comparison
  • Performance optimization method
  • The characteristics of pypy
  • summary

Classification of language

Let’s start with some basic concepts of language classification, and those who know this part very well can skip it.

Static versus dynamic languages

If the type of a variable is known at compile time, the language is statically typed. Common examples of statically typed languages include Java, C, C ++, FORTRAN, Pascal, and Scala. In a statically typed language, once a variable is declared with a type, it cannot be assigned to another variable of a different type, and doing so would raise a type error at compile time.

# java int data; data = 50; Data = "Hello Game_404!" ; // causes an compilation errorCopy the code

If the type of a variable is checked at run time, the language is dynamically typed. Common examples of dynamically typed languages include JavaScript, Objective-C, PHP, Python, Ruby, Lisp, and Tcl. In dynamically typed languages, variables are bound to objects at run time via assignment statements, and the same variable can be bound to objects of different types during program execution.

# python data = 10; data = "Hello Game_404!" ; // no error caused data = data + str(10)Copy the code

In general, static languages are compiled into bytecode for execution, while dynamic languages are executed using an interpreter. Compiled languages perform better, but are harder to port to different CPU architectures and operating systems. Interpreted languages are portable, and performance can be much worse than compiled languages. These are the two extremes of the spectrum.

Strongly typed languages vs weakly typed languages

A strongly typed language is a language in which variables are bound to a particular data type, which results in a type error if the type is not as expected in the expression, such as this one:

# python

temp = “Hello Game_404!”
temp = temp + 10; // program terminates with below stated error (TypeError: must be str, not int)
Copy the code

Python feels at odds with us and betrays a weakly typed language, unlike the best in the world 🙁

# php

$temp = “Hello Game_404!”;
$temp = $temp + 10; // no error caused
echo $temp;
Copy the code

The quadrants of common programming languages are classified as follows:

This part of the content is mainly translated from the reference link 1

Python’s interpreter implementation

Python is a dynamic programming language that is interpreted and executed by a specific interpreter. Here are some interpreter implementations:

  • CPython uses an interpreter implemented in C
  • PyPy uses an interpreter implemented in RPython, a subset of the Python language. In general, PyPy is 4.2 times faster than CPython
  • Stackless Python interpreters with coroutine implementations
  • Jython Java implementation interpreter
  • An interpreter implemented by IronPython.net
  • A more recent implementation of Pyston, a branch of CPython 3.8.8, has other optimizations for performance. It provides up to 30% acceleration for large real-world applications, such as Web services, without the need for development.
  • .

A few more related concepts:

  • IPython && Jupyter IPython is an interactive shell built using Python, and Jupyter is its web-based wrapper.
  • Anaconda is a Python virtual environment commonly used in Python data science.
  • Mypyc is a new project that compiles Python to a C code base in order to improve the efficiency of Python.
  • Py files and PyC files PyC files are bytecode compiled by Python and can also be executed by the Python interpreter.
  • Both wheel files and egg files are packaged files for project releases, with wheel being the latest standard.
  • .

The question is, isn’t Python an interpreted language? Why is there a compiled PyC. Here’s the thing: after py files are compiled into Pyc, the interpreter defaults to executing pyC files first, which makes python programs start up faster. After betraying weakly-typed languages, python is a ghost that jumps between compiled and interpreted languages.

Another event is the Go language’s implementation of bootstrap in version 1.5. The Go language implemented its compiler in C before version 1.5, and implemented its own compiler in Go during version 1.5. There is a chicken-and-egg process here, which is also interesting.

Why is pypy fast

Pypy uses rPython, a subset of Python, to implement the interpreter, similar to the bootloader of Go described earlier. Against common sense is that the RPython interpreter will be faster than the C interpreter? Mainly because PYPY uses JIT technology.

Just-in-time (JIT) Compiler attempts to get the best of both worlds by doing some actual compilation and some interpretation of the machine code. In short, here are the steps that JIT compilation takes to improve performance:

  1. Identify the most commonly used components in your code, such as functions in loops.
  2. These parts are converted to machine code at run time.
  3. Optimize the generated machine code.
  4. Swap the previous implementation with the optimized machine code version.

This is what the examples in “How Pypy makes Python Faster than C” show. Pypy has the following features in addition to its fast speed:

  • Less memory usage than cpython
  • Gc policy is more optimized
  • Stackless coroutine mode is supported by default and supports high concurrency
  • Good compatibility, highly compatible cpython implementation, basically can seamlessly switch

These are all claims

Pypy is so strong, fast and province are occupied, why not a large scale popular? Personally, I think it’s mostly Python.

  1. Pypy has no advantage in the python ecosystem where a large number of libraries are implemented in C, especially those related to scientific computing /AI. Pypy is fast mainly in pure-Python, which is the pure Python implementation.
  2. Pypy suitable for high concurrency applications with long memory (Web services class)
  3. Python is a glue language that doesn’t aim for extreme performance, and even four times as fast isn’t fast enough :(🐶. Certainly not as good as C, the cpython interpreter implemented by C.

Note that pYPy also has a GIL, so high concurrency is mostly in Stackless.

Refer to self reference link 2 for this section

Performance comparison

We can write performance test cases, speak in code, and compare implementations. The test cases in this article are not rigorous, but suffice to illustrate a few points.

Driving and walking

The cumulative number of test cases in the original text is 100000000, let’s reduce it to 1000:

import time

start = time.time()
number = 0
for i in range(1000):
    number += i

print(number)
print(f"Elapsed time: {time.time() - start} s")
Copy the code

The test results are shown in the following table (the test environment is in the appendix of this article):

The interpreter cycles Time (s)
python3 1000 0.00014281272888183594
pypy3 1000 0.00036716461181640625

It turns out that Cpython is faster than pypy for 1000 loops, as opposed to 100000000 loops. The following example can be very vivid to explain this point.

Suppose you want to go to a store near your home. You can walk or drive. Your car is obviously faster than your feet. However, consider that you need to do the following:

  1. Go to your garage.
  2. Start your car.
  3. Keep the car warm.
  4. Drive to the store.
  5. Look for a parking space.
  6. Repeat the process on the return trip.

Driving involves a lot of overhead, and it’s not always worth it if the place you want to go is nearby! Now think about what would happen if you wanted to go to a nearby city 50 miles away. It’s certainly worth driving there instead of walking.

Examples come from reference link 2

The same is true of PyPy and CPython, although the difference in speed is not as obvious as the above analogy.

Horizontal contrast

Let’s compare the performance of C, Python3, Pypy3, JS, and Lua horizontally.

# js const start = Date.now(); let number = 0 for (i=0; i<100000000; i++){ number += i } console.log(number) const millis = Date.now() - start; console.log(`milliseconds elapsed = `, millis); # lua local starttime = os.clock(); local number = 0 local total = 100000000-1 for i=total,1,-1 do number = number+i end print(number) local endtime = os.clock(); print(string.format("elapsed time : %.4f", endtime - starttime)); # c #include <stdio.h> #include <time.h> const long long TOTAL = 100000000; long long mySum() { long long number=0; long long i; for( i = 0; i < TOTAL; i++ ) { number += i; } return number; } int main(void) { // Start measuring time clock_t start = clock(); printf("%llu \n", mySum()); // Stop measuring time and calculate the elapsed time clock_t end = clock(); double elapsed = (end - start)/CLOCKS_PER_SEC; printf("Time measured: %.3f seconds.\n", elapsed); return 0; }Copy the code
The interpreter cycles Time (s)
c 100000000 0.000
pypy3 100000000 0.15746307373046875
js 100000000 0.198
lua 100000000 0.8023
python3 100000000 10.14592313766479

The test results show that C is undoubtedly the fastest and beats other languages, which is the characteristic of compiled languages. In explanatory language, pypy3’s performance deserves the word “excellent”.

Memory footprint

Output for increasing memory footprint in test case:

p = psutil.Process()
mem = p.memory_info()
print(mem)
Copy the code

The test results are as follows:

# python3
pmem(rss= 9027584, vms=4747534336, pfaults= 2914, pageins=1)

# pypy3
pmem(rss=39518208, vms=5127745536, pfaults=12188, pageins=58)
Copy the code

Pypy3 is going to have a higher memory footprint than Python3, which is just scientific, swapping memory space for runtime. Of course, this is not a rigorous evaluation. In reality, PYPY claims to have a lower footprint. I doubt it, but there is no proof.

Performance optimization method

With the language performance comparison in mind, let’s take a look at some of the performance optimization methods that are helpful in choosing between Cpython and Pypy.

Using c functions

In Python, c functions, such as the sum here, can be replaced by reduce to improve efficiency:

def my_add(a, b):
    return a + b

number = reduce(add, range(100000000))
Copy the code
The interpreter The number of Time (s)
pypy3 reduce 0.08371400833129883
pypy3 100000000 0.15746307373046875
python3 reduce 5.705173015594482 s
python3 100000000 cycle 10.14592313766479

The results show that Reduce works on both Cpython and Pypy.

To optimize the cycle

Optimize the most critical place, improve the efficiency of the algorithm, reduce the cycle. To change the cumulative requirement, assuming we are summing even numbers up to 100000000, the following shows how to improve performance by reducing the number of loops using the step of range:

Python3 xrange = range def test_0(): number = 0 for i in range(100000000): if i % 2 == 0: number += i return number def test_1(): number = 0 for i in xrange(0, 100000000, 2): number += i return numberCopy the code
The interpreter cycles Time (s)
python3 50000000 2.6723649501800537 s
python3 100000000 6.530670881271362 s

After halving the number of cycles, the effective rate was significantly improved.

The static type

Python3 can use type annotations to improve code readability. Type determination is a logical performance aid that eliminates the need for type inference every time data is processed.

number: int = 0
for i in range(100000000):
    number += i
Copy the code
The interpreter cycles type Time (s)
python3 100000000 int 9.492593050003052 s
python3 100000000 Does not define 10.14592313766479 s

Memory is equivalent to one space, and we have to fill it with different boxes. Part 1 on the left of the figure is filled with boxes of length 4(think float), one line at a time, which is the fastest; In the middle part of the figure (2), use boxes of length 3(think long) and length 1(think int), 2 in a row, also pretty fast; 3 on the right side of the figure, although the length of the box is still 3 and 1, the speed is the slowest because there is no scale and it needs to be tried when filling.

The charm of algorithms

At the end of the optimization, the most important thing comes: the Gaussian summation algorithm. Gauss story, presumably everyone is not unfamiliar, the following is the algorithm implementation:

def gaussian_sum(total: int) -> int:
    if total & 1 == 0:
        return (1 + total) * int(total / 2)
    else:
        return total * int((total - 1) / 2) + total


# 4999999950000000
number = gaussian_sum(100000000 - 1)
Copy the code
The interpreter cycles Time (s)
python3 Gauss sum 4.100799560546875 e-05 s
python3 100000000 cycle 10.14592313766479

After using gaussian summation, the program seconds open. This is probably the industry interview, to test the truth of the algorithm, but also the charm of the algorithm.

Principle of optimization

Briefly introduce the principles of optimization, mainly the following two points:

  1. Use tests instead of speculations.
python3 -m timeit 'x=3' 'x%2' 10000000 loops, best of 5: 25.3 nsec per loop PYTHon3-m timeit 'x=3' x&1' 5000000 loops, best of 5: 41.3 nsec per loop python2-m timeit 'x=3' x&1' 10000000 loops, best of 3: Python2 -m timeit 'x=3 "x%2' 10000000 loops, best of 3: 0.0371 usec per loop python2 -m timeit 'x=3" x%2' 10000000 loops, best of 3: 0.0371 usec per loopCopy the code

The example above shows that the bit operation in python3 is slower than modulo in the case of parity, which is a counter-intuitive place to speculate. This is also covered in my python Cold Weapons Collection article. It is also important to note that Python2 and Python3 have the opposite performance, so the performance optimization should be measured, paying attention to the environment and effectiveness.

  1. Follow the 2/8 rule, don’t over-optimize, don’t repeat.

The characteristics of pypy

Pypy also has the following features:

  • Cffi pypy You are advised to load C in CFFI mode
  • Using cProfile to detect performance in cProfile pypy is invalid
  • The gc mode of sys. getSizeof pypy is different. Sys. getSizeof cannot be used
  • __slots__ cpython uses slots, invalid under pypy

Using slots in Python objects can reduce object memory usage and improve efficiency. Here is a test case:

def test_0():
    class Player(object):

        def __init__(self, name, age):
            self.name = name
            self.age = age

    players = []
    for i in range(10000):
        p = Player(name="p" + str(i), age=i)
        players.append(p)
    return players


def test_1():
    class Player(object):
        __slots__ = "name", "age"

        def __init__(self, name, age):
            self.name = name
            self.age = age

    players = []
    for i in range(10000):
        p = Player(name="p" + str(i), age=i)
        players.append(p)
    return players
Copy the code

The test log is as follows:

# python3 slots pmem(rss=10776576, vms=5178499072, pfaults=3351, pageins=58) Elapsed time: Velapsed time (RSS = 11892384, VMS =5033795584, PELAPSED =3587, Pageins =0) velapsed time: Elapsed time (RSS =40042496, VMS =5263011840, Pfaults =12341, Pageins =4071) Elapsed time: 800800s # Elapsed time (RSS = 3982582583838862, VMS =4974653440, Pelapsed =12280, Pageins =0) 0.004619121551513672 sCopy the code

See reference links 4 and 5 for more information

The most important feature of Pypy is Stackless, which supports high concurrency. There is a distinction between I/ o-bound and COMPUte-bound. Cpu-intensive code is slow because it executes a lot of CPU instructions, such as the for loop above. There is a difference between I/O intensive and slow due to disk or network latency. This part of the content, to introduce clearly is not easy, let us see in the next chapter.

summary

Python is an interpreted programming language with multiple interpreter implementations, the most common being the Cpython implementation. Pypy uses JIT technology, which can significantly improve Python execution efficiency in some common scenarios, and is highly compatible with Cpython. If your project has a large portion of pure Python, it is recommended to try running the program using Pypy.

By the way, our company is recruiting all kinds of technology research and development, game planning, testing, product managers and so on. Friends interested in Python direction can add a blogger to wechat public account to participate in the internal promotion.

Note: due to the limited personal ability, if the examples in the text are fallacious, please forgive me.

The appendix

The test environment

  • MacBook Pro (16-inch, 2019) (2.6 GHz Intel Core i7)
  • Python 2.7.16
  • Python 3.8.5
  • Python 3.6.9 [PyPy 7.3.1 with GCC 4.2.1 Compatible Apple LLVM 11.0.3 (Clang-1103.0.32.59)] Python 3.6.9 [PyPy 7.3.1 with GCC 4.2.1 Compatible Apple LLVM 11.0.3 (Clang-1103.0.32.59)]
  • Python 2.7.13 [PyPy 7.3.1 with GCC 4.2.1 Compatible Apple LLVM 11.0.3 (clang-1103.0.32.59)]
  • Lua #Lua 5.2.3 Copyright (C) 1994-2013 Lua.org, puc-rio
  • Node# v10.16.3

Refer to the link

  1. Medium.com/android-new…
  2. Realpython.com/pypy-faster…
  3. www.pypy.org/index.html
  4. Stackoverflow.com/questions/2…
  5. Morepypy.blogspot.com/2010/11/eff…