This article was compiled by Alex Rogozhnikov at GitHub

Currently, all major projects in the Python science stack support both Python 3.x and Python 2.7, although that will soon end. Last November, the Numpy team sparked concern in the data science community with an announcement that the scientific computing library was dropping support for Python 2.7 in favor of Python 3. Numpy is not the only tool that has announced plans to drop support for older versions of Python. Pandas and Jupyter Notebook are also on the list. Moving existing projects from Python 2 to Python 3 is a major ongoing problem for data science developers. Dr. Alex Rogozhnikov from Moscow University has put together a code migration guide for us.


Introduction to Python 3 features

Python is the dominant language in machine learning and other sciences, and we often need to use it to process large amounts of data. Python is compatible with a variety of deep learning frameworks and has many excellent tools to perform data preprocessing and visualization.

However, Python 2 and Python 3 have long coexisted in the Python ecosystem, and many data scientists still use Python 2. By the end of 2019, many scientific computing tools such as Numpy will stop supporting Python 2, and new feature releases of Numpy after 2018 will support only Python 3.

To make the transition from Python 2 to Python 3 easier, I’ve collected some Python 3 features that I hope you’ll find useful.


Use Pathlib to better handle paths

pathlib is a standard-library module in Python 3 that helps you avoid writing a large number of os.path.join calls:

from pathlib import Path

dataset = 'wiki_images'
datasets_root = Path('/path/to/datasets/') 
train_path = datasets_root / dataset / 'train'
test_path = datasets_root / dataset / 'test'

for image_path in train_path.iterdir():
    with image_path.open() as f:  # note, open is a method of Path object
        # do something with an image
        pass

Previously it was always tempting to use string concatenation (concise, but obviously bad); now with pathlib the code is safe, concise, and readable.

In addition, pathlib.Path has a large number of methods and properties that every Python novice previously had to search for:

p.exists()
p.is_dir()
p.parts  # a property, not a method
p.with_name('sibling.png') # only change the name, but keep the folder
p.with_suffix('.jpg') # only change the extension, but keep the folder and the name
p.chmod(mode)
p.rmdir()

Pathlib saves a lot of time. See:

  • Documents: https://docs.python.org/3/library/pathlib.html;

  • Reference information: https://pymotw.com/3/pathlib/.
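As a minimal sketch of the methods listed above, the following builds a throwaway directory with tempfile and tries a few Path operations on it. The directory and file names are made up for illustration.

```python
import tempfile
from pathlib import Path

# Build a throwaway dataset layout; all names here are illustrative.
root = Path(tempfile.mkdtemp())
train_path = root / 'wiki_images' / 'train'
train_path.mkdir(parents=True)
(train_path / 'cat.png').write_bytes(b'not a real image')

image_path = train_path / 'cat.png'
exists = image_path.exists()               # the file was just created
is_dir = image_path.is_dir()               # it is a file, not a directory
last_parts = image_path.parts[-3:]         # `parts` is a property, not a method
sibling = image_path.with_name('dog.png')  # same folder, different name
as_jpg = image_path.with_suffix('.jpg')    # same folder and stem, new extension
names = sorted(p.name for p in train_path.iterdir())
```

Note that with_name and with_suffix only compute new paths; nothing is renamed on disk.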


Type hinting became part of the language

(Image: an example of type hints in PyCharm.)

Python is no longer just a scripting language; today's data pipelines consist of a large number of steps, each with a different framework (and sometimes very different logic).

Type hints were introduced into Python to help with the growing complexity of programs, so that machines can help with code verification. Previously, different modules used their own custom ways of specifying types in docstrings (note that PyCharm can convert old docstrings into new type hints).

The following code is a simple example of handling different types of data (which is what we like about the Python data stack).

import numpy

def repeat_each_entry(data):
    """ Each entry in the data is doubled """
    index = numpy.repeat(numpy.arange(len(data)), 2)
    return data[index]

The above code works for numpy.array (including multidimensional ones), astropy.Table and astropy.Column, bcolz, cupy, mxnet.ndarray, and others.

This code also works for pandas.Series, but in the wrong way:

repeat_each_entry(pandas.Series(data=[0, 1, 2], index=[3, 4, 5])) # returns Series with Nones inside (recent pandas versions raise KeyError instead)

This was just two lines of code. Imagine how unpredictable the behavior of a complex system can be when a single function may misbehave. Stating explicitly which types a method expects is very helpful in large systems, because it warns you when a function is passed unexpected arguments.

def repeat_each_entry(data: Union[numpy.ndarray, bcolz.carray]):

If you have a significant code base, a type checking tool like MyPy is likely to become part of your continuous integration pipeline. Unfortunately, hints are not yet powerful enough to provide fine-grained typing for ndarrays/tensors, but maybe we'll have it soon, which would be a great feature for data science.


Type hint → Type checking at run time

By default, function annotations don't affect how your code works; they merely document the intent of your code.

However, you can enforce type checking at run time using tools such as Enforce, which can help you debug your code (there are many cases where static type hinting alone does not catch the problem).

import enforce
from typing import List

@enforce.runtime_validation
def foo(text: str) -> None:    
    print(text)
foo('Hi') # ok
foo(5)    # fails

@enforce.runtime_validation
def any2(x: List[bool]) -> bool:    
    return any(x)

any ([False, False, True, False]) # True
any2([False, False, True, False]) # True

any (['False']) # True
any2(['False']) # fails

any ([False, None, "", 0]) # False
any2([False, None, "", 0]) # fails
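To show the mechanism behind such tools, here is a minimal hand-rolled sketch using only the standard library. The check_types decorator is hypothetical and is not how Enforce is implemented; it only validates parameters annotated with plain classes and skips generics such as List[bool].

```python
import functools
import inspect

def check_types(func):
    """Hypothetical helper: validate arguments against plain-class annotations."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = signature.parameters[name].annotation
            if expected is inspect.Parameter.empty:
                continue  # parameter has no annotation, nothing to check
            # Only plain classes are checked; generics such as List[bool] are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f'{name} must be {expected.__name__}, got {type(value).__name__}')
        return func(*args, **kwargs)
    return wrapper

@check_types
def shout(text: str) -> str:
    return text.upper()

ok = shout('hi')
try:
    shout(5)
    failed = False
except TypeError:
    failed = True  # the int argument was rejected before the body ran
```

The annotations themselves stay passive; it is the decorator that reads them through inspect.signature and decides what to do.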


Other uses for function annotations

As mentioned earlier, annotations do not affect code execution; they provide meta-information that you can use as you wish.

For example, units of measurement are a common pain in the scientific community, and the Astropy package provides a simple decorator to control the units of input quantities and convert the output into the desired units.

# Python 3
from astropy import units as u
@u.quantity_input()
def frequency(speed: u.meter / u.s, wavelength: u.m) -> u.terahertz:    
    return speed / wavelength

frequency(speed=300_000 * u.km / u.s, wavelength=555 * u.nm)
# output: 540.5405405405404 THz, frequency of green visible light

If you are processing tabular scientific data in Python (not necessarily astronomical), you should give Astropy a try. You can also define application-specific decorators that control/transform inputs and outputs in the same way.


Matrix multiplication by @

Below, we implement the simplest machine learning model, namely linear regression with L2 regularization:

# l2-regularized linear regression: || AX - b ||^2 + alpha * ||X||^2 -> min

# Python 2
X = np.linalg.inv(np.dot(A.T, A) + alpha * np.eye(A.shape[1])).dot(A.T.dot(b))
# Python 3
X = np.linalg.inv(A.T @ A + alpha * np.eye(A.shape[1])) @ (A.T @ b)

The code with @ as matrix multiplication is more readable and easier to translate between deep learning frameworks: the same code X @ W + b[None, :] for a single-layer perceptron works in libraries as diverse as Numpy, Cupy, PyTorch, and TensorFlow.
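The reason one expression can work across so many libraries is that @ simply dispatches to the __matmul__ method of the left operand. A minimal pure-Python sketch, with a toy Matrix class invented for this illustration (real code would use numpy or a similar library):

```python
class Matrix:
    """Tiny illustrative matrix type; real code would use numpy or similar."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Called for `self @ other`: the usual row-by-column product.
        columns = list(zip(*other.rows))
        return Matrix(
            [sum(a * b for a, b in zip(row, column)) for column in columns]
            for row in self.rows
        )

A = Matrix([[1, 2], [3, 4]])
B = Matrix([[0, 1], [1, 0]])  # multiplying by B on the right swaps the columns
product = (A @ B).rows
```

Any tensor library that defines __matmul__ on its array type gets the same syntax for free.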


Use ** as a wildcard

Recursive folder globbing is not easy in Python 2, which is why the custom glob2 module was written to overcome the problem. A recursive flag is supported since Python 3.5.

import glob

# Python 2
found_images = \
    glob.glob('/path/*.jpg') \
  + glob.glob('/path/*/*.jpg') \
  + glob.glob('/path/*/*/*.jpg') \
  + glob.glob('/path/*/*/*/*.jpg') \
  + glob.glob('/path/*/*/*/*/*.jpg')

# Python 3
found_images = glob.glob('/path/**/*.jpg', recursive=True)

A better option in Python3 is to use pathlib:

# Python 3
import pathlib
found_images = pathlib.Path('/path/').glob('**/*.jpg')
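The following sketch compares the recursive variants on a throwaway directory tree (file names are made up for illustration). Path.rglob is shown as well, since it is shorthand for a glob pattern prefixed with '**/'.

```python
import glob
import tempfile
from pathlib import Path

# Create a small nested tree in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / 'a' / 'b').mkdir(parents=True)
for relative in ['top.jpg', 'a/mid.jpg', 'a/b/deep.jpg', 'a/notes.txt']:
    (root / relative).touch()

pattern = str(root / '**' / '*.jpg')
with_glob = sorted(Path(p).name for p in glob.glob(pattern, recursive=True))
with_pathlib = sorted(p.name for p in root.glob('**/*.jpg'))
with_rglob = sorted(p.name for p in root.rglob('*.jpg'))
```

All three find the .jpg files at every depth, including the one directly under the root, because ** also matches zero directories.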


Print is a function in Python3

Yes, using print in Python 3 requires those annoying parentheses, but it has some advantages.

Simple syntax using file descriptors:

print >>sys.stderr, "critical error"      # Python 2
print("critical error", file=sys.stderr)  # Python 3

Output tab-aligned tables without str.join:

# Python 3
print(*array, sep='\t')
print(batch, epoch, loss, accuracy, time, sep='\t')

Modify and redefine the output of the print function:

# Python 3
_print = print # store the original print function
def print(*args, **kwargs):
    pass  # do something useful, e.g. store output to some file

In Jupyter, it is desirable to log each output to a separate file (to track what happened after you got disconnected), so you can now override the print function.

In the following code, we can temporarily override the behavior of the print function using the context manager:

import contextlib

@contextlib.contextmanager
def replace_print():
    import builtins
    _print = print  # saving old print function
    # or use some other function here
    builtins.print = lambda *args, **kwargs: _print('new printing', *args, **kwargs)
    try:
        yield
    finally:
        builtins.print = _print  # restore the original even if the body raises

with replace_print():
    # code here will invoke the other print function
    print('some text')

This approach is not recommended; it is a small hack that has now become possible, and replacing a builtin affects every module in the process.
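When the goal is only to capture or reroute output, a less invasive alternative is contextlib.redirect_stdout from the standard library. It is not part of the original guide; the sketch below is one possible way to collect printed text in memory.

```python
import contextlib
import io

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    # Inside the block, print writes into the buffer instead of the console.
    print('epoch', 1, 'loss', 0.25, sep='\t')

captured = buffer.getvalue()
print('captured:', repr(captured))  # normal printing works again here
```

Because print looks up sys.stdout on every call, the function itself stays untouched and the redirection ends as soon as the with block exits.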

The print function can take part in list comprehensions and other language constructs.

# Python 3
result = process(x) if is_valid(x) else print('invalid item: ', x)


F-strings for simple and reliable formatting

The default formatting system provides flexibility that is not required in data experiments. The resulting code is either too verbose or too fragile when anything changes. Quite typically, a data scientist iteratively outputs some logging information in a fixed format, usually with code like the following:

# Python 2
print('{batch:3} {epoch:3} / {total_epochs:3}  accuracy: {acc_mean:0.4f}±{acc_std:0.4f} time: {avg_time:3.2f}'.format(
    batch=batch, epoch=epoch, total_epochs=total_epochs,
    acc_mean=numpy.mean(accuracies), acc_std=numpy.std(accuracies),
    avg_time=time / len(data_batch)
))

# Python 2 (too error-prone during fast modifications, please avoid):
print('{:3} {:3} / {:3}  accuracy: {:0.4f}±{:0.4f} time: {:3.2f}'.format(
    batch, epoch, total_epochs, numpy.mean(accuracies), numpy.std(accuracies),
    time / len(data_batch)
))

Sample output:

120  12 / 300  accuracy: 0.8180±0.4649 time: 56.60

F-strings, also known as formatted string literals, were introduced in Python 3.6:

# Python 3.6+
print(f'{batch:3} {epoch:3} / {total_epochs:3}  accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')

In addition, f-strings are very convenient for writing queries (with trusted values only; untrusted input calls for parameterized queries):

query = f"INSERT INTO STATION VALUES (13, '{city}', '{state}', {latitude}, {longitude})"
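A small runnable comparison of positional str.format and an f-string, using the standard statistics module in place of numpy so that it has no dependencies. The accuracy values are made up for illustration.

```python
import statistics

batch, epoch, total_epochs = 120, 12, 300
accuracies = [0.80, 0.90]  # made-up values for illustration

# Positional placeholders: easy to misalign when a field is added or removed.
with_format = '{:3} {:3} / {:3}  accuracy: {:0.4f}±{:0.4f}'.format(
    batch, epoch, total_epochs,
    statistics.mean(accuracies), statistics.pstdev(accuracies),
)

# f-string: each expression sits next to its own format spec.
with_fstring = (
    f'{batch:3} {epoch:3} / {total_epochs:3}  '
    f'accuracy: {statistics.mean(accuracies):0.4f}±{statistics.pstdev(accuracies):0.4f}'
)
print(with_fstring)
```

Both produce the same text; the f-string version just keeps each value and its format spec in one place.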


An explicit difference between “true division” and “integer division”

For data science this is definitely a handy change (but not, I believe, for systems programming).

data = pandas.read_csv('timing.csv')
velocity = data['distance'] / data['time']

The results in Python 2 depend on whether the “time” and “distance” (for example, in meters and seconds) are saved as integers.

In Python 3, the result is correct in both cases, because the result of division is always a floating point number.

Another case is integer division, which is now an explicit operation:

n_gifts = money // gift_price  # correct for int and float arguments

Note that this applies both to built-in types and to custom data types provided by data packages (for example, numpy or pandas).
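The two operators can be compared directly on built-in numbers; this is a minimal sketch with arbitrary values:

```python
true_int = 7 / 2          # true division always yields a float
exact_int = 6 / 3         # even when the result is a whole number
floor_int = 7 // 2        # integer (floor) division is explicit
floor_float = 7.5 // 2    # works for floats too, and the result stays a float
floor_negative = -7 // 2  # rounds towards negative infinity, not towards zero
```

The last line is worth remembering when porting code that relied on truncation towards zero in other languages.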


Strict ordering

# All these comparisons are illegal in Python 3
3 < '3'
2 < None
(3, 4) < (3, None)
(4, 5) < [4, 5]

# False in both Python 2 and Python 3
(4, 5) == [4, 5]

This prevents accidental sorting of instances of different types.

sorted([2, '1', 3])  # invalid for Python 3, in Python 2 returns [2, 3, '1']

It also helps to spot problems that arise when processing raw data.

Side note: a proper check for None (in both versions of Python) is:

if a is not None:  
    pass

if a: # WRONG check for None
    pass
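A short sketch of both points, with made-up values: sorting mixed types fails loudly unless the intended order is stated through a key function, and a truthiness test is not a None test.

```python
mixed = [10, '9', 3]

try:
    sorted(mixed)
    mixed_sort_failed = False
except TypeError:
    mixed_sort_failed = True  # '<' is not defined between int and str

# Stating the intended order explicitly removes the ambiguity.
as_numbers = sorted(mixed, key=int)  # compare by numeric value
as_text = sorted(mixed, key=str)     # compare lexicographically

# `if a:` treats 0, '' and [] like None; `is not None` does not.
a = 0
truthy_check = bool(a)
none_check = a is not None
```

The two key functions give different orders for the same data, which is exactly the ambiguity Python 2 used to resolve silently.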


Unicode for natural language processing

s = '您好'
print(len(s))
print(s[:2])

Output:

  • Python 2: 6\n��

  • Python 3: 2\n您好

x = u'со'
x += 'co' # ok
x += 'со' # fail

Python 2 fails here, while Python 3 works as expected (because I used Russian letters in the strings).

In Python 3, strs are Unicode strings, which makes NLP processing of non-English text more convenient.

There are other funny things, for instance:

'a' < type < u'a'  # Python 2: True
'a' < u'a'         # Python 2: False

from collections import Counter
Counter('Möbelstück')

  • Python 2: Counter({'\xc3': 2, 'b': 1, 'e': 1, 'c': 1, 'k': 1, 'M': 1, 'l': 1, 's': 1, 't': 1, '\xb6': 1, '\xbc': 1})

  • Python 3: Counter({'M': 1, 'ö': 1, 'b': 1, 'e': 1, 'l': 1, 's': 1, 't': 1, 'ü': 1, 'c': 1, 'k': 1})

You can handle all of this properly in Python 2, but Python 3 is friendlier.
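The difference between characters and bytes can be checked directly in Python 3. This sketch uses a two-character Chinese greeting and the German word from the Counter example:

```python
from collections import Counter

s = '您好'
n_chars = len(s)                  # counts characters
n_bytes = len(s.encode('utf-8'))  # counts bytes of the UTF-8 encoding
first = s[:1]                     # slicing never cuts a character in half

# Each letter, including the umlauts, is counted as one character.
counts = Counter('Möbelstück')
```

Python 2's str behaved like the encoded bytes here, which is where the length of 6 and the broken slice came from.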


Keep the order of dictionary and **kwargs

In CPython 3.6+, dictionaries behave like OrderedDict by default (and this is guaranteed in Python 3.7+). Order is preserved in dict comprehensions (and in other operations such as JSON serialization/deserialization).

import json
x = {str(i):i for i in range(5)}
json.loads(json.dumps(x))
# Python 2
{u'1': 1, u'0': 0, u'3': 3, u'2': 2, u'4': 4}
# Python 3
{'0': 0, '1': 1, '2': 2, '3': 3, '4': 4}

The same applies to **kwargs (in Python 3.6+): they are kept in the order in which they appear in the call. Order is crucial when it comes to data pipelines. Previously, we had to write it in a cumbersome way:

from collections import OrderedDict
from torch import nn

# Python 2
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))

# Python 3.6+, how it *can* be done, not supported right now in pytorch
model = nn.Sequential(
    conv1=nn.Conv2d(1,20,5),
    relu1=nn.ReLU(),
    conv2=nn.Conv2d(20,64,5),
    relu2=nn.ReLU())

Did you notice? The uniqueness of the names is also checked automatically.
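Both properties can be observed without any deep learning library. In this sketch, build_pipeline is a hypothetical function that only records the names it receives:

```python
def build_pipeline(**steps):
    """Hypothetical builder: returns step names in the order they were passed."""
    return list(steps)

order = build_pipeline(conv1='conv', relu1='relu', conv2='conv', relu2='relu')

# A repeated keyword is rejected when the code is compiled, before any call runs.
try:
    compile("build_pipeline(conv1='a', conv1='b')", '<example>', 'eval')
    duplicate_rejected = False
except SyntaxError:
    duplicate_rejected = True
```

The duplicate check comes from the language itself, so an API built on keyword arguments gets it without extra code.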


Iterable unpacking

# handy when amount of additional stored info may vary between experiments, but the same code can be used in all cases
model_parameters, optimizer_parameters, *other_params = load(checkpoint_name)

# picking two last values from a sequence
*prev, next_to_last, last = values_history

# This also works with any iterables, so if you have a function that yields e.g. qualities,
# below is a simple way to take only last two values from a list 
*prev, next_to_last, last = iter_train(args)
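A runnable version of the same idea, where iter_qualities is a hypothetical generator standing in for a training loop:

```python
def iter_qualities():
    """Hypothetical training loop that yields one quality value per epoch."""
    yield from [0.61, 0.72, 0.78, 0.80]

# The starred name can sit in the middle or at the start.
first, *middle, last = iter_qualities()
*earlier, next_to_last, final = iter_qualities()

# The starred name always receives a list, possibly an empty one.
head, *rest = [42]
```

Note that unpacking consumes the whole iterable, so it is not suitable for infinite generators.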


The default pickle engine provides better compression for arrays

# Python 2
import cPickle as pickle
import numpy
print len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 23691675

# Python 3
import pickle
import numpy
len(pickle.dumps(numpy.random.normal(size=[1000, 1000])))
# result: 8000162

Three times less space, and it is much faster. In fact, similar compression (but not the same speed) can be achieved with the protocol=2 parameter, but users often ignore this option (or are simply unaware of it).


Safer comprehensions

labels = <initial_value>
predictions = [model.predict(data) for data, labels in dataset]

# labels are overwritten in Python 2
# labels are not affected by comprehension in Python 3
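The scoping rule is easy to verify with a made-up dataset; a plain for loop is included for contrast, since it still rebinds the outer name:

```python
labels = 'initial value'
dataset = [('x1', 'cat'), ('x2', 'dog')]  # made-up (data, label) pairs

# `labels` inside the comprehension is a separate, local variable.
inputs = [data for data, labels in dataset]
after_comprehension = labels

# A plain for loop still rebinds the outer name.
for data, labels in dataset:
    pass
after_loop = labels
```

So the protection applies to comprehensions and generator expressions only, not to ordinary loops.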


About super()

In Python 2, super(...) was a frequent source of mistakes in code.

# Python 2
class MySubClass(MySuperClass):    
    def __init__(self, name, **options):        
        super(MySubClass, self).__init__(name='subclass', **options)

# Python 3
class MySubClass(MySuperClass):    
    def __init__(self, name, **options):        
        super().__init__(name='subclass', **options)

For more about super and method resolution order, see Stack Overflow: stackoverflow.com/questions/5…


Better IDE suggestions with variable annotations

One of the most enjoyable things about programming in Java, C#, and the like is that the IDE can make very good suggestions, because the types of all identifiers are known before the code is executed.

This is hard to achieve in Python, but annotations can help you:

  • Write down your expectations in a clear form

  • Get good suggestions from the IDE

(Image: a PyCharm example of suggestions with variable annotations.) This works even if the function you use is not annotated (for example, due to backward compatibility).


Multiple unpacking

Example of code for merging two dictionaries in Python3:

x = dict(a=1, b=2)
y = dict(b=3, d=4)
# Python 3.5+
z = {**x, **y}
# z = {'a': 1, 'b': 3, 'd': 4}, note that value for `b` is taken from the latter dict.

For a comparison with the equivalent Python 2 code, see this link: stackoverflow.com/questions/3…

The same approach also works for lists, tuples, and sets (a, b, and c are arbitrary iterables):

[*a, *b, *c] # list, concatenating 
(*a, *b, *c) # tuple, concatenating 
{*a, *b, *c} # set, union 

Functions also support multiple unpacking for *args and **kwargs:

# Python 3.5+
do_something(**{**default_settings, **custom_settings})

# Also possible, this code also checks there is no intersection between keys of dictionaries
do_something(**first_args, **second_args)
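The difference between the two call styles shows up when the dictionaries share a key. In this sketch, configure is a hypothetical function that returns whatever settings it receives:

```python
def configure(**settings):
    """Hypothetical function that just returns the settings it received."""
    return settings

default_settings = {'lr': 0.1, 'epochs': 10}
custom_settings = {'epochs': 50}

# Merging first lets later dictionaries override earlier ones.
merged = configure(**{**default_settings, **custom_settings})

# Unpacking both directly refuses overlapping keys.
try:
    configure(**default_settings, **custom_settings)
    overlap_rejected = False
except TypeError:
    overlap_rejected = True
```

Pick the first form when overriding is intended and the second when an overlap would indicate a bug.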


An API with keyword arguments only

Let’s consider this code snippet:

model = sklearn.svm.SVC(2, 'poly', 2, 4, 0.5)

It is clear that the author of this code is not yet familiar with Python coding style (most likely they just jumped to Python from C++ or Rust). Unfortunately, this is not just a matter of taste, because changing the order of arguments in SVC (by adding or deleting one) will break this code. In particular, sklearn reorders or renames numerous algorithm parameters from time to time to provide a consistent API. Every such refactoring can break the code.

In Python 3, library authors can require explicitly named parameters by using *:

class SVC(BaseSVC):
    def __init__(self, *, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, ... )

  • Now the user has to specify the names of the parameters: sklearn.svm.SVC(C=2, kernel='poly', degree=2, gamma=4, coef0=0.5).

  • This mechanism makes the API both reliable and flexible.
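A minimal sketch of the mechanism, with a hypothetical make_classifier function standing in for a real library constructor:

```python
def make_classifier(*, C=1.0, kernel='rbf', degree=3):
    """Hypothetical constructor; every parameter after `*` is keyword-only."""
    return {'C': C, 'kernel': kernel, 'degree': degree}

model = make_classifier(C=2, kernel='poly', degree=2)

try:
    make_classifier(2, 'poly', 2)  # positional use is rejected
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Since callers cannot depend on parameter order, the library author is free to reorder or insert parameters later.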


Minor: constants in the math module

# Python 3
math.inf # 'largest' number
math.nan # not a number

max_quality = -math.inf  # no more magic initial values!
    
for model in trained_models:    
     max_quality = max(max_quality, compute_quality(model, data))


Minor: a single integer type

Python 2 provides two basic integer types: int (a 64-bit signed integer) and long for long arithmetic (quite confusing after C++).

Python 3 has a single int type, which incorporates long arithmetic.

Here’s how to check if the value is an integer:

isinstance(x, numbers.Integral) # Python 2, the canonical way
isinstance(x, (long, int))      # Python 2
isinstance(x, int)              # Python 3, easier to remember


Other

  • Enums are theoretically useful, but string inputs are already widely adopted in the Python data stack. Enums don't seem to interplay with numpy or with categoricals from pandas.

  • Coroutines also sound very promising for data pipelining, but have not yet seen mass adoption.

  • Python 3 has a stable ABI

  • Python 3 supports Unicode identifiers (so ω = Δφ / Δt is okay too), but you're better off using the good old ASCII names

  • Some libraries such as JupyterHub (Jupyter in the cloud), Django, and recent IPython only support Python 3, so features that sound useless to you are useful for libraries that you will probably want to use one day.


Code migration Issues specific to data science (and how to solve them)

Support for nested arguments was dropped:

map(lambda x, (y, z): x, z, dict.items())

However, it still works perfectly in comprehensions:

{x:z for x, (y, z) in d.items()}

In general, comprehensions also “translate” better between Python 2 and 3.

  • map() returns an iterator, and .keys(), .values(), .items() return views, instead of lists. The main problems with iterators are that there is no trivial slicing and you can't iterate over them twice. Converting the result to a list solves almost everything.

  • When in trouble, see Python FAQ: How do I port to Python 3? (https://eev.ee/blog/2016/07/31/python-faq-how-do-i-port-to-python-3/)
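The iterator pitfalls and their usual fixes can be sketched with the standard library only (the values are arbitrary):

```python
from itertools import islice

squares = map(lambda x: x * x, range(5))
first_pass = list(squares)
second_pass = list(squares)  # the iterator is already exhausted

# Slicing an iterator needs islice (or a list conversion first).
first_three = list(islice(map(lambda x: x * x, range(5)), 3))

# Dictionary views can be walked repeatedly, but they cannot be indexed.
d = {'a': 1, 'b': 2}
items = d.items()
walked_twice = [list(items), list(items)]
first_item = list(items)[0]
```

Converting to a list is the simplest fix when the data fits in memory; islice avoids materializing everything when it does not.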


Main issues in teaching machine learning and data science in Python

  • Course authors should first spend time explaining what an iterator is, and why it can't be sliced/concatenated/multiplied/iterated twice like a string (and how to deal with that).

  • I’m sure most course writers would be happy to avoid such details, but now it’s almost impossible.


Conclusion

Python 2 has coexisted with Python 3 for nearly a decade, and at this point we have to say: it's time to move on to Python 3.

Research and production code should become shorter, more readable, and significantly safer after moving to a Python 3-only codebase.

Most libraries now support both 2.x and 3.x versions. But we shouldn't wait until popular packages drop Python 2 support before starting to enjoy the new language features.

Future migrations are promised to be smoother: “we will never do this kind of backwards-incompatible change again” (snarky.ca/why-python-…).


The original address: https://github.com/arogozhnikov/python3_with_pleasure