Background

Persistence means keeping an object, even between multiple executions of the same program. In this article, you’ll get a general idea of the various persistence mechanisms for Python objects, from relational databases to Python’s pickling, among others. It also gives you a deeper understanding of Python’s object serialization capabilities.

What is persistence?

The basic idea of persistence is simple. Suppose you have a Python program, say one that manages daily to-do items, and you want to save application objects (the to-do items) between executions of the program. In other words, you want to store objects on disk for later retrieval. This is persistence. There are several ways to achieve it, each with its own advantages and disadvantages.

For example, object data can be stored in a text file in some format, such as a CSV file. Or you can use a relational database, such as Gadfly, MySQL, PostgreSQL, or DB2. These file formats and databases are excellent, and Python has robust interfaces for all of these storage mechanisms.

All of these storage mechanisms have one thing in common: the data stored is independent of the objects and programs that operate on it. The advantage of this is that the data can be used as a shared resource by other applications. The disadvantage is that this allows other programs to access the object’s data, which violates the principle of object-oriented encapsulation — that the object’s data can only be accessed through the object’s own public interface.

Also, for some applications, the relational database approach may not be ideal. In particular, relational databases do not understand objects. Instead, they impose their own type system and their own relational data model: tables, each containing a set of tuples (rows), each row containing a fixed number of statically typed fields (columns). If your application’s object model does not translate easily into a relational model, you will have difficulty mapping objects to tuples and tuples back to objects. This difficulty is often called the impedance-mismatch problem.

Object persistence

If you want to store Python objects transparently, without losing information such as their identity and type, you need some form of object serialization: the process of turning arbitrarily complex objects into a textual or binary representation. You must also be able to restore an object from its serialized form back to its original form. In Python, this serialization process is called pickling: objects can be pickled into strings, files on disk, or any file-like object, and later unpickled back into the original objects. We’ll discuss pickling in more detail later in this article.

Suppose you like to keep everything as an object and want to avoid the overhead of converting objects into some non-object-based storage. Pickle files provide these benefits, but sometimes you need something more robust and scalable than a simple pickle file. For example, pickling alone does not solve the problem of naming and locating pickle files, nor does it support concurrent access to persistent objects. If you need functionality in these areas, you can turn to a database like the ZODB (the Z Object Database for Python). The ZODB is a robust, multi-user, object-oriented database system capable of storing and managing arbitrarily complex Python objects, with support for transactions and concurrency control. (See Resources to download the ZODB.) Interestingly, even the ZODB relies on Python’s native serialization capabilities, so to use the ZODB effectively you must have a good understanding of pickling.

Another interesting solution to the persistence problem is Prevayler, which was originally implemented in Java (see Resources for a developerWorks article on Prevayler). Prevayler was recently ported to Python by a group of Python programmers under the name PyPerSyst, hosted on SourceForge (see Resources for a link to the PyPerSyst project). The Prevayler/PyPerSyst concept also builds on the native serialization capabilities of the Java and Python languages. PyPerSyst keeps the entire object system in memory and provides disaster recovery by periodically pickling a snapshot of the system to disk and by maintaining a log of commands that can be reapplied to the latest snapshot. So, while applications using PyPerSyst are limited by available memory, the advantage is that the native object system is loaded completely into memory, which makes it extremely fast, and it is simpler to implement than a database like the ZODB, which can manage more objects than fit in memory at one time.

Now that we’ve briefly discussed the various ways to store persistent objects, it’s time to explore the pickling process in more detail. Although we are primarily interested in ways to save Python objects without converting them to some other format, we still have a few areas of concern: how to effectively pickle and unpickle simple and complex objects, including instances of custom classes; how to maintain references between objects, including circular and recursive references; and how to handle changes in class definitions so that previously pickled instances can still be used without problems. We’ll cover all of these issues in the following discussion of Python’s pickling capabilities.

Some pickled Python

The pickle module and its cousin, cPickle, provide pickling support for Python. The latter is coded in C, which gives it better performance, and it is recommended for most applications. We’ll continue to talk about pickle, but the examples in this article actually use cPickle. Since most of these examples are shown in a Python shell, I’ll start by showing how to import cPickle and refer to it as pickle:

>>> import cPickle as pickle

Now that you’ve imported the module, let’s take a look at the pickle interface. The pickle module provides the following pairs of functions: dumps(object) returns a string containing the object in pickle format; loads(string) returns the object contained in the pickle string; dump(object, file) writes the object to a file, which can be an actual physical file or any file-like object that has a write() method accepting a single string argument; and load(file) returns the object contained in the pickle file.

By default, dumps() and dump() use printable ASCII representations to create pickles. Both have a final (optional) argument, which, if True, specifies that pickles are created with a faster and smaller binary representation. The loads() and load() functions automatically detect whether pickles are in binary or text format.
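The four functions can be tried together in a few lines. The sketch below is an assumption on my part rather than part of the original listings: it targets Python 3, where the C implementation is used automatically by the plain pickle module, pickles are bytes instead of strings, and the format is selected with a protocol argument (0 being the printable ASCII format). The variable names are illustrative only.

```python
import io
import pickle

record = ("this is a string", 42, [1, 2, 3], None)

# Protocol 0 is the printable ASCII format; higher protocols are binary.
text_pickle = pickle.dumps(record, protocol=0)
binary_pickle = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# loads() works out which protocol was used on its own.
from_text = pickle.loads(text_pickle)
from_binary = pickle.loads(binary_pickle)

# dump() and load() do the same job through any file-like object.
buffer = io.BytesIO()  # stands in for a file opened in binary mode
pickle.dump(record, buffer)
buffer.seek(0)
from_buffer = pickle.load(buffer)

print(from_text == record, from_binary == record, from_buffer == record)
```

An in-memory io.BytesIO buffer is used here only to keep the sketch free of temporary files; a real file opened with mode 'wb' or 'rb' should behave the same way.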

Listing 1 shows an interactive session using the dumps() and loads() functions just described:

Listing 1. Demonstration of dumps() and loads()

>>> import cPickle as pickle
>>> t1 = ('this is a string', 42, [1, 2, 3], None)
>>> t1
('this is a string', 42, [1, 2, 3], None)
>>> p1 = pickle.dumps(t1)
>>> p1
"(S'this is a string'\nI42\n(lp1\nI1\naI2\naI3\naNtp2\n."
>>> print p1
(S'this is a string'
I42
(lp1
I1
aI2
aI3
aNtp2
.
>>> t2 = pickle.loads(p1)
>>> t2
('this is a string', 42, [1, 2, 3], None)
>>> p2 = pickle.dumps(t1, True)
>>> p2
'(U\x10this is a stringK*]q\x01(K\x01K\x02K\x03eNtq\x02.'
>>> t3 = pickle.loads(p2)
>>> t3
('this is a string', 42, [1, 2, 3], None)

Note: The text pickle format is quite simple and will not be explained here. In fact, all of the conventions it uses are documented in the pickle module. We should also point out that our examples use simple objects, so the binary pickle format does not show much of a space saving here. In real systems with complex objects, however, the binary format can give significant improvements in both size and speed.

Next, let’s look at some examples that use dump() and load(), which work with files and file-like objects. These functions operate much like the dumps() and loads() we just saw, except that they have one additional capability: dump() can write several objects to the same file, one after another. Subsequent calls to load() then retrieve those objects in the same order. Listing 2 shows this capability in action:

Listing 2. dump() and load() examples

>>> a1 = 'apple'  
>>> b1 = {1: 'One', 2: 'Two', 3: 'Three'}  
>>> c1 = ['fee', 'fie', 'foe', 'fum']  
>>> f1 = file('temp.pkl', 'wb')  
>>> pickle.dump(a1, f1, True)  
>>> pickle.dump(b1, f1, True)  
>>> pickle.dump(c1, f1, True)  
>>> f1.close()  
>>> f2 = file('temp.pkl', 'rb')  
>>> a2 = pickle.load(f2)  
>>> a2  
'apple'  
>>> b2 = pickle.load(f2)  
>>> b2  
{1: 'One', 2: 'Two', 3: 'Three'}  
>>> c2 = pickle.load(f2)  
>>> c2  
['fee', 'fie', 'foe', 'fum']  
>>> f2.close()  
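When the number of objects in a pickle file is not known in advance, one common approach is to keep calling load() until the file runs out. The following is a minimal sketch under the assumption of Python 3, where load() raises EOFError at the end of the stream; the helper name load_all is hypothetical and not part of the pickle module.

```python
import io
import pickle


def load_all(stream):
    """Yield every pickled object in the stream until it is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return


buffer = io.BytesIO()  # stands in for a file opened in binary mode
for item in ("apple", {1: "One", 2: "Two"}, ["fee", "fie", "foe", "fum"]):
    pickle.dump(item, buffer)

buffer.seek(0)
restored = list(load_all(buffer))
print(restored)
```

The objects come back in the order in which they were dumped, which is the same guarantee the listing relies on.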

The power of Pickle

So far, we’ve covered the basics of pickling. In this section, I’ll discuss some of the more advanced issues you encounter when you start pickling complex objects, including instances of custom classes. Fortunately, Python can handle this situation easily.

Pickles are portable across space and time. In other words, the pickle file format is independent of machine architecture, which means that you can, for example, create a pickle under Linux and send it to a Python program running under Windows or Mac OS. And when you upgrade to a newer version of Python, you don’t have to worry about losing your existing pickles: the Python developers have guaranteed that the pickle format will remain backwards-compatible across Python versions. In fact, the pickle module itself provides details about the current and compatible formats:

Listing 3. Retrieving supported formats

>>> pickle.format_version
'1.3'
>>> pickle.compatible_formats
['1.0', '1.1', '1.2']
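One way to see this compatibility at work is to write the same object with every protocol the interpreter knows and read each result back. This sketch assumes Python 3, where protocols are numbered from 0 up to pickle.HIGHEST_PROTOCOL; the sample data is made up for illustration.

```python
import pickle

data = {"name": "to-do", "items": ["write", "test"]}

# Pickle the same object with every protocol this interpreter supports.
pickles = {
    protocol: pickle.dumps(data, protocol=protocol)
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

# The same loads() call reads all of them back, old formats included.
results = {protocol: pickle.loads(blob) for protocol, blob in pickles.items()}
all_equal = all(value == data for value in results.values())

print(sorted(pickles), all_equal)
```

Reading is the direction that stays compatible: an older interpreter is not expected to understand a protocol that was introduced after it.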

Multiple references, same object

In Python, a variable is a reference to an object, and several variables can refer to the same object. As it turns out, Python has no trouble maintaining this behavior with pickled objects, as shown in Listing 4:

Listing 4. Object reference maintenance

>>> a = [1, 2, 3]  
>>> b = a  
>>> a  
[1, 2, 3]  
>>> b  
[1, 2, 3]  
>>> a.append(4)  
>>> a  
[1, 2, 3, 4]  
>>> b  
[1, 2, 3, 4]  
>>> c = pickle.dumps((a, b))  
>>> d, e = pickle.loads(c)  
>>> d  
[1, 2, 3, 4]  
>>> e  
[1, 2, 3, 4]  
>>> d.append(5)  
>>> d  
[1, 2, 3, 4, 5]  
>>> e  
[1, 2, 3, 4, 5]  
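The same behavior can be checked with identity tests instead of by reading the printed lists. This is a small sketch assuming Python 3's pickle module; the result names are illustrative.

```python
import pickle

a = [1, 2, 3]
b = a  # two names, one list

# Pickled together in one tuple, the shared reference survives the round trip.
d, e = pickle.loads(pickle.dumps((a, b)))
d.append(4)

shared_after = d is e      # the restored names still share one list
independent = d is not a   # but that list is a copy of the original

print(shared_after, independent, e, a)
```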

Circular and recursive references

You can extend the object reference support just demonstrated to circular references (where two objects contain references to each other) and recursive references (where an object contains references to itself). The following two listings highlight this capability. Let’s first look at recursive references:

Listing 5. Recursive references

>>> l = [1, 2, 3]  
>>> l.append(l)  
>>> l  
[1, 2, 3, [...]]  
>>> l[3]  
[1, 2, 3, [...]]  
>>> l[3][3]  
[1, 2, 3, [...]]  
>>> p = pickle.dumps(l)  
>>> l2 = pickle.loads(p)  
>>> l2  
[1, 2, 3, [...]]  
>>> l2[3]  
[1, 2, 3, [...]]  
>>> l2[3][3]  
[1, 2, 3, [...]]  

Now, look at an example of a circular reference:

Listing 6. Circular reference

>>> a = [1, 2]  
>>> b = [3, 4]  
>>> a.append(b)  
>>> a  
[1, 2, [3, 4]]  
>>> b.append(a)  
>>> a  
[1, 2, [3, 4, [...]]]  
>>> b  
[3, 4, [1, 2, [...]]]  
>>> a[2]  
[3, 4, [1, 2, [...]]]  
>>> b[2]  
[1, 2, [3, 4, [...]]]  
>>> a[2] is b  
1  
>>> b[2] is a  
1  
>>> f = file('temp.pkl', 'w')  
>>> pickle.dump((a, b), f)  
>>> f.close()  
>>> f = file('temp.pkl', 'r')  
>>> c, d = pickle.load(f)  
>>> f.close()  
>>> c  
[1, 2, [3, 4, [...]]]  
>>> d  
[3, 4, [1, 2, [...]]]  
>>> c[2]  
[3, 4, [1, 2, [...]]]  
>>> d[2]  
[1, 2, [3, 4, [...]]]  
>>> c[2] is d  
1  
>>> d[2] is c  
1  
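Because the [...] in a printed list hides what the nested reference actually points to, identity checks make the result easier to verify. The sketch below assumes Python 3 and repeats both the recursive and the circular case in one place.

```python
import pickle

# Recursive: a list that contains itself.
l = [1, 2, 3]
l.append(l)
l2 = pickle.loads(pickle.dumps(l))

# Circular: two lists that contain each other.
a = [1, 2]
b = [3, 4]
a.append(b)
b.append(a)
c, d = pickle.loads(pickle.dumps((a, b)))

recursive_ok = l2[3] is l2
circular_ok = c[2] is d and d[2] is c

print(recursive_ok, circular_ok)
```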

Note that if you pickle each object individually, rather than all together in a tuple, you get slightly different (but important) results, as shown in Listing 7:

Listing 7. Pickle separately vs. pickle together in a tuple

>>> f = file('temp.pkl', 'w')  
>>> pickle.dump(a, f)  
>>> pickle.dump(b, f)  
>>> f.close()  
>>> f = file('temp.pkl', 'r')  
>>> c = pickle.load(f)  
>>> d = pickle.load(f)  
>>> f.close()  
>>> c  
[1, 2, [3, 4, [...]]]  
>>> d  
[3, 4, [1, 2, [...]]]  
>>> c[2]  
[3, 4, [1, 2, [...]]]  
>>> d[2]  
[1, 2, [3, 4, [...]]]  
>>> c[2] is d  
0  
>>> d[2] is c  
0  
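The difference comes from the fact that each module-level dump() call starts with no memory of earlier calls. A compact Python 3 sketch of the same experiment, using an in-memory buffer in place of temp.pkl (an assumption made to keep it self-contained):

```python
import io
import pickle

a = [1, 2]
b = [3, 4]
a.append(b)
b.append(a)

buffer = io.BytesIO()
pickle.dump(a, buffer)  # writes a together with its own copy of b
pickle.dump(b, buffer)  # writes b together with its own copy of a
buffer.seek(0)
c = pickle.load(buffer)
d = pickle.load(buffer)

linked = c[2] is d               # the two restored structures are separate
same_contents = c[2][:2] == d[:2]  # even though their contents match

print(linked, same_contents)
```

Only the first two elements are compared, because comparing the full circular lists with == would recurse without end.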

Equal, but not always the same

As the previous example implies, two objects are identical only if they refer to the same object in memory. When an object is unpickled, it is restored as an object that is equal to the original, but not the same object. In other words, each unpickled object is a copy of the original:

Listing 8. Restored object as a copy of the original object

>>> j = [1, 2, 3]  
>>> k = j  
>>> k is j  
1  
>>> x = pickle.dumps(k)  
>>> y = pickle.loads(x)  
>>> y  
[1, 2, 3]  
>>> y == k  
1  
>>> y is k  
0  
>>> y is j  
0  
>>> k is j  
1  

Earlier, we saw that Python maintains references between objects that are pickled as a unit. However, we also saw that separate calls to dump() make it impossible for Python to maintain references to objects pickled outside that unit. Instead, Python makes a copy of the referenced object and stores the copy with the pickled object. This is fine for applications that pickle and restore a single object hierarchy, but be aware that there are other scenarios.

It’s worth pointing out that there is an option that lets you pickle objects separately and still maintain their references to one another, as long as they are all pickled to the same file. The pickle and cPickle modules provide a Pickler class (and a corresponding Unpickler) that keeps track of the objects that have already been pickled. With such a Pickler, shared and circular references are pickled by reference rather than by value:

Listing 9. Maintaining references between the separately pickled objects

>>> f = file('temp.pkl', 'w')  
>>> pickler = pickle.Pickler(f)  
>>> pickler.dump(a)  
<cPickle.Pickler object at 0x89b0bb8>  
>>> pickler.dump(b)  
<cPickle.Pickler object at 0x89b0bb8>  
>>> f.close()  
>>> f = file('temp.pkl', 'r')  
>>> unpickler = pickle.Unpickler(f)  
>>> c = unpickler.load()  
>>> d = unpickler.load()  
>>> c[2]  
[3, 4, [1, 2, [...]]]  
>>> d[2]  
[1, 2, [3, 4, [...]]]  
>>> c[2] is d  
1  
>>> d[2] is c  
1  
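For readers on Python 3, the same technique might look like the sketch below. It assumes that one Pickler writes both objects and that one Unpickler reads them back, since the bookkeeping lives in those two objects; an in-memory buffer again stands in for temp.pkl.

```python
import io
import pickle

a = [1, 2]
b = [3, 4]
a.append(b)
b.append(a)

buffer = io.BytesIO()
pickler = pickle.Pickler(buffer)
pickler.dump(a)  # the pickler now remembers both a and b
pickler.dump(b)  # so this writes a back-reference, not a second copy

buffer.seek(0)
unpickler = pickle.Unpickler(buffer)  # one unpickler for both loads
c = unpickler.load()
d = unpickler.load()

linked = c[2] is d and d[2] is c
print(linked)
```

Reading the second object with a fresh Unpickler would not work in this arrangement, because the back-reference only has meaning to the unpickler that restored the first object.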

Objects that are not pickleable

Some object types are not pickleable. For example, Python cannot pickle a file object (or any object that holds a reference to a file object), because Python cannot guarantee that it can reconstruct the state of the file when the object is unpickled (there are other examples, but they are too obscure to go into in an article like this). Attempting to pickle a file object results in the following error:

Listing 10. Result of attempting to pickle a file object

>>> f = file('temp.pkl', 'w')  
>>> p = pickle.dumps(f)  
Traceback (most recent call last):  
  File "<input>", line 1, in ?  
  File "/usr/lib/python2.2/copy_reg.py", line 57, in _reduce  
    raise TypeError, "can't pickle %s objects" % base.__name__  
TypeError: can't pickle file objects  
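The restriction still applies in current interpreters, although the wording of the message differs between versions. A minimal sketch assuming Python 3, which catches the exception instead of letting it end the session:

```python
import os
import pickle

# os.devnull is used only so the sketch leaves no file behind.
with open(os.devnull, "w") as handle:
    try:
        pickle.dumps(handle)
    except TypeError as exc:
        error = exc  # open file objects refuse to be pickled
    else:
        error = None

print(type(error).__name__)
```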

Class instances

Pickling class instances requires more care than pickling simple object types. This is mainly because Python pickles the instance data (usually the __dict__ attribute) and the name of the class, but not the code of the class. When Python unpickles an instance of a class, it tries to import the module containing the class definition using the exact class name and module name (including the path prefix of any package) that were in effect when the instance was pickled. Also note that class definitions must appear at the top level of a module, which means they cannot be nested classes (classes defined inside other classes or functions).

When instances of classes are unpickled, their __init__() method is normally not called again. Instead, Python creates a generic class instance, applies the pickled instance attributes, and sets the instance’s __class__ attribute to point to the original class.
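A counter on the class makes it easy to observe that the initializer is skipped. The sketch below assumes Python 3 and a class defined at the top level of the script being run; the Counter class and its attribute names are hypothetical.

```python
import pickle


class Counter:
    init_calls = 0  # class attribute: counts how often __init__ runs

    def __init__(self, start):
        Counter.init_calls += 1
        self.value = start


original = Counter(10)
restored = pickle.loads(pickle.dumps(original))

calls = Counter.init_calls  # still 1: only `original` went through __init__
same_class = type(restored) is Counter

print(calls, same_class, restored.value)
```

Anything __init__ does beyond setting picklable attributes (opening files, registering callbacks) therefore has to be redone some other way after unpickling.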

The mechanism for unpickling the new-style classes introduced in Python 2.2 differs slightly from the original one. Python uses the copy_reg module’s _reconstructor() function to restore instances of new-style classes, although the end result is essentially the same as for old-style classes.

If you want to modify the default pickling behavior for instances of new-style or old-style classes, you can define the special methods __getstate__() and __setstate__(), which Python calls when saving and restoring the state of class instances. We’ll see examples that take advantage of these special methods in the sections below.

Now, let’s look at a simple class instance. First, create a Python module named persist.py that contains the following new-style class definition:

Listing 11. New-style class definition

class Foo(object):  
    def __init__(self, value):  
        self.value = value 

Now you can pickle an instance of Foo and look at its representation:

Listing 12. Pickled Foo instance

>>> import cPickle as pickle
>>> from Orbtech.examples.persist import Foo
>>> foo = Foo('What is a Foo?')
>>> p = pickle.dumps(foo)
>>> print p
ccopy_reg
_reconstructor
p1
(cOrbtech.examples.persist
Foo
p2
c__builtin__
object
p3
NtRp4
(dp5
S'value'
p6
S'What is a Foo?'
sb.

You can see that the class name, Foo, and the fully qualified module name, Orbtech.examples.persist, are both stored in the pickle. If you pickle this instance into a file and unpickle it later, or unpickle it on another machine, Python tries to import the Orbtech.examples.persist module and raises an exception if it cannot. A similar error occurs if you rename the class or the module, or move the module to a different directory.

Here is an example of Python issuing an error message when we rename class Foo and then try to load a previously pickled instance of Foo:

Listing 13. Attempting to load a pickled instance of a renamed class Foo

>>> import cPickle as pickle  
>>> f = file('temp.pkl', 'r')  
>>> foo = pickle.load(f)  
Traceback (most recent call last):  
  File "<input>", line 1, in ?  
AttributeError: 'module' object has no attribute 'Foo'  

A similar error occurs after renaming the persist.py module:

Listing 14. Attempting to load a pickled instance after renaming the persist.py module

>>> import cPickle as pickle  
>>> f = file('temp.pkl', 'r')  
>>> foo = pickle.load(f)  
Traceback (most recent call last):  
  File "<input>", line 1, in ?  
ImportError: No module named persist  

We’ll provide techniques for managing such changes without breaking existing pickles in the section on schema evolution below.

Special state methods

I mentioned earlier that some object types (for example, file objects) cannot be pickled. When an instance has attributes that refer to such unpickleable objects, you can use the special methods __getstate__() and __setstate__() to control which state is pickled and how it is restored. Here is a version of class Foo that we have modified to handle a file object attribute:

Listing 15. Handling instance attributes that cannot be pickled

class Foo(object):  
    def __init__(self, value, filename):  
        self.value = value  
        self.logfile = file(filename, 'w')  
    def __getstate__(self):  
        """Return state values to be pickled."""  
        f = self.logfile  
        return (self.value, f.name, f.tell())  
    def __setstate__(self, state):  
        """Restore state from the unpickled state values."""  
        self.value, name, position = state  
        f = file(name, 'r+')  # reopen without truncating, so the saved position is still valid
        f.seek(position)  
        self.logfile = f  
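A runnable variant for Python 3 could look like the following sketch. It rests on a few assumptions that are not in the original listing: the built-in open() replaces file(), the log is reopened in 'r+' mode so that existing content is kept, the buffer is flushed before the position is recorded, and a temporary directory holds the hypothetical foo.log file.

```python
import os
import pickle
import tempfile


class Foo:
    def __init__(self, value, filename):
        self.value = value
        self.logfile = open(filename, "w")

    def __getstate__(self):
        """Return picklable state: the file is reduced to name and offset."""
        f = self.logfile
        f.flush()  # make sure everything written so far is on disk
        return (self.value, f.name, f.tell())

    def __setstate__(self, state):
        """Reopen the log file and move back to the saved offset."""
        self.value, name, position = state
        f = open(name, "r+")  # "r+" keeps the existing contents
        f.seek(position)
        self.logfile = f


path = os.path.join(tempfile.mkdtemp(), "foo.log")
foo = Foo("What is a Foo?", path)
foo.logfile.write("started\n")

clone = pickle.loads(pickle.dumps(foo))
clone.logfile.write("resumed\n")

foo.logfile.close()
clone.logfile.close()
with open(path) as f:
    log_lines = f.read().splitlines()

print(clone.value, log_lines)
```

The restored instance continues writing where the original left off, which is the point of storing the file position alongside the file name.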

When an instance of Foo is pickled, Python pickles only the value returned by the instance’s __getstate__() method. Similarly, when the instance is unpickled, Python passes the unpickled value as an argument to the instance’s __setstate__() method. Inside __setstate__(), the file object is reconstructed from the pickled name and position information and assigned to the instance’s logfile attribute.

Schema evolution

Over time, you may find yourself having to change a class definition. If you have already pickled instances of the class and now need to change it, you will probably want to retrieve and update those instances so that they continue to work under the new class definition. We have already seen some of the errors that occur when certain changes are made to a class or module. Fortunately, the pickling and unpickling processes provide hooks that we can use to support this need for schema evolution.

In this section, we’ll look at ways to anticipate common problems and how to solve them. Because the code of a class is not pickled along with its instances, methods can be added, changed, and removed without affecting existing pickled instances. For the same reason, you don’t have to worry about class attributes. You must ensure that the module containing the class definition is available in the environment where instances are unpickled. You must also plan for changes that can cause unpickling problems, such as changing the class name, adding or removing instance attributes, and changing the name or location of the module that defines the class.

Class name changes

To change the class name without breaking previously pickled instances, follow these steps. First, leave the definition of the original class unchanged so that it can still be found when an existing instance is unpickled. Rather than renaming it, create a copy of the class definition in the same module as the original and give the copy the new class name. Then add the following method to the original class definition, replacing NewClassName with the actual name of the new class:

Listing 16. Class name change: method added to the original class definition

def __setstate__(self, state):  
    self.__dict__.update(state)  
    self.__class__ = NewClassName  
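To see the whole sequence in one place, here is a minimal sketch assuming Python 3 and classes defined at the top level of the script. OldClassName is a hypothetical stand-in for the original class, and both classes are kept deliberately tiny.

```python
import pickle


class NewClassName:
    def __init__(self, value):
        self.value = value


class OldClassName:
    def __init__(self, value):
        self.value = value

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.__class__ = NewClassName  # hand the instance to the new class


# A pickle written while the class still had its old name.
old_pickle = pickle.dumps(OldClassName("legacy"))

migrated = pickle.loads(old_pickle)
repickled = pickle.dumps(migrated)  # now stored under the new name

print(type(migrated).__name__, migrated.value)
```

Once an instance has been re-pickled this way, its pickle no longer mentions the old class, which is what eventually allows the old definition to be deleted.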

When an existing instance is unpickled, Python looks up the definition of the original class, calls the instance’s __setstate__() method, and reassigns the instance’s __class__ attribute to the new class definition. Once you are sure that all existing instances have been unpickled, updated, and re-pickled, you can remove the old class definition from the source module.

Attribute addition and removal

The special state methods __getstate__() and __setstate__() once again give us control over the state of each instance and the opportunity to handle changes in instance attributes. Let’s look at a simple class definition to which we will add and remove attributes. This is the original definition:

Listing 17. Initial class definition

class Person(object):  
    def __init__(self, firstname, lastname):  
        self.firstname = firstname  
        self.lastname = lastname

Suppose that instances of Person have already been created and pickled, and we now decide that we really want to store a single name attribute rather than separate first and last names. Here is one way to change the class definition while migrating previously pickled instances to the new definition:

Listing 18. New class definition

class Person(object):  
    def __init__(self, fullname):  
        self.fullname = fullname  
    def __setstate__(self, state):  
        if 'fullname' not in state:  
            first = ''  
            last = ''  
            if 'firstname' in state:  
                first = state['firstname']  
                del state['firstname']  
            if 'lastname' in state:  
                last = state['lastname']  
                del state['lastname']  
            self.fullname = " ".join([first, last]).strip()  
        self.__dict__.update(state)  
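The migration can be exercised without keeping an old pickle file around. The sketch below assumes Python 3 and simulates an instance saved under the old definition by bypassing __init__ and filling in the old attributes directly; the sample name is made up, and dict.pop() is used as a shorter equivalent of the test-and-delete steps in the listing.

```python
import pickle


class Person:
    def __init__(self, fullname):
        self.fullname = fullname

    def __setstate__(self, state):
        if "fullname" not in state:
            # Old-format state: merge the two name fields into one.
            first = state.pop("firstname", "")
            last = state.pop("lastname", "")
            self.fullname = " ".join([first, last]).strip()
        self.__dict__.update(state)


# Stand-in for an instance pickled under the old class definition.
legacy = Person.__new__(Person)
legacy.__dict__.update(firstname="Ada", lastname="Lovelace")
old_pickle = pickle.dumps(legacy)

person = pickle.loads(old_pickle)
print(person.fullname, sorted(person.__dict__))
```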

In this example, we added a new attribute, fullname, and removed the two existing attributes firstname and lastname. When a previously pickled instance is unpickled, its previously pickled state is passed to __setstate__() as a dictionary that includes the values of the firstname and lastname attributes. We combine the two values and assign the result to the new fullname attribute, removing the old attributes from the state dictionary along the way. Once all previously pickled instances have been updated and re-pickled, you can remove the __setstate__() method from the class definition.

Module modifications

Conceptually, a change in the name or location of a module is similar to a change in the name of a class, but it has to be handled quite differently. That’s because the module information is stored in the pickle itself and is not an attribute that can be modified through the standard pickle interface. In fact, the only way to change the module information is to perform a find-and-replace operation on the actual pickle file. Exactly how to do this depends on your operating system and the tools you have available. Obviously, this is a situation where you will want to back up your files first in case you make a mistake. But the change should be fairly straightforward, and it should work just as well with the binary pickle format as with the text pickle format.
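The find-and-replace approach reflects the tools of the Python version the article targets. In Python 3, a possible alternative is to subclass pickle.Unpickler and override its documented find_class() hook so that old module names are translated while loading. The sketch below is only an illustration under that assumption: the module names old_persist and new_persist and the RenamingUnpickler class are hypothetical, and the module move is simulated in memory with types.ModuleType instead of real files.

```python
import io
import pickle
import sys
import types


class Foo:
    def __init__(self, value):
        self.value = value


# Simulate Foo living in a module called old_persist when it was pickled.
old_module = types.ModuleType("old_persist")
old_module.Foo = Foo
Foo.__module__ = "old_persist"
sys.modules["old_persist"] = old_module
data = pickle.dumps(Foo("moved"))

# Simulate the module being renamed to new_persist afterwards.
del sys.modules["old_persist"]
new_module = types.ModuleType("new_persist")
new_module.Foo = Foo
Foo.__module__ = "new_persist"
sys.modules["new_persist"] = new_module


class RenamingUnpickler(pickle.Unpickler):
    """Unpickler that maps old module names to their new locations."""

    RENAMED = {"old_persist": "new_persist"}

    def find_class(self, module, name):
        return super().find_class(self.RENAMED.get(module, module), name)


try:
    pickle.loads(data)  # the plain loader still looks for old_persist
except ImportError as exc:
    plain_error = exc

restored = RenamingUnpickler(io.BytesIO(data)).load()
print(type(plain_error).__name__, restored.value)
```

Compared with editing pickle files by hand, this leaves the stored data untouched; the instances can then be re-pickled under the new module name whenever convenient.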

From CobbLiu’s blog