Not always “big is better”: the importance of choosing data types – An example with CPython

Ever since I began my adventure with computers back in 1982 (an adventure that later turned into a passion and finally into a profession), I can say that I have had the joy, as well as the privilege, of watching and touching first-hand the progress of this marvellous technology (progress which, looking back, came along quite rapidly).

In the beginning there were 8 bits, one could paraphrase (at least for me: others were fortunate enough to enjoy 4 bits as well), because the microprocessors used in the home computers of the time (the most widespread and affordable for the masses) manipulated data of that size with ease, and larger data far less easily, since that required more instructions to be executed, with a considerable impact on performance.

The hunger for computing power, as well as the availability of more and more transistors packed into the small area of chips, led over time to the flourishing of 16-bit systems, then 32-bit systems, and in recent years, of course, 64-bit systems (which are the only ones now supported in mainstream systems).

It is easy, therefore, to understand how the idea that “more bits = more power” has taken root in the collective imagination (as we discussed in an old, but still relevant, article, in Italian, on the subject: From the MHz Myth to the Bits Myth…), so that hardware has evolved towards the manipulation of ever larger data and, consequently, software has been adapted or written to make use of it.

All magnificent, considering the level of complexity software has reached, which has made it possible to bring more and more people closer to technology, in a progressively simpler and more convenient manner, extending its benefits to practically the entire community.

On the other hand, even for programmers, the transition was quick and easy: to exploit ever larger data, it was enough to use the appropriate (perhaps new) data type.

Or not to lift a finger and leave all the burden to the compiler. For example in C (as well as in Pascal and other languages), the integer data type (int) does not have a precise size, but is defined by the platform (ABI) and/or the compiler itself. Usually in 8-bit systems it was 16-bit, in 16- or 32-bit systems it could be 16- or 32-bit, and finally 32- or 64-bit in 64-bit systems.
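To see this concretely (nothing Python-specific here, just standard C), a quick check shows how much is left to the platform, and how the fixed-width types of <stdint.h> sidestep the ambiguity:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* The sizes of int, long and pointers depend on the platform/ABI... */
    printf("int:     %zu bytes\n", sizeof(int));
    printf("long:    %zu bytes\n", sizeof(long));
    printf("void *:  %zu bytes\n", sizeof(void *));
    /* ...whereas the <stdint.h> types are always exactly as wide as their name says. */
    printf("int32_t: %zu bytes\n", sizeof(int32_t));
    printf("int64_t: %zu bytes\n", sizeof(int64_t));
    return 0;
}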

The flip side of the coin, however, is that bigger data obviously requires more space to be stored in any device (from memories to storage systems), which in a system also inevitably translates into more bandwidth being used towards memory (in general, towards the entire hierarchy: from RAM to L3/L2/L1 cache).

This was not a big problem if we consider that technological advances have made available huge amounts of memory and storage, as well as the bandwidth needed to access them, but the advent first of processors, and later of video cards, with several bandwidth-hungry integrated cores has highlighted problems in sharing and utilising this precious resource.

Accompanied also by the slowing down of such progress (whatever one says, the infamous Moore’s Law is destined to come to an end, and we are approaching it), the problem has become, and will only become, more and more relevant.

Hence the need to make the best possible use of the resources we have, without indulging in the excesses encouraged by the explosion of their ever-increasing quantities.

Memory usage in (C)Python

Having just attended the Italian PyCon (a conference of enthusiasts of this marvellous language), and drawing on my past experience (WPython) with the most widespread/mainstream implementation of Python (CPython), I felt like dusting off an old idea and describing a possible optimisation for 64-bit systems, after attending some presentations about reference counting and the GIL.

Without going into too much detail, we can say that any data in Python (CPython, to be exact: from now on I will always refer to this implementation, which is the most widespread and the one practically all of us use) is built on a basic structure (a struct, in C), regardless of the particular data type, whose definition essentially boils down to this:

typedef struct {
    Py_ssize_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

Here, too, I have simplified a lot, leaving out several technical details that are not relevant to the purpose of this article. What is really important is to understand that the basic building block of any Python object (because, in this language, every piece of data is an object) is made up of two key pieces of information: the number of times (ob_refcnt) that this object is referenced by other objects (because it is contained in a list, for example) and its type (ob_type). Any other object, in fact, will always have these two elements at its base, to which it may add others of its own.
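To make the role of ob_refcnt more concrete, here is a toy sketch of reference counting (deliberately ignoring CPython’s real Py_INCREF/Py_DECREF machinery, immortal objects, free lists and so on): an object is freed only when its last reference disappears.

#include <stdlib.h>

struct type_placeholder;                 /* stands in for PyTypeObject here */

typedef struct {
    long ob_refcnt;                      /* how many references point to this object */
    struct type_placeholder *ob_type;    /* what kind of object this is */
} toy_object;

static void toy_incref(toy_object *op) { op->ob_refcnt++; }

static void toy_decref(toy_object *op)
{
    if (--op->ob_refcnt == 0)
        free(op);                        /* last reference gone: reclaim the object */
}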

In memory, its representation will be as follows for a 32-bit system:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_type | Type of the object

whereas in a 64-bit system it will be (without any major surprises) this:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
8 | ob_type | Type of the object

Pointers, in fact, double in size going from 32 to 64 bits, and the same happens to Py_ssize_t, the signed integer type used here for the reference count (which is normally the same size as a pointer). Address/Offset obviously refers to the position at which that particular element is located in memory, starting from the beginning of the structure.

Apparently, there is nothing that can be changed, and this reflects the point made earlier: by moving to architectures that manipulate larger data, the size of the data types used increases accordingly, which is reflected in the memory occupation (and the associated bandwidth to access it).

In reality, and knowing the structure of other data types defined in Python, as well as the way in which these are routinely used, one can think of a different solution that saves a little memory and bandwidth for 64-bit systems (for which some #ifdef will be needed here, which I will omit from the examples that follow for simplicity’s sake).

Redefining PyObject & co.

To achieve this, it is first necessary to define two new data types:

Name | Primitive Type | Size (bytes) | Description
Py_refcnt_t | int | 4 | Maximum number of reference counts
Py_seq_len_t | unsigned int | 4 | Maximum number of elements in sequences/containers (or strings)

which, as described, must be used respectively for the type relating to ob_refcnt, and for the type (used in other data types, as we shall see later) indicating the length of a string or, in general, the number of elements contained in sequences (lists, tuples, sets, etc.) or similar (dictionaries, …).
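A minimal sketch of how the two types could be defined, guarded by the #ifdef mentioned earlier (PY_LARGE_MODEL is a hypothetical build flag, not an existing CPython macro; the fallback branch simply mirrors today’s pointer-sized behaviour on 64-bit systems):

#include <stdint.h>

#ifdef PY_LARGE_MODEL
typedef int32_t  Py_refcnt_t;    /* 32-bit signed reference count */
typedef uint32_t Py_seq_len_t;   /* 32-bit unsigned length/element count */
#else
typedef int64_t  Py_refcnt_t;    /* pointer-sized, as Py_ssize_t is today on 64 bits */
typedef int64_t  Py_seq_len_t;
#endif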

At this point, the basic PyObject type would become:

typedef struct {
    Py_refcnt_t ob_refcnt;
    PyTypeObject *ob_type;
} PyObject;

which in memory would be represented as:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | (padding) | Unused “hole”
8 | ob_type | Type of the object

Now the situation would even appear to have worsened, as a “hole” has been created in the middle of the structure (necessary to align the ob_type pointer to its natural size, that is 8 bytes) which is therefore unused.

In addition, using a 32-bit signed integer for ob_refcnt has also greatly reduced the number of possible references to an object, which has gone from approximately 9 billion billion (2^63) to about 2 billion (2^31). But more on this later.

The easiest and most straightforward thing to do is to try to use this empty space in some way, e.g. to indicate the length of a string, or the number of elements in a sequence, etc., or, in general, to leave it as a 32-bit field available for other data types.

A new definition of PyObject could, therefore, be as follows:

typedef struct {
    Py_refcnt_t ob_refcnt;
    union {
       Py_seq_len_t ob_size; /* Number of items in variable part */
       Py_seq_len_t ob_user_data; /* Free for other data types */
    };
    PyTypeObject *ob_type;
} PyObject;

whose representation in memory would be:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object

At this point, the advantage is already apparent: we have gained an additional field while occupying the same amount of memory, managing to pack three very important pieces of Python information into just 16 bytes.
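This expectation can even be checked at compile time. The following is a self-contained sketch (hypothetical names, assuming a 64-bit ABI where pointers are 8 bytes) that asserts the 16-byte layout described above:

#include <stdint.h>
#include <stddef.h>

struct type_placeholder;               /* stands in for PyTypeObject */

typedef struct {
    int32_t ob_refcnt;
    union {
        uint32_t ob_size;              /* number of items in variable part */
        uint32_t ob_user_data;         /* free for other data types */
    };
    struct type_placeholder *ob_type;
} obj_header;                          /* hypothetical stand-in for the new PyObject */

_Static_assert(sizeof(obj_header) == 16, "expected 16 bytes on a 64-bit ABI");
_Static_assert(offsetof(obj_header, ob_type) == 8, "pointer expected at offset 8");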

In contrast, the current implementation for objects that have a “length” property/attribute, which generally derive from the PyVarObject type:

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

requires at least 24 bytes:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
8 | ob_type | Type of the object
16 | ob_size | Number of items in variable part
24 | (padding) | To align the memory to a 16-byte granularity

In reality, and as can be seen, the space occupied is 32 bytes, because the allocated memory has a granularity of 16 bytes. So the space actually occupied by PyVarObject is double that of the proposed new implementation, which at this point simply becomes:

typedef PyObject PyVarObject;

thus a trivial alias of PyObject.

The first heir of PyVarObject: PyTupleObject

These changes are immediately reflected in all objects derived from PyVarObject. There are several of them in Python, but we will only consider a few, purely to show situations where there are benefits from this new structure and others where there are no benefits.

For example, tuples, which are among the most commonly used objects/sequences in Python, are transformed in this way (again simplifying the structure for purely educational purposes):

typedef struct {
    PyVarObject ob_base;
    /* ob_item contains space for 'ob_size' elements.
       Items must normally not be NULL, except during construction when
       the tuple is not yet visible outside the function that builds it. */
    PyObject *ob_item[1];
} PyTupleObject;

Being a dynamic object, whose size is determined by the elements it contains, the space it occupies also varies accordingly, in proportion to their number. For example, the empty tuple (which has no elements) coincides with PyVarObject (so there is a saving with the new structure), while the one with only one element has the following memory representation:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | ob_item[0] | First element of the tuple
24 | (padding) | To align the memory to a 16-byte granularity

so it occupies the same space as the current Python implementation. Whereas with two elements:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | ob_item[0] | First element of the tuple
24 | ob_item[1] | Second element of the tuple

this time with a saving. And so on. So there is an advantage only for tuples with an even number of elements.
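A quick back-of-the-envelope check of this claim, under the simplified layouts above and the 16-byte allocation granularity already mentioned (the header sizes, 24 versus 16 bytes, are assumptions taken from the tables in this article):

#include <stdio.h>

/* Round a size up to the allocator's 16-byte granularity. */
static size_t round16(size_t n) { return (n + 15) & ~(size_t)15; }

int main(void)
{
    for (size_t n = 0; n <= 4; n++) {
        size_t current  = round16(24 + n * 8);  /* 24-byte header + n pointers */
        size_t proposed = round16(16 + n * 8);  /* 16-byte header + n pointers */
        printf("tuple with %zu item(s): current %2zu bytes, proposed %2zu bytes\n",
               n, current, proposed);
    }
    return 0;
}

As expected, the two sizes coincide for tuples with an odd number of elements and differ by 16 bytes for those with an even number (the empty tuple included).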

Now it is the turn of PyListObject

An extremely used type which, however, always has advantages with the new structure is the list:

typedef struct {
    PyVarObject ob_base;
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_seq_len_t allocated;
} PyListObject;

This is because, although it is a dynamic object (due to the elements it contains), its basic structure is always the same and never changes. In memory, it is represented as:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | ob_item | Vector of pointers to list elements. list[0] is ob_item[0], etc.
24 | allocated | Number of elements which were effectively allocated in memory
28 | (padding) | Could be used by list.sort() to detect mutations

The current implementation in Python, however, requires 48 bytes (instead of the 32 of the new one!).

Strings are more complicated!

Another widely used type that would very often benefit from the new representation is the string, but strings are handled internally by Python using three different types/structures (each one an extension of the previous). Starting with the simplest (and most common!), which would become:

typedef struct {
    PyVarObject ob_base;
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        /* If interned is non-zero, the two references from the
           dictionary to this object are *not* counted in ob_refcnt.
           The possible values here are:
               0: Not Interned
               1: Interned
               2: Interned and Immortal
               3: Interned, Immortal, and Static
           This categorization allows the runtime to determine the right
           cleanup mechanism at runtime shutdown. */
        unsigned int interned:2;
        /* Character size:

           - PyUnicode_1BYTE_KIND (1):

             * character type = Py_UCS1 (8 bits, unsigned)
             * all characters are in the range U+0000-U+00FF (latin1)
             * if ascii is set, all characters are in the range U+0000-U+007F
               (ASCII), otherwise at least one character is in the range
               U+0080-U+00FF

           - PyUnicode_2BYTE_KIND (2):

             * character type = Py_UCS2 (16 bits, unsigned)
             * all characters are in the range U+0000-U+FFFF (BMP)
             * at least one character is in the range U+0100-U+FFFF

           - PyUnicode_4BYTE_KIND (4):

             * character type = Py_UCS4 (32 bits, unsigned)
             * all characters are in the range U+0000-U+10FFFF
             * at least one character is in the range U+10000-U+10FFFF
         */
        unsigned int kind:3;
        /* Compact is with respect to the allocation scheme. Compact unicode
           objects only require one memory block while non-compact objects use
           one block for the PyUnicodeObject struct and another for its data
           buffer. */
        unsigned int compact:1;
        /* The string only contains characters in the range U+0000-U+007F (ASCII)
           and the kind is PyUnicode_1BYTE_KIND. If ascii is set and compact is
           set, use the PyASCIIObject structure. */
        unsigned int ascii:1;
        /* The object is statically allocated. */
        unsigned int statically_allocated:1;
        /* Padding to ensure that PyUnicode_DATA() is always aligned to
           4 bytes (see issue #19537 on m68k). */
        unsigned int :24;
    } state;
} PyASCIIObject;

which in memory would be represented as:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | hash | Hash value; -1 if not set
24 | state | Several flags reflecting the current status and type of the string
28 | (padding) | Will be used by the subsequent string extensions

Again, using 32 bytes instead of the 48 of the current implementation.

The next type used for strings, which is based on this, would then become:

typedef struct {
    PyASCIIObject _base;
    Py_seq_len_t utf8_length;    /* Number of bytes in utf8, excluding the
                                 * terminating \0. */
    char *utf8;                 /* UTF-8 representation (null-terminated) */
} PyCompactUnicodeObject;

and in memory would be represented as:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | hash | Hash value; -1 if not set
24 | state | Several flags reflecting the current status and type of the string
28 | utf8_length | Number of bytes in utf8, excluding the terminating \0
32 | utf8 | UTF-8 representation (null-terminated)
40 | (padding) | Will be used by the subsequent string extensions

In this case it would occupy the same space (48 bytes) as the current implementation. While the next type, on the other hand, would benefit:

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;                     /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

In fact, its representation in memory would be:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | hash | Hash value; -1 if not set
24 | state | Several flags reflecting the current status and type of the string
28 | utf8_length | Number of bytes in utf8, excluding the terminating \0
32 | utf8 | UTF-8 representation (null-terminated)
40 | data | Canonical, smallest-form Unicode buffer

That is still 48 bytes of space occupied, as opposed to 64 in the current version in Python.

It is the turn of the dictionaries: PyDictObject

Finally, another very widely used object (also used internally) is certainly the dictionary, which would also benefit from the new implementation:

typedef struct {
    PyVarObject ob_base;
    PyDictKeysObject *ma_keys;

    /* If ma_values is NULL, the table is "combined": keys and values
       are stored in ma_keys.

       If ma_values is not NULL, the table is split:
       keys are stored in ma_keys and values are stored in ma_values */
    PyDictValues *ma_values;
} PyDictObject;

In fact, in memory it would be represented as:

Address/Offset | Element | Description
0 | ob_refcnt | Number of other objects which are referencing this one
4 | ob_size | Number of items in variable part
8 | ob_type | Type of the object
16 | ma_keys | Pointer to the keys of the dictionary
24 | ma_values | Pointer to the values of the dictionary

So 32 bytes instead of the 48 required by the current implementation.

Other considerations

I stop here in order not to over-lengthen the article, but the concept outlined should be clear enough and applicable to various other types of objects (even integers, which, for example, would become very similar to tuples) used both in Python and in the countless external libraries that have been created.

Similar reasoning could also be made for other basic data types that are used in various objects, such as Py_hash_t which is used for hash values. With 64-bit code, an integer of this size is used for their calculation, but one would have to ask whether this choice produces any real benefits.

The question to ask would be, in this case, whether the use of 64-bit hashes would make it possible to distribute the values of data structures using them much better than 32-bit hashes. Because if 32-bit hashes are sufficient, then they could also be adopted in 64-bit systems, so as to free up more space.
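Just as an illustration of the idea (this is not what CPython does), a 64-bit hash can be folded down to 32 bits while letting every input bit still influence the result, for example like this:

#include <stdint.h>

/* XOR-fold the upper half of a 64-bit hash onto the lower half. */
static inline uint32_t fold_hash32(uint64_t h64)
{
    return (uint32_t)(h64 ^ (h64 >> 32));
}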

It should be emphasised in this context that using smaller data types is not in itself sufficient to reduce the size of current data structures. In fact, and as can be verified starting already with the PyObject and PyVarObject types, it is essential to try to fit the smaller types together well, so as to skilfully exploit the empty spaces left by the reduction made and, thus, make ends meet by putting the pieces of the “puzzle” in place.
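A tiny illustration of why the ordering of the pieces matters (generic C, not CPython code; the sizes assume a typical 64-bit ABI):

#include <stdio.h>
#include <stdint.h>

/* Poor ordering: the 4-byte field is followed by a pointer that must be
   8-byte aligned, so 4 bytes of padding appear after it, plus 4 more at
   the end to keep the overall size a multiple of the alignment. */
struct scattered {
    int32_t refcnt;
    void   *type;
    int32_t size;
};

/* Better ordering: the two 4-byte fields share one 8-byte slot, so the
   pointer that follows is already aligned and no padding is needed. */
struct packed_well {
    int32_t refcnt;
    int32_t size;
    void   *type;
};

int main(void)
{
    printf("scattered:   %zu bytes\n", sizeof(struct scattered));    /* typically 24 */
    printf("packed_well: %zu bytes\n", sizeof(struct packed_well));  /* typically 16 */
    return 0;
}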

It also requires good vision and knowledge of the objects and a keen eye for low-level details. In fact, many things have been omitted to simplify the discussion, but the structure fields used are not always the same, as others are added and/or removed depending on certain compilation flags (e.g. for debugging or code profiling).

The important thing, in any case, is that the changes remain effective for the code generated for the “mainstream” build (the final one; the production build, in short), for obvious reasons.

A “Large” model for Python

All this can lead to what, drawing a parallel with the code models of the x86 architecture, could be called a “large” model for Python (the current one, consequently, being the “huge” model), in the sense that the new model imposes limits and restrictions on what can be done, whereas the current one, of course, has none.

The crux of the matter, therefore, is whether the new, narrower limits are tolerable in view of the advantages already highlighted in terms of memory savings and the associated access bandwidth, or whether they are too burdensome and hamper normal execution unacceptably.

To understand this, however, it is necessary to break away from the cold numbers (the theory) and see how the new model would behave in the real world (the practice), with some concrete examples.

Let us start with what would seem to be the most objectionable, namely the limitation on the reference count, which would go from more than 9 billion billion (2^63) to about 2 billion (2^31): an abysmal difference, to say the least! Yet reaching it would mean having as many as 2 billion objects referencing that specific object. Or, to make another comparison, it would be like having a list (or a tuple, or something else) containing the same object two billion times.

Clearly, these are absolutely realistic scenarios, but how common they might be (and, therefore, how much they might deter one from applying the new memory model) is, in my opinion, another matter entirely.

In fact, having 2 billion objects simultaneously in memory means that, taking the one that occupies the least (e.g. the None singleton), they will occupy 2 billion * 16 bytes = a good 32 GB of memory for them alone!

A staggering amount, albeit absolutely realistic and within the reach of modern computers, but one must consider that we are only talking about objects referencing exactly and only the same object. I doubt that there are any real-world applications that do just that…

Similar considerations can be made for the other limit, i.e. the maximum number of elements that can be contained in sequences (lists, tuples, …) or strings, which likewise goes from around 9 billion billion (2^63) down to about 4 billion (4Gi minus one, to be precise).

This means placing a ceiling of 4Gi minus one characters on a single string, or having lists that can accommodate a maximum of about 4 billion elements, to give another example. Neither, however, is a major limitation.

Taking the list example, it would mean that one containing 4 billion elements would require, for it alone, 4G * 8 bytes (each pointer to the referenced element takes 8 bytes) = approximately 32GB of memory. To which must be added at least another 4G * 16 bytes = about 64GB for the space occupied by the actual objects, assuming these occupy as little space as possible (16 bytes, in fact).

With regard to strings, we are still talking about roughly 4 billion characters, so we range from a minimum of 4G * 1 byte = approximately 4GB for ASCII strings (where a character only requires one byte) up to 4G * 4 bytes = approximately 16GB for those requiring the UCS-4 representation (UTF-8 itself takes at most 4 bytes per character). We are talking, however, of a limit that applies to a single string, and of course there can be far more strings, each with potentially the same limit/capacity.
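For whoever wants to play with these figures, here is a small sketch that reproduces the back-of-the-envelope calculations above (using GiB, i.e. powers of two, which is why the results match the rounded numbers in the text):

#include <stdio.h>

int main(void)
{
    const double GiB     = 1024.0 * 1024.0 * 1024.0;
    const double n_refs  = 2147483647.0;   /* 2^31 - 1: maximum reference count */
    const double n_items = 4294967295.0;   /* 2^32 - 1: maximum sequence length */

    printf("2^31-1 minimal (16-byte) objects: %6.1f GiB\n", n_refs  * 16 / GiB);
    printf("2^32-1 list item pointers:        %6.1f GiB\n", n_items *  8 / GiB);
    printf("2^32-1 minimal referenced items:  %6.1f GiB\n", n_items * 16 / GiB);
    printf("2^32-1 ASCII characters:          %6.1f GiB\n", n_items *  1 / GiB);
    printf("2^32-1 UCS-4 characters:          %6.1f GiB\n", n_items *  4 / GiB);
    return 0;
}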

On the other hand, these limits apply, precisely, to a single object, but during the execution of a Python application there can be billions of objects, together capable of consuming not only the available physical memory, but even the entire virtual address space of the processor.

Conclusions

It seems clear to me, therefore, that the limits imposed by the new “Large” model do not actually bring major problems, but, on the contrary, fit very well with the typical scenarios in which the most varied Python applications run.

Of course, it would still be possible to compile Python with the current memory model (“Huge”) for the small number of users who need to exceed these limits. We are, however, talking about scientific applications of a certain calibre, whose limited diffusion and usage do not, in my humble opinion, justify the higher memory consumption and access bandwidth of the current model, which is made to weigh on all the other, far more common use cases.

On the other hand, if it is true that technological advances have provided us with a lot of memory and bandwidth, it is equally true that this will not always be the case in the future (due to the demise of the aforementioned Moore’s Law), in addition to the fact that not all systems are equipped with them and, therefore, are penalised by the current choices.

It would be desirable, therefore, for the Python developers to take these considerations into account and implement this new model, so as to extend its benefits to the majority of users, who certainly do not need to use or reference such enormous numbers of objects: limits that, for them, would remain purely theoretical and never even remotely approached.
