Thứ Tư, 18 tháng 11, 2015

Programming languages 101: Python containers vs. C++ containers

Hi all,
    I am always fascinated about programming languages, computer science and technology in general. At a dinner with my friends, we were discussing about programming languages Python, R, Fortran, C/C++ and data science. At a particular time, we were moving into an topic with a simple question how to implement linked-list in Python. For an experienced programmer like me, it is quite ridiculous for me to ask this question, because we do not need it in most cases. I think that Python will take care of basic data structures for you and you just use it. Python uses C to implement basic data structures for you, called CPython. And if you are crazy enough or want to reinvent the wheel or implement a new data structure, you can implement it in C/C++ and bind it to Python.
    So, I spent that a whole night to rethink about a new question: What underlying implementation of Python container list [] and what is the equivalence containers/implementations in C++?
    At first, I think the equivalent container in C++ would be vector. In the end, it has a slight  difference in their implementation. The implementation of Python list uses a contiguous array of references to other objects, and keeps a pointer to this array and the array’s length in a list head structure. Some explanations can found at [1], [2]. The intuitive idea is represented through this simple picture.
Courtesy from Laurent Luce [2]
The first left column represents array of pointers to keep tracking of Python objects. The right column represents Python objects itself. Any insertion, append, removal of elements would affect arrangements of the first column (array of pointers), but not on the second column (Python objects). In fact, moving expensive Python objects around is more than costly than its pointers. In contrast, C++ vector utilizes array as underlying storage mechanism. In a sense, Python list [] is better than C++ vector in the insert operation. In particular, if you insert an element in the middle of the array, the Python list just only need to shift references of Python objects, while C++ need to shift the C++ objects to new positions. The efficiency, you can imagine, is usually between moving heavy class objects and moving slight pointers. However, C++ offers more containers that are suitable your need. For example, if you want to insert objects frequently, it is efficient to use C++ lists (implemented by a double linked-list).
typedef struct {
    PyObject_HEAD
    Py_ssize_t ob_size;

    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     */
    Py_ssize_t allocated;
} PyListObject;

    For me, Python is still new. And I think the best way to learn and understand a language is to ask simple and stupid questions like this. I have recently admired the development of Python language because it is motivated and evolved from exciting huge efforts of an open-source community.
    Of course, the container are quite an important concept of a programming language. The container in Python includes list, tupple, dict, string and something else. The container in C++11 includes vector, list (equivalent to Python list), forward_list (single linked-list), map (similar concept to Python dict), unorder_map, deque (double linked-list queue), string, and something else.
    Today, I have motivations to write blogs in Vietnamese. I decided that this blog is bilingual, i.e., English and Vietnamese. For me, it is easier to write technical contents in English, although Vietnamese is my mother tongue. It is because I tried to adapt to technical contents around Internet in English when I first came to learn computer science in university. Now, I find it hard to describe some technical concept in Vietnamese, especially with fast-pace changing field like Computer Science (CS). Overall, it is helpful because it is fun to contribute back to the community.

References:
[1] http://stackoverflow.com/questions/3917574/how-is-pythons-list-implemented
[2] http://www.laurentluce.com/posts/python-list-implementation/