Computational Linguistics

Demystifying NLP

Writing C in Cython

For the last two years, I’ve done almost all of my work in Cython. And I don’t mean, I write Python, and then “Cythonize” it, with various type-declarations etc. I just, write Cython. I use “raw” C structs and arrays, and occasionally C++ vectors, with a thin wrapper around malloc/free that I wrote myself. The code is almost always exactly as fast as C/C++, because it really is just C/C++ with some syntactic sugar — but with Python “right there”, should I need/want it.

This is basically the inverse of the old promise that languages like Python came with: that you would write your whole application in Python, optimise the “hot spots” with C, and voila! C speed, Python convenience, and money in the bank.

This was always much nicer in theory than in practice. In practice, your data structures have a huge influence on both the efficiency of your code and how annoying it is to write. Arrays are fast, but a pain to work with; lists are blissfully convenient, and very slow. Python loops and function calls are also quite slow, so the part you have to write in C tends to wriggle its way up the stack, until it’s almost your whole application.

Today a post came up on HN, on writing C extensions for Python. The author wrote both a pure Python implementation, and a C implementation, using the Numpy C API. This seemed a good opportunity to demonstrate the difference, so I wrote a Cython implementation for comparison:

import random
from cymem.cymem cimport Pool

from libc.math cimport sqrt

cimport cython

cdef struct Point:
    double x
    double y

cdef class World:
    cdef Pool mem
    cdef int N
    cdef double* m
    cdef Point* r
    cdef Point* v
    cdef Point* F
    cdef readonly double dt
    def __init__(self, N, threads=1, m_min=1, m_max=30.0, r_max=50.0, v_max=4.0, dt=1e-3):
        self.mem = Pool()
        self.N = N
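        # Flat C arrays, allocated from the Pool. cymem frees them when the World is garbage collected.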
        self.m = <double*>self.mem.alloc(N, sizeof(double))
        self.r = <Point*>self.mem.alloc(N, sizeof(Point))
        self.v = <Point*>self.mem.alloc(N, sizeof(Point))
        self.F = <Point*>self.mem.alloc(N, sizeof(Point))
        for i in range(N):
            self.m[i] = random.uniform(m_min, m_max)
            self.r[i].x = random.uniform(-r_max, r_max)
            self.r[i].y = random.uniform(-r_max, r_max)
            self.v[i].x = random.uniform(-v_max, v_max)
            self.v[i].y = random.uniform(-v_max, v_max)
            self.F[i].x = 0
            self.F[i].y = 0
        self.dt = dt


@cython.cdivision(True)
def compute_F(World w):
    """Compute the force on each body in the world, w."""
    cdef int i, j
    cdef double s3, tmp
    cdef Point s
    cdef Point F
    for i in range(w.N):
        # Set all forces to zero. 
        w.F[i].x = 0
        w.F[i].y = 0
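        # Accumulate pairwise inverse-square attractions: F = m[i] * m[j] * s / |s|**3,
        # added to body i and subtracted from body j (Newton's third law).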
        for j in range(i+1, w.N):
            s.x = w.r[j].x - w.r[i].x
            s.y = w.r[j].y - w.r[i].y

            s3 = sqrt(s.x * s.x + s.y * s.y)
            s3 *= s3 * s3  # s3 is now |s|**3

            tmp = w.m[i] * w.m[j] / s3
            F.x = tmp * s.x
            F.y = tmp * s.y

            w.F[i].x += F.x
            w.F[i].y += F.y

            w.F[j].x -= F.x
            w.F[j].y -= F.y


@cython.cdivision(True)
def evolve(World w, int steps):
    """Evolve the world, w, through the given number of steps."""
    cdef int _, i
    for _ in range(steps):
        compute_F(w)
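        # Semi-implicit Euler: update each velocity from its force, then each position from the new velocity.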
        for i in range(w.N):
            w.v[i].x += w.F[i].x * w.dt / w.m[i]
            w.v[i].y += w.F[i].y * w.dt / w.m[i]
            w.r[i].x += w.v[i].x * w.dt
            w.r[i].y += w.v[i].y * w.dt

The Cython version took about 30 minutes to write, and it runs just as fast as the C code — because, why wouldn’t it? It *is* C code, really, with just some syntactic sugar. And you don’t even have to learn or think about a foreign, complicated C API… You just, write C. Or C++, although that’s a little more awkward. Both the Cython version and the C version are about 70x faster than the pure Python version, which uses Numpy arrays.
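
If you want to run the listing above: a Cython module builds like any other C extension. Here is a minimal build script, as a sketch; the file name sim.pyx is my assumption, not something from the original post:

from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("sim.pyx"))

Compile in place with “python setup.py build_ext --inplace”, and “import sim” then works like any other module.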

One difference from C: I wrote a little wrapper around malloc/free, cymem. All it does is remember the addresses it served, and when the Pool is garbage collected, it frees the memory it allocated. I’ve had no trouble with memory leaks since I started using this.
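
To make the idea concrete, here is a toy sketch of a Pool-like wrapper. This is not cymem’s actual implementation, just the concept: record every address you hand out, and free them all in __dealloc__, which runs when the object is garbage collected.

from libc.stdlib cimport calloc, free

cdef class ToyPool:
    cdef list _addresses

    def __cinit__(self):
        self._addresses = []

    cdef void* alloc(self, size_t number, size_t elem_size) except NULL:
        # Zero-initialised C allocation, outside the Python heap.
        cdef void* ptr = calloc(number, elem_size)
        if ptr == NULL:
            raise MemoryError()
        # Remember the address, as an integer, so __dealloc__ can free it.
        self._addresses.append(<size_t>ptr)
        return ptr

    def __dealloc__(self):
        # Runs on garbage collection: free every address we served.
        cdef size_t addr
        for addr in self._addresses:
            free(<void*>addr)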

The “intermediate” way of writing Cython, using typed memoryviews, lets you use Numpy’s multi-dimensional array features. However, to me it feels more complicated, and the applications I tend to write involve very sparse arrays, where, once again, I want to define my own data structures.
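
For contrast, here is roughly what the memoryview style looks like; this is a sketch, not code from the post. Numpy owns the memory, and the double[:, ::1] declaration gives Cython a typed view onto it:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def update_positions(double[:, ::1] r, double[:, ::1] v, double dt):
    # r and v are (N, 2) C-contiguous arrays, e.g. created with numpy.zeros((N, 2)).
    cdef int i
    for i in range(r.shape[0]):
        r[i, 0] += v[i, 0] * dt
        r[i, 1] += v[i, 1] * dt

The ::1 declares the last dimension contiguous, which also gives the C compiler a better shot at vectorizing the inner loop.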


I found a Russian translation of this post here: http://habrahabr.ru/company/mailru/blog/242533/ . I don’t know how accurate it is.

2014/10/21 | Uncategorized

8 Comments

  1. I recently posted the next part in the blog series you mentioned above. By removing array-indexing macros from the code and operating on C pointers I saw a 45% performance increase. I’d be interested to know if there’s a way to do something similar in Cython.

    Here’s the link:

    https://www.crumpington.com/blog/2014/10-21-high-performance-python-extensions-part-2.html

    Comment by J. David Lee | 2014/10/23 | Reply

    • Okay so: when I ran your benchmark, I couldn’t reproduce the gap between Simple1 and Simple2. I get a lot of variance in my runs, and Simple2 often performs similarly to Simple1 for me. But the Cython version consistently performs at the same level as Simple2.

      I’m running on a MacBook Air, and there might be some difference between the clang and gcc compilers here. I’ll post you some numbers; give me a second.

      Comment by honnibal | 2014/10/23 | Reply

      • Here are 5 runs. As you can see, on run 3, C Simple 1 has a sharp dip in performance. But on all of the other runs, performance is similar across the 3 implementations.

        Btw, I can get performance to 22k steps/sec by switching to single precision floats. Is double precision necessary here?

        $ python run-tests.py
        Cython (1): 17814 steps/sec
        C Simple 1 (1): 16929 steps/sec
        C Simple 2 (1): 18214 steps/sec

        $ python run-tests.py
        Cython (1): 17575 steps/sec
        C Simple 1 (1): 17849 steps/sec
        C Simple 2 (1): 17498 steps/sec

        $ python run-tests.py
        Cython (1): 18081 steps/sec
        C Simple 1 (1): 13733 steps/sec
        C Simple 2 (1): 18202 steps/sec

        $ python run-tests.py
        Cython (1): 17251 steps/sec
        C Simple 1 (1): 17606 steps/sec
        C Simple 2 (1): 17278 steps/sec

        $ python run-tests.py
        Cython (1): 17672 steps/sec
        C Simple 1 (1): 17261 steps/sec
        C Simple 2 (1): 17253 steps/sec

        Comment by honnibal | 2014/10/23

      • That’s interesting. I don’t think it’s possible for a compiler to vectorize the “C Simple 1” implementation because of the way the array indices are computed at runtime. The fact that the timings for all your runs are so similar makes me suspect that your compiler isn’t using SSE for the “C Simple 2” implementation.

        I have no idea if the simd1.c implementation will compile on a Mac, but it should definitely be faster than “C Simple 1”.

        Since it’s a toy problem, the choice of float64 vs float32 was arbitrary.

        Thank you for running these tests.

        Comment by J. David Lee | 2014/10/23

    • To answer your question about “direct” array access… Well, I’m not going via numpy or anything. I was passing in the World struct, and letting the function pick the member arrays out of that. I switched to passing the arrays into computeF directly, which fixes the Python function call overhead. Now everything within the evolve function is “pure C”. This made almost no difference to performance; it looks like it’s up about 5%.

      Here’s the current version:

      https://github.com/syllog1sm/python-numpy-c-extension-examples/blob/master/lib/sim2.pyx

      Here’s the output of “cython -a”, which allows you to view the compilation line-by-line:

      https://rawgit.com/syllog1sm/python-numpy-c-extension-examples/master/lib/sim2.html

      Click each line to see the generated C. If the lines are coloured yellow, that means the Python C API is being called, and things will be slow. If the lines are white, it’s pure C.

      Comment by honnibal | 2014/10/23 | Reply

  2. Thanks for posting this — I had also just read that post and thought: “wouldn’t it be easier to do this in Cython?” I used to write code much like the OP, but now I far prefer Cython. Having it deal with the Python-C interface for you is worth a LOT.

    Note that another option is to write C functions, and then call them from Cython. But as you point out, you usually get essentially the same results — in fact, almost identical code.

    Comment by Chris Barker | 2014/10/25 | Reply

    • Hi. I’m the author of the original post. I’ve used Cython a little bit, and I wonder if there are ways to do things like specify that arrays will be contiguous, or use OpenMP. In part 2 of that series, using contiguous arrays allowed the compiler to vectorize the code, and it ran about 45% faster in that case. Using OpenMP in C is straightforward (see part 3 of that series), and it provided a substantial speed-up on a quad-core machine, around 300% over the fastest single-core version.

      I don’t know if those things are possible or easy using Cython. If you have some ideas or resources I’d be happy to know about them.

      Thanks,

      David

      Comment by J. David Lee | 2014/10/29 | Reply

