Python is slow in terms of execution speed. Period. It is fast enough for most tasks out there, but when it comes to number crunching or manipulation of lots of objects, you can hear the interpreter grind its teeth.
Python advocates have a lot to say on how to handle these situations. The most common is "you can always rewrite that particular part of code in C". There are other recommendations: "Just import psyco", "Try to write your whole app in RPython", "Pyrex is impressive" and so on.
All of these possibilities do have some additional requirement, though. You will have to learn how to use these tools. Being fluent in Python alone will not be enough.
The other day I had a very narrow bottleneck in my Python code. It was all about checking what the first ball of a couple of balls is, in which a given n-dimensional point lies. "A couple" stands for several millions in this case; let's call it N.
The loop was like this:
for i, ball in enumerate(balls): distance = point - ball if dot(distance.T, distance) <= radiusSquared: return i
If you know scipy, this will be easy stuff for you. If you don't: it's just measuring the distance from the query point to the center of a ball and checking wether that distance is greater or less than the radius of all balls. The function returns as soon as it has found one ball.
The tricky point about this is, that the Python interpreter will make a for loop over N numbers and call a builtin (dot) in every loop. Once the builtin is called, Scipy's nature as a C extension speeds things up.
But what if the whole for loop could be written as a builtin? That's what we could do by writing it as a C extension.
Instead, I decided to look around and weave sounded interesting: Inline-C++. I figured that rewriting the above loop in C++ would be fairly easy. I copied the skeleton from scipy's PerformancePython webpage and just put in my code:
nBalls, dim = balls.shape radiusSquared = self.radiusSquared code = """ return_val = -1; for (long i = 0; i < nBalls; i++) { double distance = 0.0; for (long j = 0; j < dim; j++) { double diff = balls(i, j) - point(j); distance += diff * diff; } if (distance <= radiusSquared) { return_val = i; break; } } """ variables = 'point', 'balls', 'nBalls', 'dim', 'radiusSquared', result = weave.inline( code, variables, type_converters=weave.converters.blitz, compiler='gcc') return result if result != -1 else None
The amount of lines grew quite a bit, but it still fits on a single screen in my editor. It's actually quite simple, but we can expect to get some benefits of this:
- We can expect to run the above code completely from the L1 cache
- We can expect all our data to be at least in the L2 cache
And when code greatly decreases its cache misses, things start to become really fast.
In this case, speedup was of about several orders of magnitude. The function itself was faster by factor 2**12; the whole program started to run 60x as quick as before.
Furthermore, there were no caveats. Everything worked out of the box. On the first call of the function, the above code was compiled, which slowed down the first call. But once it is compiled (and weave caches compiled code over several interpreter sessions), things get pretty fast.
6 comments:
Using psyco can be as simple as putting the following two lines at the top of your code:
import psyco
psyco.full()
It probably won't give you as great a speed increase as Weave, but I'd say it's easier to use.
What is your "n"?
Did you try to look if there is some overhead involved? Where is the break-even?
There should be no overhead except the compiling once, since weave compiles a builtin function.
I had N of different sizes, starting at roughly ~2^20.
How often did you iterate the given code with 1M elements?
There could be some overhead because of boundary crossing from C to Python land, right? Does the data get copied from Python data structures in to C ones?
The iteration happens several times; actually it is part of the insert function of a data structure. [1]
The reduction of crossings of the C <-> Python boundary is what seems to be a major factor of the speedup here... and the data is not copied.
I really like your writing style, great information, thankyou for posting. wärmerückgewinnung industrie
Post a Comment