I stumbled over the following question the other day:
Is there a simple way to make python parallelize list comprehensions?
List comprehensions in general are not easily parallelizable with standard Python tools. But if the comprehension just maps a computationally intensive function over a list, and you want to parallelize the function calls, then it's fairly easy to do.
Consider the following scenario:
```python
def func(n):
    return n * 2

l = [1, 2, 3, 4, 5, 6, 7, 8]
k = [func(x) for x in l]
print(k)
```
In this example, `func` gets applied to the elements of `l` sequentially, and to compute one result, `func` does not need to consider anything that happened before or after. This is fairly inefficient if you are on a multicore machine and could have one instance of `func` execute on each of those cores.
To parallelize this, you just need to use the `Pool` class from the `multiprocessing` module:
```python
from multiprocessing import Pool

def func(n):
    return n * 2

l = [1, 2, 3, 4, 5, 6, 7, 8]

with Pool() as p:
    k = p.map(func, l)
print(k)
```
This runs on multiple cores by default. If you pass an `int` argument to `Pool`, you can specify the number of worker processes; by default it just uses the number of CPUs that Python thinks are available.
Here are the docs for Pool.
One of the many caveats of using pools is that the objects passed to them must be serializable, which for functions means that they need to be defined at the top level of a module (this is heavily simplified, but the details are not strictly relevant to the question anyway). So something like the following will not work:
```python
from multiprocessing import Pool

def my_amazing_func():
    # func is not defined on the top level
    def func(n):
        return n * 2

    l = [1, 2, 3, 4, 5, 6, 7, 8]
    with Pool() as p:
        k = p.map(func, l)
    return k

print(my_amazing_func())
```
Instead, it will fail with something along the lines of `AttributeError: Can't pickle local object`. That is to be expected, but can be fairly confusing if you don't know the background.
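A common workaround is to move the function to the top level of the module and, if it needs extra arguments, wrap it with `functools.partial`, which remains picklable as long as the underlying function is. A minimal sketch (the names `scale` and `my_amazing_func` are just for illustration):

```python
from functools import partial
from multiprocessing import Pool

# defined at the top level of the module, so it can be pickled
def scale(factor, n):
    return n * factor

def my_amazing_func():
    l = [1, 2, 3, 4, 5, 6, 7, 8]
    # a partial over a top-level function is still picklable
    with Pool() as p:
        k = p.map(partial(scale, 2), l)
    return k

if __name__ == '__main__':
    print(my_amazing_func())
```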
But on the other hand, pools can net you some amazing performance gains.
Here is a small demonstration program that compares execution times for a function that takes at least one second per call. If you are on a machine with multiple cores, the second version will execute much faster than the first.
```python
from time import sleep, perf_counter
from multiprocessing import Pool

def func(n):
    sleep(1)
    return n * 2

if __name__ == '__main__':
    l = [1, 2, 3, 4, 5, 6, 7, 8]

    # takes 8 seconds
    start = perf_counter()
    k1 = [func(x) for x in l]
    end = perf_counter()
    time_k1 = end - start

    with Pool() as p:
        # takes less than 8 seconds
        start = perf_counter()
        k2 = p.map(func, l)
        end = perf_counter()
        time_k2 = end - start

    assert k1 == k2
    print('k1 took', time_k1)
    print('k2 took', time_k2)
```
That's all. I hope someone found this interesting or helpful; if so, feel free to shoot me a message via the contact form.