In this post I want to highlight three patterns for writing very fast Python code:
The first pattern: create many processes and communicate their results back through a queue.
```python
from multiprocessing import Process, Queue
from random import randint

def do_stuff(queue, table):
    """Some irrelevant, long computation"""
    print(f"Doing table {table}")
    result = sum([randint(0, 40) for i in range(1_000_000)])
    # 4) When we are done, we send the result through the queue
    queue.put(result)

def main():
    # 0) The queue will be our communication channel
    queue = Queue()
    # 1) Create and start our processes
    tables = [i for i in range(40)]
    processes = []
    for table in tables:
        process = Process(target=do_stuff, args=(queue, table))
        process.start()
        processes.append((process, queue))
    # 2) Wait for them all and collect the results
    out = []
    for process, queue in processes:
        # 3) join() blocks until the individual process terminates
        process.join()
        # 5) We now have a result we can append to a global state
        result = queue.get()
        out.append(result)
    print(out)

if __name__ == "__main__":
    main()
```
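One caveat with this pattern: the multiprocessing docs warn that a process which has put a large object on a queue will not terminate until that data has been flushed, so joining before draining the queue can deadlock. A minimal sketch of the safer ordering, with a toy `worker` function standing in for the long computation above:

```python
from multiprocessing import Process, Queue

def worker(queue, n):
    # Toy computation standing in for the long one above
    queue.put(n * n)

def collect(n_procs=4):
    queue = Queue()
    procs = [Process(target=worker, args=(queue, i)) for i in range(n_procs)]
    for p in procs:
        p.start()
    # Drain the queue *before* joining: a child that has queued a large
    # result will not exit until the data is flushed, so joining first
    # can deadlock (see the multiprocessing docs on pipes and queues)
    results = [queue.get() for _ in procs]
    for p in procs:
        p.join()
    return sorted(results)
```

With small results like the integers above, join-then-get happens to work; draining first is simply the ordering that stays safe regardless of result size.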
You can control how many processes are created by using a process pool.
```python
import multiprocessing
from multiprocessing import Pool
from random import randint

def do_stuff(table):
    """Some irrelevant, long computation"""
    print(f"Doing table {table}")
    return sum([randint(0, 40) for i in range(1_000_000)])

def main():
    n_cores = multiprocessing.cpu_count()
    # Generate a list of some input data
    tables = [i for i in range(20)]
    # Create the process pool with half the available cores
    # https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
    with Pool(n_cores // 2) as pool:
        # Same as map(do_stuff, tables), except it distributes the work
        # across processes, splitting the workload (tables) into pieces
        # of size 'chunksize'
        results = set(pool.map(do_stuff, tables, chunksize=1))
    print(f'{len(results)} results:')
    print(results)

if __name__ == '__main__':
    main()
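The standard library also offers a higher-level interface to the same pattern in `concurrent.futures`. A minimal sketch, with a toy `square` function (my own placeholder, not from the post) standing in for the computation:

```python
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # Toy computation standing in for do_stuff
    return n * n

def run_pool(inputs=(0, 1, 2, 3)):
    # executor.map works like pool.map and also accepts a chunksize
    with ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(square, inputs, chunksize=1))
```

`ProcessPoolExecutor` additionally gives you futures (`executor.submit`) if you need to collect results as they complete rather than in input order.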
Taking advantage of multiple processes can only get you so far. At some point, you may have to reach for a faster language, at least for the core algorithm that takes most of the execution time. You can keep Python as the orchestration layer and use an FFI (foreign function interface) to call a fast implementation of the algorithm written in a fast language.
You then get the best of both worlds: Python's ergonomics for the glue code, and native speed for the hot path. I wrote about this in another post.
Here is the interesting bit: Python for the CLI, Rust for the algorithm, and FFI to call Rust from Python.
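A minimal sketch of the FFI mechanics using `ctypes` from the standard library. The Rust function name and crate setup in the comments are illustrative assumptions, not from the post; to keep the sketch runnable without a Rust toolchain, it calls `sqrt` from the system math library instead, but the calling mechanics are identical:

```python
import ctypes
import ctypes.util

# On the Rust side, an exported function might look something like:
#
#     #[no_mangle]
#     pub extern "C" fn run_simulation(n: u64) -> f64 { ... }
#
# compiled with crate-type = ["cdylib"] and loaded here via
# ctypes.CDLL("target/release/libmysim.so") (illustrative names).
# Here we load the system math library to demonstrate the mechanics.
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature so ctypes converts arguments correctly
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

def fast_sqrt(x: float) -> float:
    # Crosses the FFI boundary into native code
    return libm.sqrt(x)
```

The key point is that only the hot function crosses the boundary; everything else (argument parsing, I/O, orchestration) stays in Python.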
I also wrote a bit before about the methodology I used to make a simple Monte Carlo simulation (in another context) much faster using Rust.