Python Asynchronous Library Comparison

In this story, I will summarize three different ways to achieve asynchronous programming in Python.


There are usually two types of problems where asynchronous programming can have a big impact; they are commonly called CPU-intensive and I/O-intensive:

  • CPU-intensive: tasks that require a lot of CPU cycles, such as sorting, searching, graph traversal, and matrix multiplication.
  • I/O-intensive: tasks that spend most of their time waiting on I/O, such as reading/writing files or making HTTP requests (see the sketch below).
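
To make the distinction concrete, here is a minimal sketch (the function names and the example URL are mine, purely for illustration): the first function is CPU-bound because it spends its time computing, while the second is I/O-bound because it spends its time waiting on the network.

import requests

def cpu_bound(n):
    # CPU-bound: all the time goes into arithmetic.
    return sum(i * i for i in range(n))

def io_bound(url):
    # I/O-bound: most of the time is spent waiting for the server to respond.
    return requests.get(url).status_code

print(cpu_bound(10_000_000))
print(io_bound("https://stackoverflow.com"))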

We will use these three libraries and look at the differences between them:

  • threading
  • multiprocessing
  • asyncio

Because I/O is asynchronous by default in Node.js, we rarely need to think about it there, but it’s a completely different story in Python.
We will use the requests library as our test tool, since downloading web pages is I/O-intensive.

Let’s start running the code. First, let’s try the synchronous version:

import requests
import time


def download_site(url, session):
    with session.get(url) as response:
        print("Got content from website: {}".format(url))


def download_all_sites(sites):
    with requests.Session() as session:
        for url in sites:
            download_site(url, session)


if __name__ == "__main__":
    sites = ["https://stackoverflow.com", "https://github.com"] * 10
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print("Download time: {}".format(duration))

In this sample, we download each of the two sites 10 times (20 requests in total), and it took 8.953143119812012 seconds.

Let’s try using threading to speed this program up.

import concurrent.futures
import threading
import requests
import time

thread_local = threading.local()


def get_session():
    if getattr(thread_local, "session", None) is None:
        thread_local.session = requests.Session()
    return thread_local.session


def download_site(url):
    session = get_session()
    with session.get(url) as response:
        print("Got content from website: {}".format(url))


def download_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(download_site, sites)


if __name__ == "__main__":
    sites = ["https://stackoverflow.com", "https://github.com"] * 10
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print("Download time: {}".format(duration))

This version took about 2.7348077297210693 seconds, a big improvement. Note that each thread gets its own requests.Session via threading.local(), since a Session is not guaranteed to be safe to share across threads.

Let’s try the asyncio version.

import asyncio
import aiohttp
import time


async def download_site(session, url):
    async with session.get(url) as response:
        print("Got content from website: {}".format(url))


async def download_all_sites(sites):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in sites:
            task = asyncio.ensure_future(download_site(session, url))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    sites = ["https://stackoverflow.com", "https://github.com"] * 10
    start_time = time.time()
    asyncio.get_event_loop().run_until_complete(download_all_sites(sites))
    duration = time.time() - start_time
    print("Download time: {}".format(duration))

This version took only 0.6141669750213623 seconds, the fastest of all the versions so far. asyncio does everything in a single thread: while one request is waiting on the network, the event loop switches to another task, so we avoid the overhead of creating and switching threads.

Let’s try the multiprocessing version. multiprocessing is different from threading and asyncio: it spawns separate Python processes, so it can use multiple CPU cores on your machine.

import requests
import multiprocessing
import time

session = None


def set_global_session():
    global session
    if not session:
        session = requests.Session()


def download_site(url):
    with session.get(url) as response:
        print("Got content from website: {}".format(url))


def download_all_sites(sites):
    with multiprocessing.Pool(initializer=set_global_session) as pool:
        pool.map(download_site, sites)


if __name__ == "__main__":
    sites = ["https://stackoverflow.com", "https://github.com"] * 10
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print("Download time: {}".format(duration))

This version took around 2.18524312973022461 seconds.

  • For CPU-intensive tasks, we should use multiprocessing, because it is the only approach of the three that side-steps the GIL (Global Interpreter Lock) and uses multiple CPU cores to cut computation time (see the sketch below).
  • For I/O-intensive tasks, we can choose threading or asyncio; either lets the program keep working while it waits on I/O, which gives much higher performance.
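
As a rough illustration of the CPU-intensive case (this sketch is mine and was not part of the timed tests above; it assumes a machine with several cores), a multiprocessing.Pool can spread a pure-computation workload across processes, which threads could not do because of the GIL:

import multiprocessing
import time

def cpu_bound(n):
    # Pure computation: threads would not speed this up because of the GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    numbers = [5_000_000 + x for x in range(20)]
    start_time = time.time()
    with multiprocessing.Pool() as pool:  # one worker per CPU core by default
        pool.map(cpu_bound, numbers)
    print("Calculation time: {}".format(time.time() - start_time))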

There are many other approaches to running Python tasks asynchronously. My suggestion is to try Celery (https://docs.celeryproject.org/), a superb distributed task queue that can run many, many tasks asynchronously.
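
As a tiny, hedged sketch of what using it looks like (the module name tasks, the add task, and the Redis broker URL are my assumptions; you need a running broker for this to work):

# tasks.py: a minimal Celery app; assumes a Redis broker at this URL.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def add(x, y):
    return x + y

Start a worker with celery -A tasks worker, then call add.delay(4, 4) from another process: the call returns immediately and the worker executes the task asynchronously.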

Good luck!