Multi-threading API Requests in Python

Speeding up Python using multi-threading

When making hundreds or thousands of API calls, things can quickly get really slow in a single-threaded application.

No matter how well your own code runs, you’ll be limited by network latency and the response time of the remote server. Making 10 calls with a 1-second response time is maybe OK, but now try 1,000. Not so fun.

For a recent project I needed to make almost 50,000 API calls, and the script was taking hours to complete. Looking into multi-threading was no longer optional; it was required.

Classic Single Threaded Code

This is the boilerplate way to make an API request and save the contents as a file. The code simply loops through a list of URLs and downloads each one as a JSON file, giving it a unique name.

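import requests
import uuid

url_list = ['url1', 'url2']

for url in url_list:
    # .content reads the whole response body into memory
    response = requests.get(url)
    file_name = uuid.uuid1()
    # The context manager closes the file once the body is written
    with open(f'{file_name}.json', 'wb') as f:
        f.write(response.content)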

Multi Threaded Code

For comparison, here is the same code running multi-threaded.

import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

url_list = ['url1', 'url2']

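def download_file(url, file_name):
    try:
        response = requests.get(url)
        # The context manager closes the file once the body is written
        with open(f'{file_name}.json', 'wb') as f:
            f.write(response.content)
        return response.status_code
    except requests.exceptions.RequestException as e:
        # Return the exception so the caller can see that the request failed
        return e

def runner():
    threads = []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for url in url_list:
            file_name = uuid.uuid1()
            # submit() schedules the call and returns a Future immediately
            threads.append(executor.submit(download_file, url, file_name))

    # Leaving the with block waits for every download to finish;
    # as_completed then yields the finished futures
    for task in as_completed(threads):
        print(task.result())

runner()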

Breaking it down, you first need to import ThreadPoolExecutor and as_completed from concurrent.futures. This module is part of the Python standard library, so there is nothing to install.

Next you must encapsulate your downloading code in its own function. The download_file function does this in the above example; it is called with the URL to download and a file name to use when saving the downloaded contents.

The main part comes in the runner() function. First create an empty list to hold the submitted tasks. The example calls it threads, although what it actually stores are the Future objects returned by the executor.

threads = []

Then create your pool of threads with your chosen number of workers (threads). This number is up to you, but for most APIs I would not go crazy here; otherwise you risk being blocked by the server. For me 10 to 20 works well.

with ThreadPoolExecutor(max_workers=20) as executor:
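To get a feel for how the worker count changes total run time without hitting a real server, here is a minimal sketch that swaps the API call for a 50 ms sleep. The names fake_request and timed_run are made up for this illustration, and the sleep is only an assumed stand-in for network latency, so real numbers will differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request(url):
    # Hypothetical stand-in for an API call: just waits 50 ms
    time.sleep(0.05)
    return 200

def timed_run(urls, max_workers):
    # Run every fake request through a pool and time the whole batch
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fake_request, url) for url in urls]
        results = [f.result() for f in futures]
    return time.perf_counter() - start, results

if __name__ == "__main__":
    urls = [f"url{i}" for i in range(20)]
    for workers in (1, 5, 20):
        elapsed, _ = timed_run(urls, workers)
        print(f"max_workers={workers}: {elapsed:.2f} s")
```

With one worker the sleeps run back to back; with more workers they overlap, so the batch finishes sooner. A real API adds rate limits on top of this, which is why a moderate worker count is usually the safer choice.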

Next loop through your URL list and submit a new task for each URL as shown below. Here it’s clear why you need to encapsulate your download code into a function: the first argument to submit() is the function you wish to run in a worker thread, and the arguments after that are passed on to that function.

You can think of this as making multiple copies of the downloading function and then running each one concurrently in different threads. Because each thread spends most of its time waiting on the network, the waits overlap, which is where the speedup comes from.

threads.append(executor.submit(download_file, url, file_name))
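To make the submit() call a bit more concrete, here is a small sketch showing that it hands back a Future right away and forwards the extra arguments to the function. The describe function and its arguments are placeholders invented for this example; no download takes place.

```python
from concurrent.futures import Future, ThreadPoolExecutor

def describe(url, file_name):
    # Placeholder for the download function; only builds a string
    return f"{url} -> {file_name}.json"

def submit_demo():
    with ThreadPoolExecutor(max_workers=2) as executor:
        # First argument is the function, the rest are its arguments
        future = executor.submit(describe, "url1", "abc")
        # result() blocks until the worker thread has finished the call
        return isinstance(future, Future), future.result()

print(submit_demo())  # (True, 'url1 -> abc.json')
```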

Finally we print out the return value from each thread as it completes (in this case we returned the status code from the API call, or the exception if the request failed).

for task in as_completed(threads):
    print(task.result())
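Since as_completed yields results in the order they finish, not the order they were submitted, it can help to remember which URL each future belongs to. The sketch below is one possible way to do that with a dictionary. The fake_download function is a hypothetical stand-in that returns either a status code or an exception, mirroring what download_file returns, so it runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_download(url):
    # Hypothetical stand-in: URLs containing "bad" simulate a failed request
    if "bad" in url:
        return ConnectionError(f"could not reach {url}")
    return 200

def run(urls, max_workers=4):
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the URL it was submitted for
        future_to_url = {executor.submit(fake_download, u): u for u in urls}
        for future in as_completed(future_to_url):
            outcomes[future_to_url[future]] = future.result()
    return outcomes

def failed_urls(outcomes):
    # A returned exception marks a failed request
    return sorted(u for u, r in outcomes.items() if isinstance(r, Exception))

outcomes = run(["url1", "bad-url", "url2"])
print(failed_urls(outcomes))  # ['bad-url']
```

Collecting the failures this way makes it easy to retry only the URLs that did not succeed.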

That’s it. Easy to implement and gives a huge speedup. In my case I ended up with this performance.

Time taken: 1357 seconds (22 minutes)
49980 files
1.03 GB

This works out to almost 37 files a second, or about 2,210 files per minute. This is at least a 10x improvement in performance.
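The throughput figures can be re-derived from the run time and file count above with a couple of lines. The function name here is my own; only the two input numbers come from the run.

```python
def throughput(files, seconds):
    # Files per second, and the same rate expressed per minute
    per_second = files / seconds
    return per_second, per_second * 60

per_second, per_minute = throughput(49980, 1357)
print(f"{per_second:.1f} files/s, {per_minute:.0f} files/min")
```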

The full Python docs are here: https://docs.python.org/3/library/concurrent.futures.html
