When making hundreds or thousands of API calls, things can quickly get very slow in a single-threaded application.
No matter how efficient your own code is, you’ll be limited by network latency and the response time of the remote server. Making 10 calls with a one-second response time is maybe OK, but now try 1,000. Not so fun.
For a recent project I needed to make almost 50,000 API calls, and the script was taking hours to complete. At that point multi-threading was no longer optional, it was required.
Classic Single-Threaded Code
This is the boilerplate way to make an API request and save the contents to a file. The code simply loops through a list of URLs, calls each one, and saves each response as a JSON file with a unique name.
import requests
import uuid

url_list = ['url1', 'url2']

for url in url_list:
    # Fetch the URL and save the response body to a uniquely named file
    response = requests.get(url, stream=True)
    file_name = uuid.uuid1()
    with open(f'{file_name}.json', 'wb') as f:
        f.write(response.content)
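To see how slow this really is, it is worth timing the loop before adding threads. Here is a minimal sketch using time.perf_counter from the standard library; the URLs are placeholders as above and the printout is illustrative:
import time
import requests
import uuid

url_list = ['url1', 'url2']  # placeholder URLs

start = time.perf_counter()
for url in url_list:
    response = requests.get(url, stream=True)
    file_name = uuid.uuid1()
    with open(f'{file_name}.json', 'wb') as f:
        f.write(response.content)
elapsed = time.perf_counter() - start

# Sequential wall-clock time is roughly (number of URLs) x (response time)
print(f'Downloaded {len(url_list)} files in {elapsed:.1f} seconds')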
Multi-Threaded Code
For comparison, here is the same code running multi-threaded.
import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

url_list = ['url1', 'url2']

def download_file(url, file_name):
    try:
        # Fetch the URL and save the response body to a file
        response = requests.get(url, stream=True)
        with open(f'{file_name}.json', 'wb') as f:
            f.write(response.content)
        return response.status_code
    except requests.exceptions.RequestException as e:
        return e

def runner():
    threads = []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for url in url_list:
            file_name = uuid.uuid1()
            threads.append(executor.submit(download_file, url, file_name))
        for task in as_completed(threads):
            print(task.result())

runner()
Breaking it down, you first need to import ThreadPoolExecutor and as_completed from concurrent.futures. This is part of the Python standard library, so there is nothing to install here.
Next you must encapsulate your downloading code in its own function. The function download_file does this in the example above; it is called with the URL to download and a file name to use when saving the contents.
The main part comes in the runner() function. First create an empty list of threads.
threads = []
Then create your pool of threads with your chosen number of workers (threads). This number is up to you, but for most APIs I would not go crazy here, otherwise you risk being blocked by the server. For me, 10 to 20 works well.
with ThreadPoolExecutor(max_workers=20) as executor:
Next loop through your URL list and append a new task as shown below. Here it’s clear why you needed to encapsulate your download code in a function, since the first argument to executor.submit is the function you wish to run in a new thread. The arguments after that are passed on to the download function.
You can think of this as making multiple copies or forks of the downloading function and then running each one in parallel in different threads.
threads.append(executor.submit(download_file, url, file_name))
Finally we print out the return value from each thread (in this case the status code from the API call).
for task in as_completed(threads):
print(task.result())
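If you also need to know which URL produced which result, a common pattern (shown in the concurrent.futures docs) is to key the futures by URL in a dict. A minimal sketch, reusing the download_file function from above:
future_to_url = {}
with ThreadPoolExecutor(max_workers=20) as executor:
    for url in url_list:
        # Remember which URL each submitted task belongs to
        future_to_url[executor.submit(download_file, url, uuid.uuid1())] = url
    for task in as_completed(future_to_url):
        # Print the URL alongside its status code (or exception)
        print(future_to_url[task], task.result())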
That’s it. Easy to implement, and it gives a huge speedup. In my case I ended up with this performance:
Time taken: 1357 seconds (22 minutes)
49980 files
1.03 GB
This works out to almost 37 files a second, or 2,209 files per minute. That is at least a 10x improvement in performance.
The full Python docs are here: https://docs.python.org/3/library/concurrent.futures.html
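Those docs also cover executor.map, an alternative to submit plus as_completed that returns results in the same order as the input rather than in completion order. A minimal sketch, again reusing download_file from above:
with ThreadPoolExecutor(max_workers=20) as executor:
    # One unique file name per URL, generated up front
    file_names = [uuid.uuid1() for _ in url_list]
    # map() yields each function's return value in input order
    for status in executor.map(download_file, url_list, file_names):
        print(status)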
Thanks for an excellent article. I was able to get my code working with multiple threads downloading files via API calls.
I am getting a syntax error while using this code.
Can you show me what the error is?
Hi Bob!
Thanks for this article, this was really super duper helpful!
Just one thing I noticed – you may want to indent:
for task in as_completed(threads):
print(task.result())
This prints the return value to the console as each thread is completed.
Cheers,
C
Hi Cem,
thanks for this, post is edited 👍
Cheers,
Bob
Thanks a lot!
Btw, in line:
threads.append(executor.submit(download_file, url, file_name)
you are missing “)” at the end
🙄 Thanks again, post is corrected.
Hi Bob!!
Wonderful explanation, really helpful. But can you tell me: if I need to save all the downloaded API data in a single JSON file (like multiple objects in an array within one JSON file), how can I accomplish this?
It’s -> its
Fixed 🙏