Multi-threading API Requests in Python

Speeding up python using multi-threading

When making hundred or thousands of API calls things can quickly get really slow in a single threaded application.

No matter how well your own code runs you’ll be limited by network latency and response time of the remote server. Making 10 calls with a 1 second response is maybe OK but now try 1000. Not so fun.

For a recent project I needed to make almost 50.000 API calls and the script was taking hours to complete. Now looking into multi-threading applications was no longer an option, it was required.

Classic Single Threaded Code

This is the boilerplate way to make an API request and save the contents as a file. The code simply loops through a list of URLs to call and downloads each one as a JSON file giving it a unique name.

import requests
import uuid
url_list = ['url1', 'url2']
for url in url_list:
    html = requests.get(url, stream=True)
    file_name = uuid.uuid1()
    open(f'{file_name}.json', 'wb').write(html.content)

Multi Threaded Code

For comparison here is the same code running multi-threaded.

import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

url_list = ['url1', 'url2']

def download_file(url, file_name):
        html = requests.get(url, stream=True)
        open(f'{file_name}.json', 'wb').write(html.content)
        return html.status_code
    except requests.exceptions.RequestException as e:
       return e

def runner():
    threads= []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for url in url_list:
            file_name = uuid.uuid1()
            threads.append(executor.submit(download_file, url, file_name))
        for task in as_completed(threads):

Breaking it down you first need to import ThreadPoolExecutor and as_completed from concurrent.futures. This is a built-in python library so no need to install anything here.

Next you must encapsulate you downloading code into its own function. The function download_file does this in the above example, this is called with the URL to download and a file name to use when saving the downloaded contents.

The main part comes in the runner() function. First create an empty list of threads.

threads = []

Then create your pool of threads with your chosen number of workers (threads). This number is up to you but for most APIs I would not go crazy here otherwise you risk being blocked by the server. For me 10 to 20 works well.

 with ThreadPoolExecutor(max_workers=20) as executor:

Next loop through your URL list and append a new thread as shown below. Here it’s clear why you need to encapsulate your download code into a function since the first argument is the name of the function you wish to run in a new thread. The arguments after that are the arguments being passed to the download function.

You can think of this as making multiple copies or forks of the downloading function and then running each one in parallel in different threads.

threads.append(executor.submit(download_file, url, file_name)

Finally we print out the return value from each thread (in this case we returned the status code fro the API call)

for task in as_completed(processes):

That’s it. Easy to implement and gives a huge speedup. In my case I ended up with this performance.

Time taken: 1357 seconds (22 minutes)
49980 files
1.03 Gb

This works out at almost 37 files a second or 2209 files per minute. This is at least a 10x improvement in performance.

The full python docs are here,

Using Python with One Time Passwords (2FA)

Ever wanted to automate a process involving a vendor website and come across this. Me too πŸ˜”

Secure..but annoying to automate

One time passwords (aka MFA, 2FA, or two factor authentication as I’ll say from here on) are something everyone should use and are great for security, not so great for automation.

Until recently I thought this was hard to overcome without some super hacky solution but if you’re using selenium with python the solution is at hand and really easy to implement.

This solution works if you’re using an authenticator app like Google Authenticator that generates time based tokens that change every 60 seconds (which I recommend over SMS due to the sim swap security issue).

Using pyotp

The pyotp library handles creating new one time tokens. To enable the creating of new tokens we must know the secret key that we use when initially setting up the OTP.

When enabling 2FA you are usually prompted to scan a QR code. In reality this is simply a convenient way to enter the secret πŸ”‘, since the code is just the secret key encoded as a QR code.

As well as the QR code there’s always an option to enter the key manually, use this option and copy the key to a secure location. The key will usually be a long string of letters and numbers like E7XCRPCJABXKM575P3EIVNKYVG3DBRZD.

Note that if someone else gets your secret πŸ”‘ they will also be able to generate your tokens so please look after them!

Authenticator apps like Google Authenticator work offline by generating tokens based on the secret key and a time stamp. Using the same πŸ”‘ we can generate the same tokens in python.

The Code

First import the library and initialise with your secret πŸ”‘

from pyotp import *
# initialise with your secret key

When you want to create a token simply call this function.

token =

Now you can use the send_keys function in selenium to populate the 2FA field in the web application.

# find OTP element on page and send token
e = driver.find_elements(By.ID, "auth-mfa-otpcode"))


To test the tokens are working you can use the site below. Type anything as the username and it will give you the secret πŸ”‘ to use in your script.

Token test site


This solution could also be integrated into other applications by either embedded a small python script into the flow (using Alteryx or KNIME for example) or alternatively by creating a private API using Flask so any application supporting APIs can simply get the token via a simple API call.

You can now access any site protected by 2FA πŸš€

Do What I Mean!

DWIM (do what I mean) computer systems attempt to anticipate what users intend to do, correcting trivial errors automatically rather than blindly executing users’ explicit but potentially incorrect inputs.

When I hear about the AI singularity and how no one will have jobs in 10 years I usually just laugh.

Computers and software are really, really dumb.

Consider this, I’m using a terminal on my computer (Windows, Mac, Linux, doesn’t matter).

I’m in my home directory and want to change directory to Documents, but mistype.

C:\Users\Me>cd Document
The system cannot find the path specified.

How dumb is this? Any human would immediately see the error and know you meant to access the Documents directory. But computers are dumb. They do what you tell them, not what what you want. Blindly following rules.

We are a million miles from telling computers what result we want and letting them work out for themselves how to get the answer. Imagine if humans worked like this. I’d never ask my daughter to empty the dishwasher because explaining her all the steps involved in the process would take 100x longer than just doing it myself!

We need result based computing, not process based computing.

Two classic programmer jokes highlight the issue. Humans would never do this but computers do it every day.

My wife said: “Please go to the store and buy a carton of milk and if they have eggs, get six.” I came back with 6 cartons of milk She said, “why in the hell did you buy six cartons of milk”

“They had eggs”


A programmer goes out to get some dry cleaning. His wife told him, “While you’re out, pick up some milk”

He never came home

Daily Examples of Dumb Errors


Maybe I mean intMyNum…


Yes I know


You think?

LibreOffice Calc



You have a point


I’m scheduled to meet

The last message from the ‘Today’ panel on my MacBook is very telling. It’s clear that underlying all the smart AI is still lots of rules-based logic. This is if-then-else login, not AI.

  • Look in calendar for events in the next 24 hours
  • Parse events and turn into human friendly string
  • Use ‘from’ field as the person you’re about to meet
  • Fail

My computer has no idea ‘unknownorganizer’ isn’t a person, it doesn’t know how names work or what constitutes a name. It’s just a dumb parser.

While we read stories everyday about facial recognition or predicting flu outbreaks (although we’re a long way from predicting COVID 19 and Google Flu Trends shut down a long time ago due to inaccuracy) AI is still in it’s infancy and trying to do anything that we might call General AI is still a long way off.

The potential to make everyday software so much smarter is there, but I see very little focus on this type of work despite the idea being around for at least 60 years. The productivity gains could be huge but it seems something seemingly as simple as doing what we want is still too complex for computers to achieve.

Why are Logistics Carriers so Bad with Data?

Just to be clear when I say Logistic Carriers I mean UPS, DHL, DSV, GLS and many other three letter companies that deliver your Christmas presents, but in a business context they are also responsible for B2B deliveries of goods to shops and warehouses.

I’ve previously worked in Logistics for many years and so have some insight into integrations with Logistic Carriers and the picture is not pretty.

Now I’m working in consultancy and it’s my task to gather invoices from different carrier, process the data and present it in the form of an interactive dashboard.

Sounds easy, just use APIs to get the data, clean, reshape and present. If only πŸ™„

First up no carriers offers APIs for invoicing data. Unbelievable I know. They have APIs for tracking, label printing, updating shipping info, rate calculations, everything else but not invoicing. You have to log in to a web portal, search, select your files and download. Welcome to 1995.

It gets worse from there.


  • Provides PDF or CSV files πŸ‘
  • The CSV has 250 columns!
  • The services are provided as row items so to process data you need to pivot as one package can have many services applied
  • There are 70 different service charges!!!!
  • The service charges appear in the local language as plain text with no service codes making joining multiple files a nightmare
  • Dimensions appear as text fields like 24.0X 12.0X 8.0. Thank god for regexes.
  • Weights and quantities only appear on lines associated with actual freight costs
  • Sometimes data is just missing, orders, tracking numbers, sender, you name it.

Royal Mail

  • Provides CSV files that looks to be sourced from the original PDF
  • CSV files contain a totals row and subtotals along with other extra lines 😬
  • Charges are split across multiple lines (since the PDF does the same)
  • PDF has extra line charges not included in the CSV
  • If you download a single CSV file it’s delimited with a pipe ( | ) but if you download multiple files you get a consolidated file that’s comma delimited 😲
  • Developer site appears to be permanently down (at least it has been for the last two weeks)


  • Excel files available πŸ‘ but they look exactly like you’d expect a PDF invoice to look like 🀯
  • Excel files are still in xls format (the newer xlsx format was introduced in 2007, 13 years ago, this gives you an insight into their systems)
  • Only contains monetary values, no tracking info, no quantities of packages, individual weights etc.
  • Every line item is a summary line that’s immediately duplicated as a sub-total underneath
  • Random cells are merged and formatting makes reading data very challenging
  • Basically they’ve designed an invoice in Excel that generates the PDF they wish to send to their customers

On a side note I recently saw a job advertisement for DSV that specifically asks for strong Excel and VBA skills. Job descriptions are a great way to get an insight into a company and their systems.

What should have been a straightforward integration and data presentation exercise has tuned into a complex workflow requiring web automation, file conversion and extensive data cleaning before we can even start to look at the data.

Environmental Considerations

Moving forward companies will also have an increased demand from their investors to measure and set targets for CO2 emissions.

As things stand today most companies don’t know how their products get from A to B since the carriers are left to optimize their distributions for cost/deadlines with little consideration for the environment. Expect this to change. They will need to offer ‘Least impact’ options to customers and also provide the data to enable companies to report on their distribution environmental impact.

On the parcel tracking side there’s been lots of startup activity creating aggregators that act as a single source of data. Maybe we need that for invoicing as well πŸ€”

Loading Private Python Libraries to Alteryx at Runtime

The python tool in Alteryx is a great resource that I often use but it can easily get very cluttered with lots of reusable code that could easily be moved out and imported as a library.

Unfortunately by default you can’t just use the standard import statement since the location of your workflow at runtime is not in pythons path so it doesn’t search for libraries there.

Take this example where I have a python file called located in the same directory as my Alteryx workflow. The file contains a simple function to return the dates for the start and end of the previous two week period.

import datetime
def last_week():
    today =
    weekday = today.weekday()
    start_delta = datetime.timedelta(days=weekday, weeks=2)
    start_of_week = today - start_delta
    end_of_week = start_of_week + datetime.timedelta(days=13)
    return (start_of_week, end_of_week)

If you try to load this library from the python tool in Alteryx (import carrierpy as cr) you’ll see this error. The python tool cannot find the library as it’s not searching the current workflow path.

Module not found

Standard ways to access the current working directory also cannot be used as they return the path to the temp location used by the Jupyter notebook, not the directory the Alteryx workflow is saved in.

Workflow Constant Solution

To allow this to work we need to use the Alteryx module that we import at the start of every python tool.

In the python tool add the following code at top of the notebook to read the current path from the Alteryx library and add this to to sys.path. Then we can load modules from the local directory without issues.

from ayx import Alteryx
import pandas as pd
import sys

# get the path of the Workflow Directory
path = Alteryx.getWorkflowConstant('Engine.WorkflowDirectory')

# append this path to the python sys.path

# import from current directory
import carrierpy as cr

dates = cr.last_week()
# initialise data of lists. 
data = {'Date':[dates[0]]} 
# Create DataFrame 
df = pd.DataFrame(data) 

Sure enough this works and if we run the Jupyter notebook we see the correct data printed (make sure you run the entire workflow first otherwise the input path will not be populated and cached).

In the Alteryx workflow messages we now see the data is correctly passed out of the python tool.

Successful output

The browse tool also shows the date being output from the python tool 🐍

Correct date output into workflow

Now you can create your own libraries of often used python functions and import them into your workflows as required ♻️

Converting xls to xlsx files in Alteryx

There are numerous threads on the Alteryx Community where people are asking how to covert xls files to xlsx. The solutions generally suggest one of two approaches, either

  • Read the file in with an Input Data tool then output again in the new format
  • Use the Run Command tool to use an an Excel component to do the conversion

In my (admittedly edge!) case the server doesn’t have Excel installed and I don’t want to pay for a license just for this. Plus the file cannot be read natively by the input tool as it requires pre-processing in python first (it’s a really weird Excel file that looks like an PDF invoice but in Excel 😬)

LibreOffice Batch Converting

My solution is to use LibreOffice. It’s open source, free to use and includes batch conversions that can be run from the command line. By default the UI will not be shown when run in this way.

The basic command to do the conversion is shown below.

"C:\Program Files\LibreOffice\program\soffice.exe" --convert-to xlsx "C:\input_path\file.xls" -outdir "C:\output_path"

–convert-to xlsx should be followed by the name of the xls file to convert and the -outdir will be used to write the xlsx files to. The files will have the same name as the original just with the new extension.

Integrating this into an Alteryx workflow is just like any other using the Run Command to run an external program.

  • Use a Directory tool to read all the xls files
  • A Formula tool to create the command line for each file
  • A Run Command to first write the flow to a batch file and then run the file

The most difficult part is configuring the Run Command. The configuration should look like below so the flow is first written to a file ending with .bat followed by the tool running this newly created batch file.

The xlsconvert.bat file that is created should look something like this with a line per conversion.

"C:\Program Files\LibreOffice\program\soffice.exe" --convert-to xlsx "C:\Customer projects\Project\Alteryx\Data\Invoice - SBRY0191928.XLS" -outdir "C:\Customer projects\Project\Alteryx\Data\"
"C:\Program Files\LibreOffice\program\soffice.exe" --convert-to xlsx "C:\Customer projects\Project\Alteryx\Data\Invoice - SBRY0192237.XLS" -outdir "C:\Customer projects\Project\Alteryx\Data\"
"C:\Program Files\LibreOffice\program\soffice.exe" --convert-to xlsx "C:\Customer projects\Project\Alteryx\Data\Invoice - SBRY0192914.XLS" -outdir "C:\Customer projects\Project\Alteryx\Data\"

Each file will be processed turn and written to the -outdir you specified in the formula tool. VoilΓ .

Converted files together with original

Conversion Using the Python Tool

If you prefer to use python instead of the run command tool it’s very easy to run the same command. This code will run LibreOffice in headless mode again and silently convert the files from xls to xlsx format.

The only import required is subprocess which is already available to Alteryx in the default install.

from ayx import Alteryx
import subprocess
import os
import pandas as pd

# prepare outbound data
data = {"Files":[]}

# path with files to convert (assume in Data subdirectory to workflow)
PATH = Alteryx.getWorkflowConstant('Engine.WorkflowDirectory') + "Data"

# path to LibreOffice your executable
EXE = 'C:\Program Files\LibreOffice\program\soffice.exe'

# loop files, convert and get converted filename for outputting
for file in os.listdir(PATH):
    if file.lower().endswith("xls"):[EXE, '--convert-to','xlsx',os.path.join(PATH, file),'-outdir',PATH])
        filename, file_extension = os.path.splitext(file)
        data['Files'].append(os.path.join(PATH, filename, file_extension.lower().replace("xls", "xlsx")))


I personally prefer the python method over the run command purely because I find it more flexible and easier to set up but it works the same either way.

You could of course also use this method for any other Analytics platform such as KNIME, it could also be integrated into an RPA solution using UiPath or similar tool.

Jupyter Notebook Shortcut on MacOS using Anaconda

It’s not immediately clear on Mac OS how to start a Jupyter Notebook if you’re using Anaconda.

The actual executable is located at /Users/*YOURUSER*/opt/anaconda3/bin/jupyter-notebook but it can be a pain to either type the full link or even worse start the Anaconda application first just to open a notebook.

The easiest way I found is to make a symlink in your /usr/local/bin directory (make one if it doesn’t already exist) using the following command.

$ cd /usr/local/bin
$ ln -s /Users/*YOURUSER*/opt/anaconda3/bin/jupyter-notebook 

Then you can start jupyter-notebook from anywhere in the terminal by just typing jupyter-notebook.

Alternative Method

I had to use the above solution since conda was not in my path but it seems there’s another solution that’s even simpler. Just open a terminal, navigate to your anaconda installation and run the conda with the following arguements.

$ cd opt/anaconda3/bin
$ ./conda init zsh 

Brew Permissions Error

I recently installed Homebrew on my MacBook and immediately hit permissions issues (I’m running Catalina v 10.15.2 (19C57) and it seems permissions are an issue for many apps).

Trying to install wget gave this error.

Last login: Wed Nov 20 08:55:40 on ttys000
bob@Bobs-MacBook-Pro ~ % wget
zsh: command not found: wget
bob@Bobs-MacBook-Pro ~ % brew install wget
warning: unable to access '/Users/bob/.config/git/attributes': Permission denied
warning: unable to access '/Users/bob/.config/git/attributes': Permission denied
warning: unable to access '/Users/bob/.config/git/attributes': Permission denied

If you look at the permissions on the .config folder you immediately see the problem.

bob@Bobs-MacBook-Pro ~ % ls -la|grep .config
total 72
drwx------ 3 root staff 96 Nov 3 19:44 .config

By default it’s owned by root and my user has no permissions set. The fix is simple, change the owner to your user and give 744 permissions (must be run as sudo). Problem solved 😎

bob@Bobs-MacBook-Pro ~ % sudo chown -R bob:staff .config
bob@Bobs-MacBook-Pro ~ % sudo chmod -R 744 .config

The Knowledge Curve

Having recently changed jobs to become a Consultant after working 15 years in the Retail Industry I think about this chart a lot.

I’m personally at a stage where I think I know quite a bit but I’m mostly overwhelmed by how much more there is to know 🀯

I also need to become more comfortable with the fact keeping up with new technologies is almost hopeless, but is also not required.

With age and experience you realise the actual technology matters less than you think, what really matters are general techniques to solve problems, ways of isolating issues, seeing the big picture in your head and so on.

In other words you build mental models of how things work and apply that to problem solving. Things that only come with age and experience.

IoT Sensors – Saving Collected Data

Now that my sensors are successfully collecting data and the data is being processed by a Node-RED flow on my Raspberry Pi the final step is to add some persistence to the data.

As it stands the data is constantly streaming and being displayed on a dashboard but after that it’s discarded so there’s no way to see the min or max temperature for say the last 24 hours. We need a database to store and query the data.

If you intend to use a Raspberry Pi as I have done then I wouldn’t recommend setting the database up on the SD card as the read-write traffic can easily wear out the card prematurely. For this reason I mounted an external USB drive to the Pi and create the SQLite database on the external drive.
You can read this post to learn how to mount an external drive in Linux.

Choosing a Database

Three choices spring to mind that could work equally well on a small device with limited computing power, SQLite, MySQL and PostgreSQL. All are readily available to install and reasonably easy to set up but for our data SQLite stands out as being by far the easiest to set up and administer.

The one major downside to consider with SQLite is that it’s a file based database meaning that you won’t be able to access the data from your LAN or the internet (at least it’s not recommended). If you need to do this consider PostgreSQL or MySQL.

To install the SQLite node in Node-RED follow these instructions.

Database Setup

I use the command line to create a database for our sensor data as it’s very easy and requires no extra install. First open a terminal, create the directory and then a database, mine is called sensors in /mnt/hdd/sqlite/

$ mkdir /mnt/hdd/sqlite/
$ cd /mnt/hdd/sqlite
$ sqlite3 sensors
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.

Now your in the sqlite3 command prompt so we can go ahead and create a table for our data.

create table sensordata (location text, temp float, humidity float, timestamp integer);

Our table is very simple and just has four fields, location, temperature, humidity and timestamp. You can close sqlite3 using .quit. That’s all that’s required outside of Node-RED, our database is ready for data!

Using Node-RED with SQLite

The sqlite node requires little configuration, just add the node to the canvas and open the configuration. In the database entry click on the pencil icon to open the second screen shown below.

In the right hand screen type the location of the database you just created and make the mode Read-Write. Do Not make the mode Read-Write-Create otherwise the node will create a new database for you and will overwrite the one we already created.

The node is now configured. To write or query data in the database send SQL insert statements to the node via the msg.topic so leave the SQL Query entry as-is.

Inserting Sensor Data

To inset data we create a msg.topic containing the inset statement we require, for example.

msg.topic = "insert into sensordata values('" + location + "'," + dectmp + "," + dechum + "," + unixtime+")"

Another item to consider is that our beacons send new data every 3 seconds which is great for our live dashboard but is probably excessive for saving data. To reduce the volume of data we can add a delay node set to only allow one message per minute.

Just make sure if you have multiple sensors this applies per sensor otherwise we risk losing data form some locations.

Node-RED delay node
Allow one message per minute

We can check the data is being correctly inserted into the database using the sqlite3 command line again. Open the database in SQLite3 and query the database.

$ sqlite3 sensors
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> select count(*) from sensordata;

We see that the database has 5915 rows of data. We can now query the database to get min and max statistics per location for example and also show this data on the dashboard.

sqlite> select location,min(temp),max(temp),avg(temp) from sensordata group by location;

The flow to save the data is shown below showing the nodes to throttle the data for each sensor.

Writing data to the database
Writing data to the database

Adding Statistics

Node-RED historical sensor data
Statistics flow for historical data

This flow creates three buttons on the dashboard, one to show statistics for the last hour, one for the last 24 hours and the other for the last 7 days. The buttons are set up to send numerical data in hours, 1, 24 and 168 and the function node then creates an SQL statement for the required period by modifying the timestamp.

// converts hours injected to seconds
timespan = msg.payload*60*60

// gets current time in milliseconds, converts to seconds and subtracts the 'lookback'
msg.topic = "select max(temp) tmphigh,min(temp) tmplow,max(humidity) humhigh,min(humidity) humlow from sensordata where location='inside' and timestamp >= " + (Math.round((new Date()).getTime() / 1000)-(timespan))
return msg;

This function converts hours sent from the button node to seconds and then creates an SQL statement that filters the data using a WHERE statement that calculates the current UNIX time and minuses the required period.

Once the query is sent to SQLite the final nodes are to format the output and display the data as text on the dashboard along with a text label that changes depending on which button was pressed last.

Sensor statistics taken from SQLite
Sensor statistics taken from SQLite

We could also add statistics for when the min and max temperatures were recorded with this SQL.

sqlite> SELECT location,datetime(timestamp, 'unixepoch', 'localtime'),temp FROM sensordata WHERE temp=(SELECT min(temp) FROM sensordata);
inside|2019-10-30 06:18:23|20.91796875
inside|2019-10-30 06:18:23|20.91796875
inside|2019-10-30 06:19:23|20.91796875
inside|2019-10-30 06:30:22|20.91796875
inside|2019-10-30 06:31:22|20.91796875
inside|2019-10-30 06:31:22|20.91796875

We see the min temp has many records starting at 06:18:23 and ending 06:31:33.


We have now setup sensors, collected the data using Node-RED and MQTT, created a dashboard to show the live streaming data and also used a SQLite database to store the data for later querying.

This solution could easily be extended with extra sensors of the same type or we could add other sensors such as Hall sensors (to monitor doors or windows opening), CO2 sensors and so on.

We could also add analytics to our solution by querying the data from the database.