Magento Commerce API Fun πŸ™„

As a consultant you get to work with many different systems, some easy, some difficult and some downright weird.

Magento Commerce fits firmly into the last category.

For those that don’t know Magento is a very popular E-commerce platform owned by Adobe and used by some of the largest companies in the world.

How APIs Normally Work

When using APIs it’s standard practice to use pagination to avoid retrieving too many records at once and potentially maxing out the memory resources of either the server or your client.

Pagination is requesting records in batches and then selecting the first ‘page’ of data followed by the next and so on until all the data is downloaded.

Their Commerce API is well documented and at first look seems pretty standard.

Magento uses a fairly standard syntax to request pages, this requests 1000 records per batch and the first page of results.


It’s the return data that’s a mess.

Normally the return message would include a link to the next page, not Magento.

Normally if you request a page beyond the end of the dataset you’ll get no data,

Magento just returns the last items over and over again.

To find out if I’ve reached the end of the dataset I ended up reading the response, removing the page number, then hashing the contents and comparing to the hash of the previous returned page. If they’re the same I’m getting the same data again so I can stop πŸ™„

    # get next page
    URL = base_URL.replace('[currentPage]=1', f'[currentPage]={i}')
    r = requests.get(URL, auth=headeroauth)

    # remove page number and hash contents
    hash = gethash(r.text.replace(f'"current_page":{i}','').encode('utf-8'))
        if hash == last_hash:
            # I'm at the end
            last_hash = hash
            with open(f'{download_dir}customer_page{i}.json', 'w') as f:
except Exception as e:

This works most of the time except in the circumstance where the underlying data changes between the two ‘final’ calls so I get the same data set again but with a few ‘extra’ records. Because of thisI I still have to check for duplicates before loading the data into a database.

While Googling for the issue I was pleased to find that Reddit tends to agree with my sentiments 😁

Reddit comment on Magento API
Reddit agrees with my sentiments

Custom Alteryx Tools

So far I’ve written four Alteryx tools based on the Alteryx Python SDK.

You can install by downloading the yxi file under the releases tab in GitHub and simply double click to install. The tools will be available in the Alteryx Tool menus and function just like the built-in tools that ship with Alteryx.

Data Hasher

alt text

Takes the incoming data field and outputs a hash of the data using the selected hash, supports the following hashing algorithms.

  • DSA
  • MD4
  • MD5
  • RIPEMD160
  • SHA
  • SHA1
  • SHA224
  • SHA256
  • SHA384
  • SHA512
  • blake2b
  • blake2s
  • dsaEncryption
  • dsaWithSHA
  • ecdsa-with-SHA1
  • sha3_224
  • sha3_256
  • sha3_384
  • sha3_512
  • whirlpool
alt text

File Copy and Move

alt text

Copies or moves files from source to destination. If the destination is a path the filenames will be unchanged, otherwise it will use the provided filename.

Compressed File Extractor

alt text

Takes the incoming mapped file path and extracts the contained files into the supplied location. By default it will not overwrite existing files but this can be changed in the tools configuration.

alt text


alt text

This tool has no inputs. Place tool on the canvas and optionally configure the tools settings. The tool outputs data on:

  • General Information
  • Platform Information
  • CPU
  • Memory
  • Disks
  • Network
  • Processes (optional)
  • Services (optional)
  • Environment Variables (optional)
alt text

File Hashing in Alteryx

Identifying Duplicate Files

Imagine you have a folder full of files, all with different names, where you need to identify which, if any, are duplicates.

Duplicate Files
Some of these might be the same?

You could use the file size but it’s not a guarantee the files are exactly the same just because they are the same size.

Hashing files is an easy way to identify if two files are the same as each other since even small changes in the file contents will result in a completely different hash. It also works on the file itself so the filename is irrelevant.

Current Alteryx Options

Alteryx has an md5 formula but that can only be used to calculate the hash of a field not an entire file.

The current solution is to read the file as a Blob using the blob Input tool then create an md5 hash of the blob field. While this works fine it’s very slow and you can only use md5 as the hashing algorithm.

Alteryx Blob Hashing

Python Macro Solution

To give more options and flexibility I’ve created a macros based on the python tool. The macros uses python’s hashlib library to do the hashing as this allows for 29 different hashing algorithms made available through the OpenSSL library.

Using the macro is simple, just feed data to the macro where one field contains the full path to the file to hash and select the hashing algorithm to use. The macro will hash the file and return all the input data while appending the hash field to the end.

File type is also irrelevant. It works equally on executables, images, binary files and so on.

Fike Hasher Macro

If you want to compare two folders just use two Directory tools, union them and input to the Macro.

The macro is available in the Alteryx Gallery for anyone to download.

Alteryx Gallery FileHasher

Bitcoin Proof of Work

(With try yourself code!)

Calculating hashes

As much as bitcoin (note here I’m talking about the software also known as bitcoin core) is a highly complex system many of the basic elements can be fairly easily described and demonstrated.

Here I’ll demonstrate a very simple example using python showing how mining works and how something called ‘difficulty’ is controlled.

But first we need some background.

Mining (New Blocks)

One of the most important elements is the process of mining that in turn adds new blocks to the blockchain but also creates newly minted Bitcoin.

Mining is where computers basically race each other to solve a mathematical problem. If they solve it first they create a new block of transactions on the blockchain and are rewarded with newly minted Bitcoins.

What is the Block Reward?
The amount that miners may claim as a reward for creating a block.

These new Bitcoins are known as the Block Reward which coincidently is big news right now as this reward is due to be cut in half on May 12th. This event, known as the Halving, occurs every 210000 blocks which translates to every 4 years or so.

But back to the miners.

The problem they need to solve is actually simple to explain but hard to solve. They take a group of transaction hashes and hash them together along with a random variable (called a nonce). This is shown below.

The trick is that the Bitcoin software controls the difficulty by telling the miners that the hash they need to produce must start with at least a minimum number of zeros (target hash).

Since the hashes generated are completely random the only way to generate a hash with the required number of zeros is to keep adding different random number to the original hash and try again.

What is Hashing?

In simple terms, hashing means taking an input string of any length and giving out an output of a fixed length. In the context of cryptocurrencies like bitcoin, the transactions are taken as input and run through a hashing algorithm (bitcoin uses SHA-256) which gives an output of a fixed length.

The magic of hashing is that it’s a one way process and output cannot be predicted from the input, meaning brute force computing is the only way to get the required output hash. There are no shortcuts.

The other benefit of hashing is that although they are hard to generate they can be very easily verified afterwards using almost no computing power. So my laptop can verify a miner has created a valid block with a valid hash in microseconds.

Hashing is a type of Trapdoor function since it is easy to go from input to hash but impossible to get the original input from the hash.

How does the difficulty adjust over time?

Bitcoin is programmed to generate new blocks every 10 minutes on average. A predictable supply.

In the early days the computing power used for mining was minimal so the difficulty to produce a block was relatively easy and only a few zeros were required for a valid block.

As the years have passed more and more miners come onboard with more and more powerful equipment. If the difficulty didn’t adjust new blocks would be created too fast and the average block time would be far below 10 minutes.

This would also mean the total supply (capped at 21 million bitcoin) would be minted almost immediately.

The elegance of mining and difficulty retargeting is that every 2016 blocks (about every 2 weeks) the code self adjusts the difficulty so if blocks are being mined too fast the difficulty increases, meaning miners now have to calculate hashes with more leading zeroes.

Conversely if mining power leaves the network the opposite happens. Blocks are mined too slowly so after 2016 blocks the difficulty is adjusted downwards and fewer leading zeros are required on hashes.

OK, I Need a Demonstration 🀯

To demonstrate the mining process we can use some fairly simple python code.

Python mining code

This code runs 6 loops with increasing difficulty starting with a target hash with no zeros and ending with 5. Within each difficulty it calculates a hash that matches the target hash and outputs the number of iterations required, the time taken and the hash produced.

The initial input is a text string, in this case ‘Hash try #’.

Next it creates a hash of the text and checks if the hash has the required number of leading zeros. If it does we print the output and hash and then try again but on the next loop the required zeros (target hash) increased by one.

If the hash doesn’t meet our difficulty we add a digit to the input text, hash this and test again. This loop continues until we finally get an output hash that meets our target hash.

Difficulty Increases Rapidly!!

As you can see creating a hash with one leading zero is trivially easy even on my laptop. Even three leading zeros only takes 0.012 seconds.

But things really change rapidly as we increase the difficulty so at 5 leading zeros it takes 1.45 seconds and over 800.000 iterations.

On my Macbook Pro generating a hash with 6 leading zeros took 64 seconds and over 44 million iterations.

I finally tried to generate a hash with 7 leading zeros:

Hash try # 675,193,594 in 952.3370 seconds => 000000096a22c89e6d0a2f1ea37719f8546aac9becfaf1e7875983f6df35adfa

After 675 million iterations and almost 16 minutes the target hash was found!

The current difficulty as of today requires 19 leading zeros and miners calculate 117 Exahashes per second (one Exahash is or a million million million)

Historical Bitcoin Hash Rate in Exa hashes (1018) per second

This explains why mining on regular computers is no longer possible and dedicated hardware is required located in places where electricity is cheap.

The Code

If you just copy this code and run you can try for yourself. Try adjusting the variable ZEROS_REQUIRED to another integer to change the hash target.

You can also try with a different input text by editing “Hash try #” and changing it to any text you like.

from timeit import default_timer as timer
import random
import time

def block_miner(text, digits):
    import hashlib
    n = 1
    ntext = text
    while hashlib.sha256(ntext.encode('utf-8')).hexdigest()[0:digits] != "0"*digits:
        ntext = f'{text} {n}'
    return(text,n , hashlib.sha256(ntext.encode("utf-8")).hexdigest())

if __name__ == '__main__':
    for i in range(0, ZEROS_REQUIRED + 1):
        start = time.perf_counter() 
        name, iters, hash = block_miner("Hash try #", i)   
        end = time.perf_counter()  
        print(f'{name} {iters:10,} in {end-start:7.4f} seconds => {hash}')

One of the amazing aspects of bitcoin is that everything is controlled algorithmically and can be independently validated by anyone. If you run a bitcoin node you can see every transactions and validate them for yourself if you wish.

You Should Probably Learn To Code

Code vs Platform Capabilities

There’s been a clear trend going on for many years now known as ‘No code’ or ‘Low code’.

This essentially describes tools or platforms that provide you with a GUI that abstracts away the underlying technicalities so non-technical users can use powerful software.

Doing this has been a huge benefit to business since the addressable market for Excel-like tools is 1000x greater than Coding-like tools. Business users benefit as well since they effectively leverage their skills and become a more valuable employee.

So Why Learn to Code?

It’s simple. When we talk about No Code or Low Code it just means that someone else has written the underlying code and abstracted the problem into a more easily understandable form. This usually ends up in the form of a component, tool or function you can use in the platform.

Someone else's computer

This is great for maybe 70-80% of use cases. It’s fast to develop, simple to use and has a quick learning curve using pre-built components.

However as you become more experienced you’ll discover the more problems you encounter the greater the chance that your particular use case has not been covered.

The solution tends to be adding complexity while trying to work within the constraints of your particular tool. Which in turn defeats the object of using the tool in the first place, since the whole idea is they are easy to use and understand.

This leads to the red area in the chart shown here, the part I called Platform+. This is a dangerous area as you’re pushing the limits of what your tool can do while making the solution overly complex.

Code vs Platform Capability
Platform - safe
Platform+ - might want to rethink
Code - just code it

But how do you know when you’re ventured into the red area? A few pointers.

  • An colleague experienced in your tool of choice can’t understand what you’ve done in 10 minutes
  • You rely on a patchwork of external tools, batch files, scripts, other executables linked together
  • Linked dependencies and applications become a spaghetti like construction
  • You end up using brute force on problems (parsing html and xml like it’s a string…)
  • Your solution is hard to update
  • You think the solution should be easy but it proves very difficult to achieve

For me the last point is probably the most telling sign. If I know how to solve the problem from an architectural point of view but find it really hard to actually create the solution then it’s a warning flag in my head..

The Happy Medium

You should use no or low code tools as much as possible. Development time is reduced considerably and ease of use means they are a huge benefit.

Having aid that I do think it’s important to recognize the limits of whatever tools you’re using and consider if it’s really the best solution or just the best solution you’re able to make with this tool.

Knowing how to code effectively gives you superpowers and increases the surface area of possible solutions.

Consider coding to be like an extended toolbox, you don’t always need to open the big toolbox but it’s great to know if you do you’ll always find the tool you need to fix the problem.


Web Automation Using Python and Selenium

Python automation using Selenium
Python automation using Selenium

Selenium is a software library that’s used to automate web browsers. It’s often used to run automated testing but can be used for anything involving a web browser. Since so much software is now running in the cloud and accessible via web interfaces Selenium is ideal for automating lots of manual tasks.

There’s libraries available in many different languages and for many different browsers but for this how-to I’ll be using python and Chrome.

To use Selenium and Chrome you’ll need to download an executable called ChromeDriver and place the file somewhere on your computer. The executable must be used in conjunction with an installed Chrome browser. You need both the Chrome browser and ChromeDriver installed to use Selenium.

1. Download ChromeDriver

Download the file from and place somewhere on your computer.

If you save the executable in the same directory as you existing Chrome you don’t need to specify the location in your code, otherwise you’ll need to provide the path to the executable in the code.

The version of ChromeDriver you download must match the version of Chrome you have installed. After downloading just unzip the file and place the executable in your file system.

2. Install Selenium

You can read about Selenium from their site and also read the full Python API docs.

To install simply use pip

pip install selenium

If you’re using Anaconda either use the Anaconda UI or from the conda terminal type:

conda install -c conda-forge selenium

3. Starting a new browser session

The most basic code to start a new session is shown below. Note that here I specify the path to the chromedriver.exe, if you saved it in your Chrome directory this parameter is not required.

from selenium import webdriver
from import Options

WIN_DRIVER = "YOUR PATH TO chromedriver.exe"

# control if the browser will be visible (if headless==True then invisible)

# create the options with the HEADLESS variable
chrome_options = Options()

# create a new webdriver object passing the options and path to chromedriver.exe
driver = webdriver.Chrome(options=chrome_options, executable_path=WIN_DRIVER)

Breaking this down into more detail this creates a new webdriver object which will opens a blank Chrome browser on the screen. The only import option to pass to begin with is the one to control if the browser will be visible on the screen (headless).

Chromedriver running
Blank Chrome window (note the banner at the top)

4. Navigating to new pages

To load a new web page we use the driver.get() command

Use driver.get(url) to open a web page

5. Locating Elements

Once the browser is running all that’s left to do is actually automate the actions you would like. 99% of this involves reading either reading content from pages or finding and interacting with elements on the page, such as clicking links or selecting check boxes.

Finding element by id is one of the most common and reliable ways to find an element. Here we locate an element with id mylink.

from import By

link_elem =(By.ID, 'mylink')
e = driver.find_element(*link_elem)

If an id isn’t available you can use any of these methods to find an element.

  • ID = “id”
  • XPATH = “xpath”
  • LINK_TEXT = “link text”
  • PARTIAL_LINK_TEXT = “partial link text”
  • NAME = “name”
  • TAG_NAME = “tag name”
  • CLASS_NAME = “class name”
  • CSS_SELECTOR = “css selector”

I won’t cover all the options to locate elements since the official docs do a very thorough job.

6. Interacting with Elements

Reading Attributes

Once you’ve located the element or elements you can then interact with them or read their attributes. For example on the xkcd page first get the Archive link using By.LINK_TEXT and then read it’s href attribute using the get_attribute() method.

from import By

# get the link with text Archive
link_elem =(By.LINK_TEXT, 'Archive')
e = driver.find_element(*link_elem)

# get the href of the link

You can also get multiple elements at once using find_elements (note the s on elements). On the xkcd archive page there’s a long list of links to every comic created.

xkcd archive page with links to all comics

To get all these links we can use this code. Broken down we:

  • Wait until an element (‘licenseText’) at the very bottom of the page is located
  • Locate the elements that are in the div with id=’middleContainer’ where the link text contains the word ‘Science’
  • Loop through the list of found elements and create a new list with text, href and title.
# wait for element at bottom of page to be sure it's loaded fully
link_ref = (By.ID, 'licenseText')
e = WebDriverWait(driver, 10).until(EC.presence_of_element_located(link_ref))

# get all links containing the word 'Science'
link_ref = (By.XPATH, "//div[@id='middleContainer']/a[contains(text(), 'Science')]")
e = driver.find_elements(*link_ref)

# create list of tuples with text, href and title
[(a.text,a.get_attribute('href'),a.get_attribute('title')) for a in e]

The output should look like this.

All links containing the word ‘Science’

Clicking and Selecting

If the element can be clicked, like links, buttons, check boxes etc. then simply click them!

Given a select element that looks like this:

html select
Select drop down

The code below will allow you to change the value in the drop down. You can either change the select using the value (which isn’t visible on the page but you can see in the html), the visible text or by index.

from import Select

elem_ref = (By.ID, 'my_select')
select = Select(driver.find_element(*elem_ref))

# select value the value

# select by the visible text

# select by position of the element to select

6. Exception Handling

The two most common exceptions to catch are when you cannot locate an element on the page or the script times out while waiting to find an element. To catch these we need to import the exception handlers from selenium.

# import the exception handlers
from selenium.common.exceptions import NoSuchElementException, TimeoutException

Catching when you cannot locate an element:

link_elem =(By.ID, 'my_elem')
    e = driver.find_element(*link_elem)
except NoSuchElementException:
    print('Element can't be found')

Catching when the script times out trying to locate an element:

link_ref = (By.ID, 'my_elem')
    e = WebDriverWait(driver, 10).until(EC.presence_of_element_located(link_ref))
except TimeoutException:
    print('Element can't be found')

7. Handling Timeouts and Waits

By default locating elements will run immediately with no built-in delay. This means that if you just loaded a new page you might be trying to locate elements that haven’t yet loaded into the DOM.

To handle this there’s two main options, implicit waits and explicit waits.

Explicit Waits

An explicit wait is where you specify how long to wait before the action should timeout. The script will try to locate the element until the timeout and then throw a TimeoutException which you can catch.

from import WebDriverWait
from import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

link_ref = (By.ID, 'licenseText')
    e = WebDriverWait(driver, 10).until(EC.presence_of_element_located(link_ref))
except TimeoutException:
    print('Element can't be found')

Here the script will wait for 10 seconds before timing out.

Implicit Waits

Once set implicit waits are set for every action in the script that follows. This script will wait 10 seconds every time you try to locate an element before throwing an ElementNotfoundException.

# import the exception handler
from selenium.common.exceptions import NoSuchElementException
# set the implicit wait here

    e = driver.find_element_by_id("my_element")
except NoSuchElementException:
    print('Element can't be found')

8. Alerts (Popups)

Browser popup dialog

If the page generates a browser alert or popup box you can also interact with these using Selenium. To get the alert use:

alert_obj = driver.switch_to.alert

To read the message in the alert use:


To accept, dismiss the alert use:

# to accept the default

# to cancel

9. Closing the Browser

When you’ve done with your browser session it’s good to clean up after yourself. Close the browser session when done with:


Help Locating Elements

On Chrome the Developer Tools (press Ctrl+Shift+i) are a huge help in location elements and getting their, Id’s, name or XPath. The image below shows how to access and copy the XPath of an element.

Use Chrome’s Inspector to get the XPath of elements


The jupyter-notebook example using xkcd can be downloaded from my GitHub repository.

Multi-threading API Requests in Python

Speeding up python using multi-threading

When making hundred or thousands of API calls things can quickly get really slow in a single threaded application.

No matter how well your own code runs you’ll be limited by network latency and response time of the remote server. Making 10 calls with a 1 second response is maybe OK but now try 1000. Not so fun.

For a recent project I needed to make almost 50.000 API calls and the script was taking hours to complete. Now looking into multi-threading applications was no longer an option, it was required.

Classic Single Threaded Code

This is the boilerplate way to make an API request and save the contents as a file. The code simply loops through a list of URLs to call and downloads each one as a JSON file giving it a unique name.

import requests
import uuid
url_list = ['url1', 'url2']
for url in url_list:
    html = requests.get(url, stream=True)
    file_name = uuid.uuid1()
    open(f'{file_name}.json', 'wb').write(html.content)

Multi Threaded Code

For comparison here is the same code running multi-threaded.

import requests
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

url_list = ['url1', 'url2']

def download_file(url, file_name):
        html = requests.get(url, stream=True)
        open(f'{file_name}.json', 'wb').write(html.content)
        return html.status_code
    except requests.exceptions.RequestException as e:
       return e

def runner():
    threads= []
    with ThreadPoolExecutor(max_workers=20) as executor:
        for url in url_list:
            file_name = uuid.uuid1()
            threads.append(executor.submit(download_file, url, file_name))
        for task in as_completed(threads):

Breaking it down you first need to import ThreadPoolExecutor and as_completed from concurrent.futures. This is a built-in python library so no need to install anything here.

Next you must encapsulate you downloading code into its own function. The function download_file does this in the above example, this is called with the URL to download and a file name to use when saving the downloaded contents.

The main part comes in the runner() function. First create an empty list of threads.

threads = []

Then create your pool of threads with your chosen number of workers (threads). This number is up to you but for most APIs I would not go crazy here otherwise you risk being blocked by the server. For me 10 to 20 works well.

 with ThreadPoolExecutor(max_workers=20) as executor:

Next loop through your URL list and append a new thread as shown below. Here it’s clear why you need to encapsulate your download code into a function since the first argument is the name of the function you wish to run in a new thread. The arguments after that are the arguments being passed to the download function.

You can think of this as making multiple copies or forks of the downloading function and then running each one in parallel in different threads.

threads.append(executor.submit(download_file, url, file_name)

Finally we print out the return value from each thread (in this case we returned the status code fro the API call)

for task in as_completed(processes):

That’s it. Easy to implement and gives a huge speedup. In my case I ended up with this performance.

Time taken: 1357 seconds (22 minutes)
49980 files
1.03 Gb

This works out at almost 37 files a second or 2209 files per minute. This is at least a 10x improvement in performance.

The full python docs are here,

Using Python with One Time Passwords (2FA)

Ever wanted to automate a process involving a vendor website and come across this. Me too πŸ˜”

Secure..but annoying to automate

One time passwords (aka MFA, 2FA, or two factor authentication as I’ll say from here on) are something everyone should use and are great for security, not so great for automation.

Until recently I thought this was hard to overcome without some super hacky solution but if you’re using selenium with python the solution is at hand and really easy to implement.

This solution works if you’re using an authenticator app like Google Authenticator that generates time based tokens that change every 60 seconds (which I recommend over SMS due to the sim swap security issue).

Using pyotp

The pyotp library handles creating new one time tokens. To enable the creating of new tokens we must know the secret key that we use when initially setting up the OTP.

When enabling 2FA you are usually prompted to scan a QR code. In reality this is simply a convenient way to enter the secret πŸ”‘, since the code is just the secret key encoded as a QR code.

As well as the QR code there’s always an option to enter the key manually, use this option and copy the key to a secure location. The key will usually be a long string of letters and numbers like E7XCRPCJABXKM575P3EIVNKYVG3DBRZD.

Note that if someone else gets your secret πŸ”‘ they will also be able to generate your tokens so please look after them!

Authenticator apps like Google Authenticator work offline by generating tokens based on the secret key and a time stamp. Using the same πŸ”‘ we can generate the same tokens in python.

The Code

First import the library and initialise with your secret πŸ”‘

from pyotp import *
# initialise with your secret key

When you want to create a token simply call this function.

token =

Now you can use the send_keys function in selenium to populate the 2FA field in the web application.

# find OTP element on page and send token
e = driver.find_elements(By.ID, "auth-mfa-otpcode"))


To test the tokens are working you can use the site below. Type anything as the username and it will give you the secret πŸ”‘ to use in your script.

Token test site


This solution could also be integrated into other applications by either embedded a small python script into the flow (using Alteryx or KNIME for example) or alternatively by creating a private API using Flask so any application supporting APIs can simply get the token via a simple API call.

You can now access any site protected by 2FA πŸš€

Do What I Mean!

DWIM (do what I mean) computer systems attempt to anticipate what users intend to do, correcting trivial errors automatically rather than blindly executing users’ explicit but potentially incorrect inputs.

When I hear about the AI singularity and how no one will have jobs in 10 years I usually just laugh.

Computers and software are really, really dumb.

Consider this, I’m using a terminal on my computer (Windows, Mac, Linux, doesn’t matter).

I’m in my home directory and want to change directory to Documents, but mistype.

C:\Users\Me>cd Document
The system cannot find the path specified.

How dumb is this? Any human would immediately see the error and know you meant to access the Documents directory. But computers are dumb. They do what you tell them, not what what you want. Blindly following rules.

We are a million miles from telling computers what result we want and letting them work out for themselves how to get the answer. Imagine if humans worked like this. I’d never ask my daughter to empty the dishwasher because explaining her all the steps involved in the process would take 100x longer than just doing it myself!

We need result based computing, not process based computing.

Two classic programmer jokes highlight the issue. Humans would never do this but computers do it every day.

My wife said: “Please go to the store and buy a carton of milk and if they have eggs, get six.” I came back with 6 cartons of milk She said, “why in the hell did you buy six cartons of milk”

“They had eggs”


A programmer goes out to get some dry cleaning. His wife told him, “While you’re out, pick up some milk”

He never came home

Daily Examples of Dumb Errors


Maybe I mean intMyNum…


Yes I know


You think?

LibreOffice Calc



You have a point


I’m scheduled to meet

The last message from the ‘Today’ panel on my MacBook is very telling. It’s clear that underlying all the smart AI is still lots of rules-based logic. This is if-then-else login, not AI.

  • Look in calendar for events in the next 24 hours
  • Parse events and turn into human friendly string
  • Use ‘from’ field as the person you’re about to meet
  • Fail

My computer has no idea ‘unknownorganizer’ isn’t a person, it doesn’t know how names work or what constitutes a name. It’s just a dumb parser.

While we read stories everyday about facial recognition or predicting flu outbreaks (although we’re a long way from predicting COVID 19 and Google Flu Trends shut down a long time ago due to inaccuracy) AI is still in it’s infancy and trying to do anything that we might call General AI is still a long way off.

The potential to make everyday software so much smarter is there, but I see very little focus on this type of work despite the idea being around for at least 60 years. The productivity gains could be huge but it seems something seemingly as simple as doing what we want is still too complex for computers to achieve.

Why are Logistics Carriers so Bad with Data?

Just to be clear when I say Logistic Carriers I mean UPS, DHL, DSV, GLS and many other three letter companies that deliver your Christmas presents, but in a business context they are also responsible for B2B deliveries of goods to shops and warehouses.

I’ve previously worked in Logistics for many years and so have some insight into integrations with Logistic Carriers and the picture is not pretty.

Now I’m working in consultancy and it’s my task to gather invoices from different carrier, process the data and present it in the form of an interactive dashboard.

Sounds easy, just use APIs to get the data, clean, reshape and present. If only πŸ™„

First up no carriers offers APIs for invoicing data. Unbelievable I know. They have APIs for tracking, label printing, updating shipping info, rate calculations, everything else but not invoicing. You have to log in to a web portal, search, select your files and download. Welcome to 1995.

It gets worse from there.


  • Provides PDF or CSV files πŸ‘
  • The CSV has 250 columns!
  • The services are provided as row items so to process data you need to pivot as one package can have many services applied
  • There are 70 different service charges!!!!
  • The service charges appear in the local language as plain text with no service codes making joining multiple files a nightmare
  • Dimensions appear as text fields like 24.0X 12.0X 8.0. Thank god for regexes.
  • Weights and quantities only appear on lines associated with actual freight costs
  • Sometimes data is just missing, orders, tracking numbers, sender, you name it.

Royal Mail

  • Provides CSV files that looks to be sourced from the original PDF
  • CSV files contain a totals row and subtotals along with other extra lines 😬
  • Charges are split across multiple lines (since the PDF does the same)
  • PDF has extra line charges not included in the CSV
  • If you download a single CSV file it’s delimited with a pipe ( | ) but if you download multiple files you get a consolidated file that’s comma delimited 😲
  • Developer site appears to be permanently down (at least it has been for the last two weeks)


  • Excel files available πŸ‘ but they look exactly like you’d expect a PDF invoice to look like 🀯
  • Excel files are still in xls format (the newer xlsx format was introduced in 2007, 13 years ago, this gives you an insight into their systems)
  • Only contains monetary values, no tracking info, no quantities of packages, individual weights etc.
  • Every line item is a summary line that’s immediately duplicated as a sub-total underneath
  • Random cells are merged and formatting makes reading data very challenging
  • Basically they’ve designed an invoice in Excel that generates the PDF they wish to send to their customers

On a side note I recently saw a job advertisement for DSV that specifically asks for strong Excel and VBA skills. Job descriptions are a great way to get an insight into a company and their systems.

What should have been a straightforward integration and data presentation exercise has tuned into a complex workflow requiring web automation, file conversion and extensive data cleaning before we can even start to look at the data.

Environmental Considerations

Moving forward companies will also have an increased demand from their investors to measure and set targets for CO2 emissions.

As things stand today most companies don’t know how their products get from A to B since the carriers are left to optimize their distributions for cost/deadlines with little consideration for the environment. Expect this to change. They will need to offer ‘Least impact’ options to customers and also provide the data to enable companies to report on their distribution environmental impact.

On the parcel tracking side there’s been lots of startup activity creating aggregators that act as a single source of data. Maybe we need that for invoicing as well πŸ€”