Introduction

This guide will show you how to use a recursive loop to download data from an API. I my case I’ll be using a GraphQL API but the approach would work in a similar way for any other REST API.
This guide will not go into the details about the workings of GraphQL as those guides are readily available online.

Paginated APIs

If the API you are accessing has a large number of records to return they will often return data in batches, called pages, and you need to move forward through the data using a method called pagination. The reason for this is to reduce server load, so instead of returning a massive 100,000 record JSON object you get the data in smaller chunks.

Essentially you call the API requesting say 200 records and you get 200 back, you then request 200 more and so on until there are no more records returned. A typical request looks like below.

GraphQL query
Typical GraphQL query with pagination

GraphQL API

Using GraphQL we use a connection to paginate through the records using cursor-based pagination. In this case I’ll be downloading product data from an eCommerce website using a ProductConnection. The query gets 200 product numbers and descriptions per request.

The important parts of the request object is (first: 200,after:””). This requests the first 200 records using a cursor which is initially empty, hence the blank string (escaped with backslashes which KNIME requires, if you test in Postman for example you don’t need the backslashes).

The hasNextPage object will be returned with the value as either false, if there are no more records to retrieve or true if there are more records.

The cursor object will also be populated with a string which is a pointer to the record returned. To get the next records we need to find the last cursor returned and on the next iteration ask for the next 200 records after this cursor.

Now we need to put this together into a KNIME workflow.

The KNIME Workflow

To download the data using KNIME we need to use a combination of the Recursive Loop node and POST Request to send the query to the GraphQL endpoint. The finished workflow is shown below.

KNIME Recursive Loop workflow
KNIME Recursive Loop

The elements of the workflow are:

  • Input the query string (plus anything else required for the POST request, URL, authentication etc.)
  • Start the recursive part of the flow
  • POST the query string to the Graph QL endpoint
  • Parse the response to collect the product data (middle branch) to feed into the Loop end

Top Branch

  • Use a Column Filter to isolate the hasNextPage value
  • Convert hasNextPage from text to a boolean and invert the value
  • Convert the row to a variable and feed into the loop end.
  • Recursive loops can be set to stop using a variable that should be true when you want it to stop. The value hasNextPage returns true when there are more pages so we need to invert it so when it returns false (no more pages) the variable becomes true and we stop the loop.

Bottom Branch

  • Using a String Manipulation node to replace the after:”” part of the query with the extracted cursor value so it reads something like after:”YOURCURSOR”. I use regexReplace for this.
  • Filter the columns and feed back into the recursion port of the End loop node

The GraphQL Data

On the first iteration since we have no cursor we ask for the first 200 records after an empty cursor (after:””).
We parse the JSON payload returned and get the last cursor which in our case will be the 200th cursor value. We get a cursor for every record returned and it usually looks like a random string something like “AeghYtn=“.

Cursor object returned from GraphQL
One cursor for each record


Using this cursor we place this into the current query object so the next request is for records after:“AeghYtn=“.

Using the cursor in the next query request
Get more records using the last cursor as the starting point

At the same time we check to see if there are more pages to be returned. If not the loop will end otherwise another iteration will run using the new query string in the next POST request.

hasNextPage object
Object to determine if we have more pages

This continues until there are no more records to be read.

The KNIME Details

I haven’t shown all the nodes in details as many are standard to many flows but I’ve highlighted the ones critical to a recursive loop using GraphQL APIs and numbered them below.

Itemised workflow in KNIME
Numbered details described below

[1] JSON Path

As well as parsing the actual data returned we get the cursor and the next page value. A trick to get the last cursor is to use the following syntax.

$['data']['productConnection']['edges'][-1:]['cursor']

Using [-1:] to access the last cursor element makes the node dynamic so it we change the workflow to get 500 records at once it will still work. If we had used [199] (remember arrays are zero indexed) to get the last cursor this would only work when we request 200 records.

The path to get the hasNextPage value is much simpler as the element only exists once.

JASON Path setup
JSON paths to return last cursor

[2] String Manipulation

Here we modify the query string for the next iteration so instead of an empty curser we populate the query with the last cursor returned from the JSON Path node.

Essentially we need to replace the string in quotation marks used in the after:”” clause. To do this I use a regexReplace function as show below.

regexReplace($query$, "after:\\\"[A-za-z0-9=]*\\\"", join("after:\\\"",$cursor$,"\\\""))

This of course needs to also work on the next iteration when the after:”” part is not an empty string but will contain the cursor from the previous iteration. In my case all cursors contain either letters, numbers or the equals operator so I test for [A-za-z0-9=].

String manipulation to edit query string
regexReplace on query string

[3] Rule Engine

This is part of the flow to control the iterations. We invert the value of hasNextPage so when there are no more pages it becomes true and the iteration loop stops.

Rule engine to invert hasNextPage
Invert hasNextPage value

[4] Recursive Loop End

To end the recursion we use the variable we created from hasNextPage. All the data from each iteration will now be collected so you can continue to process the data downstream.

Recursive Loop End
End loop when hasNextPage (inverted) is true

Summary

Recursive loops can be tricky to get working but once you understand the basics of recursion and how KNIME loops work they are a very useful tool and can be used in a wide range of applications where you are required to download large amounts of data via APIs.

The documentation for the Recursive loop can be found on the KNIME Hub.

Leave a comment

Your email address will not be published.