This guide will show you how to use a recursive loop to download data from an API. I my case I’ll be using a GraphQL API but the approach would work in a similar way for any other REST API.
This guide will not go into the details about the workings of GraphQL as those guides are readily available online.
If the API you are accessing has a large number of records to return they will often return data in batches, called pages, and you need to move forward through the data using a method called pagination. The reason for this is to reduce server load, so instead of returning a massive 100,000 record JSON object you get the data in smaller chunks.
Essentially you call the API requesting say 200 records and you get 200 back, you then request 200 more and so on until there are no more records returned. A typical request looks like below.
Using GraphQL we use a connection to paginate through the records using cursor-based pagination. In this case I’ll be downloading product data from an eCommerce website using a ProductConnection. The query gets 200 product numbers and descriptions per request.
The important parts of the request object is (first: 200,after:””). This requests the first 200 records using a cursor which is initially empty, hence the blank string (escaped with backslashes which KNIME requires, if you test in Postman for example you don’t need the backslashes).
The hasNextPage object will be returned with the value as either false, if there are no more records to retrieve or true if there are more records.
The cursor object will also be populated with a string which is a pointer to the record returned. To get the next records we need to find the last cursor returned and on the next iteration ask for the next 200 records after this cursor.
Now we need to put this together into a KNIME workflow.
The KNIME Workflow
The elements of the workflow are:
- Input the query string (plus anything else required for the POST request, URL, authentication etc.)
- Start the recursive part of the flow
- POST the query string to the Graph QL endpoint
- Parse the response to collect the product data (middle branch) to feed into the Loop end
- Use a Column Filter to isolate the hasNextPage value
- Convert hasNextPage from text to a boolean and invert the value
- Convert the row to a variable and feed into the loop end.
- Recursive loops can be set to stop using a variable that should be true when you want it to stop. The value hasNextPage returns true when there are more pages so we need to invert it so when it returns false (no more pages) the variable becomes true and we stop the loop.
- Using a String Manipulation node to replace the after:”” part of the query with the extracted cursor value so it reads something like after:”YOURCURSOR”. I use regexReplace for this.
- Filter the columns and feed back into the recursion port of the End loop node
The GraphQL Data
On the first iteration since we have no cursor we ask for the first 200 records after an empty cursor (after:””).
We parse the JSON payload returned and get the last cursor which in our case will be the 200th cursor value. We get a cursor for every record returned and it usually looks like a random string something like “AeghYtn=“.
Using this cursor we place this into the current query object so the next request is for records after:“AeghYtn=“.
At the same time we check to see if there are more pages to be returned. If not the loop will end otherwise another iteration will run using the new query string in the next POST request.
This continues until there are no more records to be read.
The KNIME Details
I haven’t shown all the nodes in details as many are standard to many flows but I’ve highlighted the ones critical to a recursive loop using GraphQL APIs and numbered them below.
 JSON Path
As well as parsing the actual data returned we get the cursor and the next page value. A trick to get the last cursor is to use the following syntax.
Using [-1:] to access the last cursor element makes the node dynamic so it we change the workflow to get 500 records at once it will still work. If we had used  (remember arrays are zero indexed) to get the last cursor this would only work when we request 200 records.
The path to get the hasNextPage value is much simpler as the element only exists once.
 String Manipulation
Here we modify the query string for the next iteration so instead of an empty curser we populate the query with the last cursor returned from the JSON Path node.
Essentially we need to replace the string in quotation marks used in the after:”” clause. To do this I use a regexReplace function as show below.
regexReplace($query$, "after:\\\"[A-za-z0-9=]*\\\"", join("after:\\\"",$cursor$,"\\\""))
This of course needs to also work on the next iteration when the after:”” part is not an empty string but will contain the cursor from the previous iteration. In my case all cursors contain either letters, numbers or the equals operator so I test for [A-za-z0-9=].
 Rule Engine
This is part of the flow to control the iterations. We invert the value of hasNextPage so when there are no more pages it becomes true and the iteration loop stops.
 Recursive Loop End
To end the recursion we use the variable we created from hasNextPage. All the data from each iteration will now be collected so you can continue to process the data downstream.
Recursive loops can be tricky to get working but once you understand the basics of recursion and how KNIME loops work they are a very useful tool and can be used in a wide range of applications where you are required to download large amounts of data via APIs.
The documentation for the Recursive loop can be found on the KNIME Hub.