Caching Response Content
You haven’t experienced it yet, but if you get complicated data back from a REST API, it may take you many tries to compose and debug code that processes that data in the way that you want. (See the Nested Data chapter.) It is a good practice, for many reasons, not to keep contacting a REST API to re-request the same data every time you run your program.
To avoid re-requesting the same data, we will use a programming pattern known as caching. It works like this:
- Before doing some expensive operation (like calling requests.get to get data from a REST API), check whether you have already saved (“cached”) the results that would be generated by making that request.
- If so, return that same data.
- If not, perform the expensive operation and save (“cache”) the results (e.g. the complicated data) in your cache file so you won’t have to perform it again the next time.
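The three steps above can be sketched in a few lines. This is a minimal illustration only: `slow_square` is a made-up stand-in for an expensive operation such as `requests.get`, and the cache here lives only in memory rather than in a file.

```python
import time

cache = {}       # maps an input to the result already computed for it
call_count = 0   # counts how many times the expensive operation really ran

def slow_square(n):
    """Hypothetical stand-in for an expensive operation (e.g. a web request)."""
    global call_count
    call_count += 1
    time.sleep(0.1)  # pretend this takes a while
    return n * n

def cached_square(n):
    # Step 1: check whether the result is already cached
    if n in cache:
        # Step 2: if so, return the saved result
        return cache[n]
    # Step 3: if not, do the expensive work and save the result
    result = slow_square(n)
    cache[n] = result
    return result

first = cached_square(12)    # runs slow_square
second = cached_square(12)   # served from the cache
print(first, second, call_count)
```

The second call returns immediately because the expensive function is never reached; `call_count` stays at 1.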
There are three reasons why caching is a good idea during your software development using REST APIs:
- It reduces load on the website that is providing you data. It is always nice to be courteous when using other people’s resources. Moreover, some websites impose rate limits: for example, after 15 requests in a 15 minute period, the site may start sending error responses. That will be confusing and annoying for you.
- It will make your program run faster. Connections over the Internet can take a few seconds, or even tens of seconds, if you are requesting a lot of data. It might not seem like much, but debugging is a lot easier when you can make a change in your program, run it, and get an almost instant response.
- It is harder to debug the code that processes complicated data if the content that is coming back can change on each run of your code. It’s amazing to be able to write programs that fetch real-time data like available iTunes podcasts or the latest tweets from Twitter. But it can be hard to debug that code if you are having problems that only occur on certain Tweets (e.g. those in foreign languages). When you encounter problematic data, it’s helpful if you save a copy and can debug your program working on that saved, static copy of the data.
There are some downsides to caching data. For example, if you always need to know when the data has changed, a program that relies on already-cached data by default will not show you the change. However, while you are still developing and debugging your code, caching is worth the tradeoff.
In this book, we are providing a pattern we’re calling the “caching pattern” in order to perform this caching operation.
This is not the only way to perform caching in a program. (In fact, if you go on to learn about web development, you’ll find that you encounter caching all the time – if you’ve ever had the experience of seeing old data when you go to a website and thinking, “Huh, that’s weird, it should really be different now… why am I still seeing that?” that happens because your web browser is performing a kind of caching operation on data from the internet.)
However, in this book, we’ll treat this as the default caching pattern for now.
We will use a Python dictionary to store the results of expensive operations (the calls to requests.get()).
Each invocation of requests.get might have a different URL. Each different URL represents the data that is being requested. For example, you might have a URL that points to data about “all the words that rhyme with the word rain” or “all the words that rhyme with the word orange”. You might have a URL that points to “20 photos tagged with the word ‘river’”, another that points to “50 photos tagged with the word ‘river’”, and yet another that points to “20 photos tagged with the words ‘river’ and ‘mountains’”. Etc.
When you cache data, you want to check whether you have already gotten data that corresponds to the request you’re making. If so, just use that data for processing in your program. If not, go make the request, retrieve data for it, and cache it – save it, so the next time you make exactly the same request, you’ll already have data, and won’t have to make another expensive operation happen (e.g. going to make a request to the internet). Making a request to a REST API on the internet is more “expensive” – takes more time – than getting data out of a file that’s on your computer.
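To make the idea that each distinct request has its own URL more concrete, here is a small sketch that builds full URLs from a base URL and a parameters dictionary. It uses `urlencode` from Python's standard library rather than `requests`, so it only approximates what `requests.get` does internally, and `build_url` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlencode

def build_url(baseurl, params):
    """Hypothetical helper: join a base URL and a dictionary of query parameters."""
    return baseurl + "?" + urlencode(params)

baseurl = "https://api.datamuse.com/words"
rain_url = build_url(baseurl, {"rel_rhy": "rain"})
orange_url = build_url(baseurl, {"rel_rhy": "orange"})

print(rain_url)
print(orange_url)
# Different requests give different URLs; repeating a request gives the same one.
print(rain_url == orange_url)
print(rain_url == build_url(baseurl, {"rel_rhy": "rain"}))
```

Because repeating a request produces the same string, that string (or something derived from it) can serve as a dictionary key for looking up saved data.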
A Pattern for Caching
Your goal as you look at this code should be to try to understand its structure, and understand how you could write similar code that requests and caches data from a different API, and/or how you could add a function that accesses data from another API to the same program.
When you begin writing a program that will use caching for expensive operations that get data, the first thing you need to do, if you follow the pattern we’re using in this book, is set up your cache.
You’ll need to:
- Decide what the file that holds your cached data will be called, and save that in a variable. That variable will be a global variable for the whole file. (We use CACHE_FNAME.)
- Try to open a file saved as that name.
- Try to read the contents of that file into a string.
- Try to load the data in the contents of that file into a Python object, and save that data in a variable (we use CACHE_DICTION), which will be global for the entire program, so that you can access the cached data anytime later.
- If any one of those things doesn’t work, set that variable (again, CACHE_DICTION) to an empty dictionary, which will hold the data you’ll be caching during the program.
During the program, you’ll add key-value pairs to that dictionary, and each time, you’ll dump the dictionary to a JSON-formatted string and save that string to a file with the CACHE_FNAME name.
This code does that:
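import json

CACHE_FNAME = 'cache_file_name.json'

# Try to load a previously saved cache from the file
try:
    with open(CACHE_FNAME, 'r') as cache_file:
        cache_contents = cache_file.read()
    CACHE_DICTION = json.loads(cache_contents)
except (OSError, ValueError):
    # No readable cache file, or its contents are not valid JSON: start with an empty cache
    CACHE_DICTION = {}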
Note
The reason you see the variable CACHE_FNAME in all caps is because it’s convention to use all caps for variable names that are intended to be constants – variables whose values will not change throughout the program, and will only be referred to.
Paying attention to stylistic conventions in programming like this is helpful, not because it necessarily changes how the code works, but because it will make it easier for other programmers to read your code. And knowing these conventions will make it much easier for you to understand others’ code!
It’s important to understand your end goal with this pattern of caching responses.
A cache file made with a program like this will eventually contain a JSON-formatted string which represents data from a big, complicated Python dictionary. This dictionary’s keys will be strings, each of which represents a request: for example, a string that represents “a request for data about 50 photos tagged with the word ‘mountains’”. The value corresponding to that key will be a dictionary or list that comes from making specifically that request (e.g. a request for data about 50 photos on Flickr tagged with the word ‘mountains’).
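As a rough picture of that end goal, the sketch below builds a tiny cache dictionary by hand, writes it to a file, and reads it back. The keys follow the identifier format used in this section, but the values are invented sample data, not real API responses, and the file is written to a temporary directory so that nothing on your computer is overwritten.

```python
import json
import os
import tempfile

# Invented sample data: each key identifies one request,
# each value is the data that request returned.
sample_cache = {
    "https://api.datamuse.com/wordsrel_rhy-rain": [{"word": "train"}, {"word": "plain"}],
    "https://api.datamuse.com/wordsrel_rhy-orange": [],
}

cache_path = os.path.join(tempfile.mkdtemp(), "sample_cache.json")

# Dump the dictionary to a JSON-formatted string and save it
with open(cache_path, "w") as f:
    f.write(json.dumps(sample_cache))

# Later (for example, on the next run), load it back into a dictionary
with open(cache_path, "r") as f:
    reloaded = json.loads(f.read())

print(reloaded == sample_cache)
print(sorted(reloaded.keys()))
```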
The code that you saw earlier makes that happen.
The next thing you need in order to build such a file is a function that reliably creates a string that serves as the unique identifier of a particular request to a REST API.
This function should accept two required arguments, the base url for a REST API and the dictionary of query parameters and their values that you would pass to requests.get, plus a third, optional argument: a list of any query parameters needed for the request that contain private information you would not want to share, even if you shared your data (e.g. the api_key for a request to Flickr).
The function should return a string that represents a unique identifier of a specific API request, which, given the same input to the function, will always be the same.
We’ve provided such a function you can use, called params_unique_combination:
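def params_unique_combination(baseurl, params_d, private_keys=["api_key"]):
    # Sort the keys so the same parameters always produce the same string
    alphabetized_keys = sorted(params_d.keys())
    res = []
    for k in alphabetized_keys:
        # Leave private values (like API keys) out of the identifier
        if k not in private_keys:
            res.append("{}-{}".format(k, params_d[k]))
    return baseurl + "_".join(res)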
For example, with this base url: https://api.datamuse.com/words
and this parameters dictionary: {"rel_rhy":"rain"}, an invocation of params_unique_combination like so:
params_unique_combination("https://api.datamuse.com/words",{"rel_rhy":"rain"})
would return a string that looks like this:
https://api.datamuse.com/wordsrel_rhy-rain
That’s pretty simple, because there’s only one query parameter and its associated value. But this is pretty useful when you have a complicated set of query parameters and values. (Check out the section of the book about searching for tags on Flickr!)
When you use some more complicated processes for requesting data from APIs, there are some additional layers of complication in order to cache data, but for what we’ve seen so far, this pattern and this helper function params_unique_combination will always work if you’re careful.
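One property worth seeing in action: because the function sorts the keys, the order in which parameters were added to the dictionary does not affect the identifier, and private keys are left out. The sketch below repeats the function so that it can run on its own; the Flickr-style parameter values, including the API key, are placeholders made up for illustration.

```python
def params_unique_combination(baseurl, params_d, private_keys=["api_key"]):
    alphabetized_keys = sorted(params_d.keys())
    res = []
    for k in alphabetized_keys:
        if k not in private_keys:
            res.append("{}-{}".format(k, params_d[k]))
    return baseurl + "_".join(res)

baseurl = "https://api.flickr.com/services/rest/"

# Same parameters, added in a different order; the api_key value is a placeholder
params_a = {"tags": "river", "per_page": 20, "api_key": "NOT_A_REAL_KEY"}
params_b = {"api_key": "NOT_A_REAL_KEY", "per_page": 20, "tags": "river"}

ident_a = params_unique_combination(baseurl, params_a)
ident_b = params_unique_combination(baseurl, params_b)

print(ident_a)
print(ident_a == ident_b)           # same request, same identifier
print("NOT_A_REAL_KEY" in ident_a)  # the private key is not part of it
```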
Check your understanding
Why is it important to use a function like the params_unique_combination function in this caching pattern?
- Because when requests.get encodes URL parameters, the params might be in any order, which would make it hard to compare one URL to another later on, and you could cache the same data multiple times.
  - Comparing the strings "rowling&harry+potter" and "harry+potter&rowling", they are different as far as Python is concerned, but they are the same as far as meaning to a REST API is concerned! That's why we need to manipulate these strings carefully for the cache dictionary.
- Because otherwise, it's too much data in the same function, and the program will not run.
  - There's no such thing as too much in a function to run, even though sometimes it's a good idea to break functionality up into multiple functions for clarity and ease.
- You don't, actually. This function is just a fancy way of calling requests.get.
  - This function has nothing to do with calling requests.get. It only formulates information into a unique string.
- Because the params_unique_combination function as written here is what saves the cache data file so you have it later!
  - This function does not save a cache file at all. It only formulates information into a unique string.
Finally, you’ll need to write the function to request and cache data from an API. Here, we’ll write a function requesting data from the datamuse API about words that rhyme with a certain word.
You’ll need to:
- As always, set up your function input, base url, and parameters dictionary in the function body, like you did in functions before.
- Check if the unique identifier created using the params_unique_combination function is in the cache dictionary already.
- Then, if it is, great: you don’t even need to make a request. Grab the data in the cache corresponding to that unique request, and return it (or manipulate it in some way to return what you want).
Otherwise, if the unique identifier is not in the cache dictionary yet, that’s fine.
- Make a request to the internet, using the base url and the params dictionary with requests.get, and get a response back.
- Pull the text data out of that response, and load it into a Python object.
- Add a key-value pair to the CACHE_DICTION cache dictionary, where the key is the unique identifier string representing the request, and the value is that Python object that represents the data you got back from the request.
- Dump the whole CACHE_DICTION cache dictionary to a string.
- Open the CACHE_FNAME file for writing and write the string version of the cache dictionary to that file. Then, close the file.
- Return the data (or manipulate it in some way to return what you want).
Here’s an example of such a function:
import json
import requests

def get_from_datamuse_caching(rhymes_with):
    baseurl = "https://api.datamuse.com/words"
    params_diction = {}
    params_diction["rel_rhy"] = rhymes_with
    unique_ident = params_unique_combination(baseurl,params_diction)
    if unique_ident in CACHE_DICTION:
        # Already cached: no request needed
        return CACHE_DICTION[unique_ident]
    else:
        # Not cached yet: make the request and save the parsed response
        resp = requests.get(baseurl, params=params_diction)
        CACHE_DICTION[unique_ident] = json.loads(resp.text)
        dumped_json_cache = json.dumps(CACHE_DICTION)
        fw = open(CACHE_FNAME,"w")
        fw.write(dumped_json_cache)
        fw.close() # Close the open file
        return CACHE_DICTION[unique_ident]
The same way you can write a function to get data from many REST APIs using the function structure you’ve seen before, you can write functions to get and cache data by following this pattern.
This gives you a lot of power: you can use and process a lot of data, repeatedly, that you get from REST APIs, without having to worry about, for example, not having an internet connection, the data changing in some surprising way midway through your work, or running into “rate limits” for the REST API (restrictions on how many times you can make requests to an API on the same internet connection).
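To show how the same structure carries over to other APIs, here is one possible generalization, offered as a sketch rather than as the book's own code. The function name `get_with_caching` and its `fetch` parameter are hypothetical: `fetch` stands in for the step that calls `requests.get` and parses the response, which lets the example run without an internet connection. The cache file is written to a temporary directory for the demonstration.

```python
import json
import os
import tempfile

# Temporary cache file for this demonstration only
CACHE_FNAME = os.path.join(tempfile.mkdtemp(), "demo_cache.json")
CACHE_DICTION = {}

def params_unique_combination(baseurl, params_d, private_keys=["api_key"]):
    alphabetized_keys = sorted(params_d.keys())
    res = []
    for k in alphabetized_keys:
        if k not in private_keys:
            res.append("{}-{}".format(k, params_d[k]))
    return baseurl + "_".join(res)

def get_with_caching(baseurl, params_d, fetch):
    """Hypothetical general version of the caching pattern.

    fetch(baseurl, params_d) must return the already-parsed data for the request.
    """
    unique_ident = params_unique_combination(baseurl, params_d)
    if unique_ident in CACHE_DICTION:
        return CACHE_DICTION[unique_ident]
    # Not cached yet: fetch the data, store it, and save the whole cache
    CACHE_DICTION[unique_ident] = fetch(baseurl, params_d)
    with open(CACHE_FNAME, "w") as fw:
        fw.write(json.dumps(CACHE_DICTION))
    return CACHE_DICTION[unique_ident]

fetch_calls = []  # records every time the stand-in "request" is made

def fake_fetch(baseurl, params_d):
    """Stand-in for requests.get plus json.loads; returns invented data."""
    fetch_calls.append(params_d)
    return [{"word": "train"}]

first = get_with_caching("https://api.datamuse.com/words", {"rel_rhy": "rain"}, fake_fetch)
second = get_with_caching("https://api.datamuse.com/words", {"rel_rhy": "rain"}, fake_fetch)
print(first == second, len(fetch_calls))
```

With a real API, `fetch` could be a small function that calls `requests.get(baseurl, params=params_d)` and returns `json.loads(resp.text)`, as in the datamuse function above.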