Extracting from Nested Data¶
A common problem, especially when dealing with data returned from a web site, is to extract certain elements from deep inside a nested data structure. In principle, there’s nothing more difficult about pulling something out from deep inside a nested data structure: with lists, you use [] to index or a for loop to get them all; with dictionaries, you get the value associated with a particular key using []. But it’s easy to get lost in the process and think you’ve extracted something different than you really have. Because of this, we have created a technique to help you during the debugging process.
Follow the system described below and you will have success with extracting nested data. The process involves the following steps:
- Understand the nested data object.
- Extract one object at the next level down.
- Repeat the process with the extracted object.
Understand. Extract. Repeat.
To illustrate this, we will walk through extracting information from the data returned from the Twitter API, which you will work with later in the course. This nested dictionary results from querying Twitter, asking for three tweets matching “University of Michigan”. As you’ll see, it’s quite a daunting data structure, even when printed with nice indentation as it’s shown below.
Understand¶
At any level of the extraction process, the first task is to make sure you understand the current object you have extracted. There are a few options here.
- Print the entire object. If it’s small enough, you may be able to make sense of the printout directly. If it’s a little bit larger, you may find it helpful to “pretty-print” it, with indentation showing the level of nesting of the data. We don’t have a way to pretty-print in our online browser-based environment, but if you’re running code with a full Python interpreter, you can use the dumps function in the json module. Because dumps returns a string rather than displaying it, pass the result to print. For example:
import json
print(json.dumps(res, indent=2))
- If printing the entire object gives you something that’s too unwieldy, you have other options for making sense of it.
  - Print the type of the object.
  - If it’s a dictionary:
    - print the keys
  - If it’s a list:
    - print its length
    - print the type of the first item
    - print the first item if it’s of manageable size
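The checks in the list above can be bundled into a small helper. This is only a sketch: describe is a hypothetical name, and res here is a tiny invented stand-in for the real Twitter response, not actual API output.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "search_metadata": {"count": 3},
    "statuses": [
        {"text": "invented tweet 1", "user": {"screen_name": "example_user_1"}},
        {"text": "invented tweet 2", "user": {"screen_name": "example_user_2"}},
        {"text": "invented tweet 3", "user": {"screen_name": "example_user_3"}},
    ],
}


def describe(obj):
    """Return a few summary lines about obj (hypothetical helper)."""
    lines = ["type: " + type(obj).__name__]
    if isinstance(obj, dict):
        # For a dictionary, the keys tell you what you can extract next.
        lines.append("keys: " + str(list(obj.keys())))
    elif isinstance(obj, list):
        # For a list, the length and the type of the first item are enough to start.
        lines.append("length: " + str(len(obj)))
        if obj:
            lines.append("type of first item: " + type(obj[0]).__name__)
    return lines


for line in describe(res):
    print(line)
for line in describe(res["statuses"]):
    print(line)
```

Calling describe again on each newly extracted object is one way to carry out the Understand step at every level.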
Extract¶
In the extraction phase, you will be diving one level deeper into the nested data.
- If it’s a dictionary, figure out which key has the value you’re looking for, and get its value. For example:
res2 = res['statuses']
- If it’s a list, you will typically want to do something with each of the items (e.g., extracting something from each, and accumulating them in a list). For that you’ll want a for loop, such as for res2 in res. During your exploration phase, however, it will be easier to debug things if you work with just one item. One trick for doing that is to iterate over a slice of the list containing just one item. For example, for res2 in res[:1].
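Here is a small sketch of both kinds of extraction. The contents of res are invented for illustration; only the 'statuses' key name comes from the example above.

```python
# Invented stand-in data; only the 'statuses' key name comes from the text.
res = {"statuses": [{"text": "one"}, {"text": "two"}, {"text": "three"}]}

# Dictionary: pick out the value associated with one key.
res2 = res["statuses"]

# List: while exploring, loop over a slice that holds only the first item.
visited = []
for res3 in res2[:1]:
    visited.append(res3)
    print(res3)

# Slicing an empty list gives an empty list, so this loop body never runs
# (indexing with [0] would raise an IndexError instead).
for item in [][:1]:
    print("never printed")
```

Compared with indexing (res2[0]), the slice keeps the for loop in place, so switching to the full list later only means deleting [:1].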
Repeat¶
Now you’ll repeat the Understand and Extract processes at the next level.
Level 2¶
First understand.
It’s a list, with three items, so it’s a good guess that each item represents one tweet.
Now extract. Since it’s a list, we’ll want to work with each item, but to keep things manageable for now, let’s use the trick for just looking at the first item.
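In code, this level might look like the following sketch. The res dictionary is an invented stand-in with the same outline as the response described above (a 'statuses' key holding a list of three tweet dictionaries); the 'text' key and all values are made up.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "statuses": [
        {"text": "invented tweet 1", "user": {"screen_name": "example_user_1"}},
        {"text": "invented tweet 2", "user": {"screen_name": "example_user_2"}},
        {"text": "invented tweet 3", "user": {"screen_name": "example_user_3"}},
    ]
}
res2 = res["statuses"]

# Understand: what kind of object is res2, and how big is it?
print(type(res2))
print(len(res2))

# Extract: look at just the first item for now.
for res3 in res2[:1]:
    print(res3)
```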
Level 3¶
First understand.
Then extract. Let’s pull out the information about who sent each of the tweets. Probably that’s the value associated with the ‘user’ key.
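A sketch of this step follows, using an invented stand-in for the response. The 'text' key and all values are assumptions made for illustration; 'statuses' and 'user' are the keys named in the text.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "statuses": [
        {"text": "invented tweet 1", "user": {"screen_name": "example_user_1"}},
        {"text": "invented tweet 2", "user": {"screen_name": "example_user_2"}},
        {"text": "invented tweet 3", "user": {"screen_name": "example_user_3"}},
    ]
}
res2 = res["statuses"]

for res3 in res2[:1]:
    # Understand: res3 should be a dictionary describing one tweet.
    print(type(res3))
    print(list(res3.keys()))

    # Extract: the value under 'user' describes who sent the tweet.
    res4 = res3["user"]
    print(res4)
```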
Now repeat.
Level 4¶
Understand.
Extract. Let’s print out the user’s screen name and when their account was created.
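This level could be sketched as follows. The data is an invented stand-in, and 'created_at' is assumed to be the key that holds the account creation time; print the keys of your own data to confirm before relying on it.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "statuses": [
        {"user": {"screen_name": "example_user_1",
                  "created_at": "Mon Jan 01 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_2",
                  "created_at": "Tue Jan 02 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_3",
                  "created_at": "Wed Jan 03 00:00:00 +0000 2018"}},
    ]
}
res2 = res["statuses"]

for res3 in res2[:1]:
    res4 = res3["user"]

    # Understand: res4 is a dictionary, so look at its keys.
    print(type(res4))
    print(list(res4.keys()))

    # Extract: 'created_at' is an assumed key name (see the note above).
    print(res4["screen_name"], res4["created_at"])
```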
Now, we may want to go back and have it extract for all the items rather than only the first item in res2.
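Going from one item to all of them only requires dropping the [:1] slice. The sketch below also accumulates the results in a list; senders is a hypothetical name, the data is an invented stand-in, and 'created_at' is an assumed key name.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "statuses": [
        {"user": {"screen_name": "example_user_1",
                  "created_at": "Mon Jan 01 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_2",
                  "created_at": "Tue Jan 02 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_3",
                  "created_at": "Wed Jan 03 00:00:00 +0000 2018"}},
    ]
}
res2 = res["statuses"]

senders = []
for res3 in res2:  # no [:1] this time, so every tweet is visited
    res4 = res3["user"]
    senders.append((res4["screen_name"], res4["created_at"]))

for name, created in senders:
    print(name, created)
```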
Reflections¶
Notice that each time we descend a level in a dictionary, we have a [] picking out a key. Each time we look inside a list, we will have a for loop. If there are lists at multiple levels, we will have nested for loops.
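To make the last point concrete, here is a sketch with lists at two levels. It assumes each tweet has an 'entities' dictionary whose 'hashtags' value is a list of dictionaries with a 'text' key. That outline and all the values are assumptions for this illustration, not something shown in the walkthrough above.

```python
# Invented stand-in; the 'entities'/'hashtags'/'text' outline is an assumption.
res = {
    "statuses": [
        {"entities": {"hashtags": [{"text": "GoBlue"}, {"text": "Michigan"}]}},
        {"entities": {"hashtags": []}},
        {"entities": {"hashtags": [{"text": "AnnArbor"}]}},
    ]
}

tags = []
for tweet in res["statuses"]:                   # list of tweets: outer for loop
    for tag in tweet["entities"]["hashtags"]:   # list inside each tweet: inner for loop
        tags.append(tag["text"])

print(tags)
```

Each [] on a key steps into a dictionary and each for steps into a list, so the two lists here produce two nested loops.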
Once you’ve figured out how to extract everything you want, you may choose to collapse things with multiple extractions in a single expression. For example, we could have this shorter version.
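One possible collapsed form is sketched below. The res dictionary is a small invented stand-in for the real response, and 'created_at' is assumed to be the key holding the account creation time.

```python
# Invented stand-in for the real response; all values are made up.
res = {
    "statuses": [
        {"user": {"screen_name": "example_user_1",
                  "created_at": "Mon Jan 01 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_2",
                  "created_at": "Tue Jan 02 00:00:00 +0000 2018"}},
        {"user": {"screen_name": "example_user_3",
                  "created_at": "Wed Jan 03 00:00:00 +0000 2018"}},
    ]
}

# The intermediate variables res2 and res4 are gone; the lookups are chained.
for res3 in res["statuses"]:
    print(res3["user"]["screen_name"], res3["user"]["created_at"])
```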
Even with this compact code, we can still count off how many levels of nesting we have extracted from, in this case four. res['statuses'] says we have descended one level (in a dictionary). for res3 in… says we have descended another level (in a list). ['user'] is descending one more level, and ['screen_name'] is descending one more level.