Working with Large Datasets Faster: using the map function
From Mastering Large Datasets with Python by John T. Wolohan.
An introduction to map
map can be used in place of a for loop to transform a sequence of data using a function. Consider the example of applying the mathematical function n+7 to a list of integers: [-1,0,1,2]. The graphic in figure 1 shows a series of numbers being mapped to their outputs.
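In code, the n+7 example might look like the following minimal sketch. The function name add_seven is a hypothetical choice for illustration, not something from the book.

```python
def add_seven(n):
    """Apply the n+7 transformation to a single number."""
    return n + 7

numbers = [-1, 0, 1, 2]

# map applies add_seven to every element; list() collects the results
results = list(map(add_seven, numbers))
print(results)  # [6, 7, 8, 9]
```

The output has the same length as the input, and every element went through the same function.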
Figure 1 shows the essence of map. We have an input of some length, in this case four, and an output of that same length. Each input gets transformed by the same function as all the other inputs, and these transformed inputs are returned as our output. This is all well and good, but most of us aren't concerned with middle-school math problems such as applying simple algebraic transformations. Let's take a look at a few ways that map can be used in practice to begin to see its power.
Scenario: You want to generate a call list for your sales team, but the original developers of your customer sign-up form forgot to build data validation checks into the form. As a result, the phone numbers are formatted inconsistently. For example, some are formatted nicely, such as (123) 456-7890; some are only digits, such as 1234567890; some use dots as separators, such as 123.456.7890; and others, trying to be helpful, include a country code, such as +1 123 456-7890.
First, let's tackle this problem in a way that you're probably familiar with already: a for loop. We'll do that in listing 1. Here, we first create a regular expression that matches digits and compile it. Then, we go through each phone number and get the digits out of that number with the regular expression's findall method. From here, we count off the digits from the right. We assign the first four from the right as the last four, the next three as the first three, and the next three as the area code. We assume any other digits are a country code (+1 for the United States). We store each group in a variable, use Python's string formatting to build the formatted number, and append it to a list that stores our results:
❶ Compile our regular expression
❷ Loop through all the phone numbers
❸ Gather the numbers into variables
❹ Append the numbers in the right format
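The listing these callouts describe is not reproduced here, so the following is a sketch of what such a loop could look like, following the steps in the text. The variable names, the sample data, and the exact output format "(123) 456-7890" are assumptions made for illustration.

```python
import re

# Sample inputs modeled on the formats described in the scenario
phone_numbers = [
    "(123) 456-7890",
    "1234567890",
    "123.456.7890",
    "+1 123 456-7890",
]

# Compile a pattern that matches one digit at a time
digit_pattern = re.compile(r"\d")

formatted = []
for number in phone_numbers:
    # findall returns every digit as its own one-character string
    digits = digit_pattern.findall(number)
    # Count off from the right; anything before position -10 is left out
    area_code = "".join(digits[-10:-7])
    first_three = "".join(digits[-7:-4])
    last_four = "".join(digits[-4:])
    formatted.append("({}) {}-{}".format(area_code, first_three, last_four))

print(formatted)
```

With this sample data, all four inputs come out in the same format, including the one with a leading country code.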
How do we tackle this with map? Similarly, but with map we have to separate the problem into two parts: formatting a single phone number, and applying that formatting to every phone number in our list.
First up, we'll tackle formatting the phone numbers. To do that, let's create a small class with a method that finds the last ten digits of a string and returns them in our pretty format. That class compiles a regular expression to find all the digits. We can then use the last ten digits to produce a phone number in the format we desire. If there are more than ten, we'll treat the extra leading digits as a country code and ignore them. We want to use a class (instead of a function) here because it allows us to compile the regular expression once but use it many times. Over the long run, this saves our computer the effort of repeatedly compiling the regular expression.
We'll create a prettyFormat method that expects a misformatted phone number (a string) and uses the compiled regular expression to find all of the digits. We then take the matches at positions -10, -9, and -8 using Python's slice syntax and assign them to a variable for the area code. We take the matches at positions -7, -6, and -5 as the first three digits of the phone number, and the last four matches as the last four digits. Again, any digits that occur before position -10 are ignored; these are country codes. Lastly, we use Python's string formatting to return the number in our desired format. That class looks something like listing 2.
❶ Create a class to hold our compiled regular expression
❷ Create an initialization method to compile the regular expression
❸ Create a format method to do the formatting
❹ Gather the numbers from the phone number string
❺ Return the numbers in the desired “pretty” format
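Since the listing itself is missing here, the class described above might be sketched as follows. The method name prettyFormat comes from the text; the class name PhoneFormatter, the attribute name, and the exact output format are assumptions.

```python
import re

class PhoneFormatter:
    """Holds a compiled regular expression so it is built only once."""

    def __init__(self):
        # Compile once at construction time, reuse on every call
        self.digit_pattern = re.compile(r"\d")

    def prettyFormat(self, phone_number):
        # Every digit in the input, in order, as one-character strings
        digits = self.digit_pattern.findall(phone_number)
        area_code = "".join(digits[-10:-7])
        first_three = "".join(digits[-7:-4])
        last_four = "".join(digits[-4:])
        # Digits before position -10 (a country code) are dropped
        return "({}) {}-{}".format(area_code, first_three, last_four)

formatter = PhoneFormatter()
example = formatter.prettyFormat("+1 123 456-7890")
print(example)  # (123) 456-7890
```

A bound method such as formatter.prettyFormat is an ordinary callable that takes one argument, which is what lets it be handed to map later.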
Now that we're able to turn phone numbers of any format into phone numbers in a pretty format, we can combine our class with map to apply this to a list of phone numbers of any length. To combine it with map, we'll instantiate our class and pass the method as the function that map applies to all the elements of a sequence. We can do that like this:
❶ Initialize test data to validate our function
❷ Initialize our class to use its method
❸ map the prettyFormat method across the phone numbers and print the results
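A sketch of the three steps in these callouts follows. The class is repeated in compact form only so that the snippet runs on its own; its name and the test data are assumptions, as before.

```python
import re

class PhoneFormatter:
    """Hypothetical stand-in for the class from listing 2."""

    def __init__(self):
        self.digit_pattern = re.compile(r"\d")

    def prettyFormat(self, phone_number):
        digits = self.digit_pattern.findall(phone_number)
        return "({}) {}-{}".format(
            "".join(digits[-10:-7]), "".join(digits[-7:-4]), "".join(digits[-4:])
        )

# Test data covering the formats from the scenario
phone_numbers = ["(123) 456-7890", "1234567890", "123.456.7890", "+1 123 456-7890"]

formatter = PhoneFormatter()

# map pairs the bound method with the sequence but does no work yet
lazy_results = map(formatter.prettyFormat, phone_numbers)
print(lazy_results)  # shows a generic map object, not the numbers

# Converting to a list forces every element to be formatted
pretty_numbers = list(lazy_results)
print(pretty_numbers)
```

The first print shows only the map object; the formatted numbers appear once the map has been converted to a list.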
You'll notice at the bottom that we convert our map results to a list before we print them. If we were going to use them in our code, we wouldn't need to do this; but because maps are lazy, if we print them without converting them to a list we'll see a generic map object as output. This isn't as satisfying as the nicely formatted phone numbers that we expected. Another thing you'll notice about this example is that we were set up perfectly to take advantage of map because we were doing a one-to-one transformation. We transformed each element of a sequence, which makes this problem just like our middle-school algebra example of applying n+7 to a list of numbers.
In figure 2 we can see the similarities between the two problems. In each, we're taking a sequence of data, transforming it with some function, and getting the outputs. The only differences between the two are the data type (integers versus phone number strings) and the transformation (simple arithmetic versus regular expression pattern matching and pretty printing). The key with map is recognizing situations where this pattern can be applied. Once you start looking for this pattern, you'll start to see it everywhere. Let's take a look at another, more complex version of this pattern: web scraping.
Scenario: In the early 2000s, your company's arch-rival may have posted some information about their top-secret formula on their blog. All their blog posts can be accessed through a URL of the date the post was made, e.g., https://arch-rival-business.com/blog/01-01-2001 . Design a script that can get the content of every web page posted between January 1, 2001 and December 31, 2010.
At first glance, this may not seem like a scenario where we can use map. After all, map is for data transformation: taking a series of data and converting it into an equal-length series of transformed data. What we want to do is scrape a bunch of web pages. No data transformation is happening… or is it? Let's think about how we're going to get the data from our arch-rival's blog. We'll be retrieving data from URLs. These URLs, then, can be our input data. And the transformation takes these URLs and turns them into web page content. Thinking about the problem like this, we can see that it's similar to our others.
Figure 3 shows the problem posed in the same format as the previous problems we’ve solved with map. On the top, we can see the input data. Instead of phone numbers or integers, we’ll have URLs. On the bottom, again, we have our output data. This is where we’ll eventually have our HTML. In the middle, we have a function that takes each URL and returns HTML.
Retrieving URLs with map
With the problem posed like this, we know we can solve it with map. The question then becomes: how can we get a list of all these URLs? Python has a handy datetime library for solving problems like this. Here, we create a generator function that takes two tuples in (YYYY,MM,DD) format and produces a URL for each date between them. We use a generator here instead of building a list because this prevents us from storing all of these values in memory in advance. The keyword yield distinguishes this as a generator, as opposed to a traditional function, which uses return.
The majority of the work this function does comes from the date class in Python's datetime library. The date class represents a date and contains knowledge about the Gregorian calendar and some convenience methods for working with dates. You'll notice that we import the date class directly as the word date. In our function, we create two instances of this class: one for our start date and one for our stop date. Then, we let our function generate new dates until we hit our stop date. The last line of our function uses the ordinal date representation, which is the date as a count of days in which January 1 of year 1 is day 1. By incrementing this value and turning it back into a date object, we can increase our date by one day. Because the date class is calendar aware, it automatically progresses through the weeks, months, and years. It even accounts for leap years.
Lastly, it's worth looking at the line our yield statement is on. This is where we output URLs. We take the base URL of the website, http://arch-rival-business.com/blog/, and append the date formatted as an MM-DD-YYYY string to the end, as our problem specified. The strftime method from the date class allows us to use a date formatting language to turn dates into strings formatted any way we want.
❶ Import the datetime library’s date class
❷ Create our generator function
❸ Unpack the date tuples to store them as dates
❹ Loop through all the dates until we’ve reached our stop date
❺ Return the date as a path
❻ Increment the date by one day
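The generator these callouts describe might be sketched as follows. The function name urls_for_days_between and the base URL come from the text; the local variable names, and the choice to stop before (rather than on) the stop date, are assumptions.

```python
from datetime import date

def urls_for_days_between(start, stop):
    """Yield one blog URL per day from start up to, but not including, stop."""
    # Unpack the (YYYY, MM, DD) tuples into date objects
    today = date(*start)
    stop = date(*stop)
    while today < stop:
        # Format the date as MM-DD-YYYY and attach it to the base URL
        datestr = today.strftime("%m-%d-%Y")
        yield "http://arch-rival-business.com/blog/" + datestr
        # Step forward one day through the ordinal representation
        today = date.fromordinal(today.toordinal() + 1)

# A short range that crosses a leap day
sample = list(urls_for_days_between((2004, 2, 28), (2004, 3, 2)))
for url in sample:
    print(url)
```

The sample range yields URLs for February 28, February 29, and March 1 of 2004, which shows the date class handling the leap day on its own.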
Once we've got our input data, the next step is coming up with a function to turn our input data into output data. Our output data is going to be the web content at each URL. Luckily for us, Python again provides some useful tools for that in its urllib.request library. Taking advantage of that, a function like this may work for us:
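A minimal sketch of such a function is shown here. The name get_url is the one the article uses; the parameter name and the use of a with block to close the response are assumptions.

```python
from urllib import request

def get_url(path):
    """Fetch the resource at path and return the raw response body."""
    # urlopen returns a response object; read() gives its contents as bytes
    with request.urlopen(path) as response:
        return response.read()

# Example call (needs network access, so it is left commented out):
# html = get_url("http://manning.com")
```

Note that read() returns bytes rather than str; calling .decode() on the result, with an encoding suited to the page, would give a string.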
This function takes a URL and returns the HTML found at that URL. We rely on the urlopen function from Python's urllib.request library to retrieve the data at the URL. This data is returned to us as an HTTPResponse object, and we can use its read method to get the HTML, which arrives as bytes that can be decoded into a string. It's worth trying this function out in your REPL on a URL for a website you visit often (like http://manning.com) to see the function in action. Then, like in previous scenarios, we can apply this function to all the data in our sequence using map like this:
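Putting the pieces together might look like the sketch below. Both helper functions are repeated in compact form so the snippet is self-contained; their bodies are assumptions reconstructed from the descriptions in the text, while the date tuples are the ones the article passes.

```python
from datetime import date
from urllib import request

def urls_for_days_between(start, stop):
    """Yield one blog URL per day from start up to, but not including, stop."""
    today, stop = date(*start), date(*stop)
    while today < stop:
        yield "http://arch-rival-business.com/blog/" + today.strftime("%m-%d-%Y")
        today = date.fromordinal(today.toordinal() + 1)

def get_url(path):
    """Fetch the resource at path and return the raw response body."""
    with request.urlopen(path) as response:
        return response.read()

# Pair the function with the URL generator; no page is requested at this point
blog_posts = map(get_url, urls_for_days_between((2000, 1, 1), (2011, 1, 1)))
print(blog_posts)
```

Running this makes no network requests; a page would only be fetched when a value is pulled out of blog_posts, for example with next(blog_posts).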
This single line of code takes our get_url function and applies it to each and every URL generated by our urls_for_days_between function. Passing the two date tuples (2000,1,1) and (2011,1,1) to our urls_for_days_between function provides a generator of URLs for the days between January 1, 2000 and January 1, 2011: every day of the first decade of the 21st century. The result of the map call is stored in the variable blog_posts. If you run this on your local machine, the program should finish almost instantly. How is that possible? We can't scrape ten years of web pages that quickly, can we? Well, no. But with our generator function and with map, we don't try to.
map is what we call a lazy function. That means that map doesn't evaluate when we call it. Instead, when we call map, Python stores the instructions for evaluating the function and runs those instructions at the exact moment we ask for a value. This is why, when we wanted to see the values of our map statements previously, we explicitly converted the maps to lists: lists in Python require the actual objects, not the instructions for generating those objects. If we think back to our first elementary example of applying n+7 across a list of numbers, [-1,0,1,2], we used figure 4 to describe map.
It's a little more accurate to think about map as in figure 5.
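One way to see these deferred instructions in action is to record when the mapped function actually runs. This is an illustrative sketch; the calls list and the add_seven name are hypothetical and exist only to make the timing visible.

```python
calls = []

def add_seven(n):
    # Record each input at the moment the function really executes
    calls.append(n)
    return n + 7

lazy = map(add_seven, [-1, 0, 1, 2])

# Snapshot taken right after creating the map: nothing has run yet
calls_before = list(calls)
print(calls_before)  # []

# Asking for one value evaluates exactly one input
first = next(lazy)
calls_after_first = list(calls)
print(first, calls_after_first)  # 6 [-1]

# Converting the rest to a list evaluates the remaining inputs
rest = list(lazy)
print(rest, calls)  # [7, 8, 9] [-1, 0, 1, 2]
```

Creating the map runs nothing; each value is computed only at the moment it is requested.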
In figure 5, we have the same input values on the top and the same function we're applying to all of those values, but our outputs have changed. Where before we had 6, 7, 8, and 9, now we have instructions. If we have the computer evaluate these instructions, the results will be 6, 7, 8, and 9. A lot of the time, our programs act as if these two are equal. As programmers, we'll need to remember that there's a slight difference: the default map in Python doesn't evaluate when called; it creates instructions for later evaluation. As a Python programmer, you've probably already seen lazy data floating around. A common place to find lazy objects in Python is the range function. When moving from Python 2 to Python 3, the Python folks decided to make range lazy, allowing Python programmers (you and me) to create massive ranges without doing two things: spending the time to generate every value up front, and holding all of those values in memory.
These benefits are the same for map. We like a lazy map because it allows us to transform a lot of data without using an unnecessarily large amount of memory or spending the time to generate all the results up front. This is exactly what we want to happen. That's all for now. If you want to learn more about the book, check it out on liveBook here and see this slide deck.
About the author:
J.T. Wolohan is a lead data scientist at Booz Allen Hamilton and a PhD researcher at Indiana University, Bloomington, affiliated with the Department of Information and Library Science and the School of Informatics and Computing. His professional work focuses on rapid prototyping and scalable AI. His research focuses on computational analysis of social uses of language online.
Originally published at https://freecontent.manning.com.