ARTICLE

Working with Files in Python

Manning Publications
29 min read · May 5, 2020

From Python Workout by Reuven Lerner

This article delves into working with files of various types in Python.

__________________________________________________________________

Take 37% off Python Workout by entering fcclerner into the discount code box at checkout at manning.com.
__________________________________________________________________

Files

Files are an indispensable part of the world of computers in general and of programming in particular. We read data from files, and write to files. We work with networks, which often (like in Unix) use abstractions deliberately designed to look like everything is a file. To normal, everyday users, there are different types of files — Word, Excel, PowerPoint, and PDF, among others. To programmers, things are a bit more complicated; we see files as a bunch of bytes (or characters), from which we can read into strings and into which we can write strings. Although Python is certainly capable of working with files in all sorts of different formats, the assumption in this article is that we’ll be dealing with plain-text files.

Perhaps the files have data formatted in a certain way, such as CSV or JSON — but the bulk of the parsing is done by Python modules and using simple string methods. We certainly won’t be parsing anything as complex as PDF. And to be honest, you probably shouldn’t be doing that either; you should rely on modules that others have written, which take care of such things.

Last line

Ask the user for the name of a text file. Display the final line of that file.

Solution:

filename = input("Enter a filename: ")
previous_line = ''
for current_line in open(filename):  # ❶
    previous_line = current_line
print(previous_line, end='')  # ❷

❶ Iterate over each line of the file. You don’t need to declare a variable; iterate directly over the result of open.

❷ The end parameter to the print function lets you specify what should be printed after the call to print. By default, it’s a newline character.

Discussion

The above code uses a number of common Python paradigms and expressions. Although you could certainly implement this solution in less code, this version tries to reduce the amount of memory used, and we can apply this solution to long files.

When you invoke open, you’re creating an object, one which is referred to in many other languages as a “file handle,” a sort of agent or mediator between our program and the outside world. We tend to think of these as “file” objects, but Python 3 created an entire hierarchy of objects that follow a common set of protocols. Although we can talk loosely about “file” objects, the fact is that open usually returns an _io.TextIOWrapper object.

(No, it’s not a name that rolls easily off of the tongue.) Here’s how open is usually invoked:

f = open(filename)

If you’re planning to iterate over a file, then there’s not necessarily any reason to set a variable. In the case of this exercise, I think that it’s preferable not to create a new variable, but rather to iterate over our anonymous file object, the result of open(filename).

It’s typical in Python to iterate with a for loop over a file object, as in:

for line in open(filename):
    print(len(line))  # ❶

❶ len returns the number of characters in a string, so this prints the length of each line. Remember that the line includes the trailing \n character.

You don’t have to iterate over the lines of a file. Indeed, if you’re working with binary files, then the concept of a “line” in the file is nonsensical.
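As a minimal sketch of reading a file without relying on lines, the following opens a file in binary mode and reads it in fixed-size chunks. The helper name count_bytes and the chunk size of 4096 are assumptions made for illustration; they don't come from the exercise.

```python
import os
import tempfile

def count_bytes(path, chunk_size=4096):
    """Count the bytes in a file by reading fixed-size chunks (hypothetical helper)."""
    total = 0
    with open(path, 'rb') as f:          # 'rb' means read, binary mode
        while True:
            chunk = f.read(chunk_size)   # returns a bytes object, empty at end of file
            if not chunk:
                break
            total += len(chunk)
    return total

# Demonstrate on a small temporary file containing 3,000 bytes
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'\x00\x01\x02' * 1000)
    demo_path = tmp.name

demo_total = count_bytes(demo_path)
os.remove(demo_path)
print(demo_total)
```

Reading in chunks keeps memory use bounded regardless of the file's size, which matters for binary data that has no newline characters to break it up.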

In this particular exercise, you were asked to print the final line of a file. One way to do this might be:

for line in open(filename):
    pass  # ❶

print(line)

❶ pass means, “Don’t do anything.” It exists because Python requires that we have at least one line in the body of a for loop.

The above trick works because we iterate over the lines of the file and assign line in each iteration, but we don’t do anything in the body of the for loop. Rather, we use pass, which is a way of telling Python to do nothing. The reason we execute this loop is for its side effect: the final value assigned to line remains in place after the loop exits.

Looping over the rows of a file to get the final one strikes me as a bit strange, even if it works. My preferred solution, as outlined above, is to iterate over each line of the file, getting the current line but immediately assigning it to previous_line.

When we exit from the loop, previous_line contains whatever was in the most recent line; we can print it out.

Is there a real difference between these two programs? Yes, but it’s largely a semantic difference: In this version, we explicitly assign to a variable, rather than relying on Python’s variable-assignment techniques and scoping.
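One place where the difference becomes practical is an empty file: the loop body never runs, so a loop variable that was never assigned raises a NameError when printed, whereas a variable assigned before the loop still holds its initial value. The sketch below wraps the explicit-assignment approach in a function; the name last_line and the use of temporary files are assumptions made for the demonstration.

```python
import os
import tempfile

def last_line(path):
    """Return the final line of a text file, or '' if the file is empty."""
    previous_line = ''                  # defined even if the loop body never runs
    with open(path) as f:
        for current_line in f:
            previous_line = current_line
    return previous_line

# One temporary file with three lines, and one that is empty
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write('first\nsecond\nthird\n')
    full_path = tmp.name
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    empty_path = tmp.name

full_result = last_line(full_path)
empty_result = last_line(empty_path)
os.remove(full_path)
os.remove(empty_path)

print(repr(full_result))
print(repr(empty_result))
```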

Beyond the exercise

Iterating over files, and understanding how to work with the content as (and after) you iterate over them, is an important skill to have when working with Python. It’s also important to understand how to turn the contents of a file into a Python data structure, something we’ll look at several more times in this article. Here are a few ideas for things you can do when iterating through files in this way:

  1. Iterate over the lines of a text file. Find all of the words (i.e., non-whitespace surrounded by whitespace) that contain only integers, and sum them.
  2. Create a text file (using an editor, not necessarily Python) containing two tab-separated columns, with each column containing a number. Then read through the file you created with Python. For each line, multiply the first number by the second, and then sum the results. Ignore any line that doesn’t contain two numeric columns.
  3. Read through a text file, line by line. Use a dict to keep track of how many times each vowel (a, e, i, o, and u) appears in the file. Print the resulting tabulation.

/etc/passwd to dict

This exercise assumes that you have access to a copy of /etc/passwd. If you don’t have access to such a file, you can download one that I’ve uploaded, at https://gist.github.com/reuven/7647d1af56cc8c6a9744.

The format of this file, traditionally used by Unix systems to keep track of users, passwords, and login shells, is as follows:

nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false

Each line is a set of fields, separated by colon (:) characters. The first field is the username, and the third field is the ID of the user. On my system, the nobody user has ID -2, the root user has ID 0, and the daemon user has ID 1. You can ignore all but the first and third fields in the file.

This format has one exception: a line that begins with a # character is a comment, and should be ignored by the parser. Similarly, you should ignore any empty (blank) lines in the file, containing nothing but whitespace characters.

For this exercise, you must create a dictionary based on /etc/passwd, in which the dict’s keys are usernames and the values are the numeric IDs of those users. You should then iterate through this dict, displaying one username and user ID on each line in alphabetical order.

Solution

users = {}
with open('/etc/passwd') as f:
    for line in f:
        if not line.startswith("#") and line.strip():  # ❶
            user_info = line.split(":")  # ❷
            users[user_info[0]] = user_info[2]

for username, user_id in sorted(users.items()):  # ❸
    print(f"{username}: {user_id}")

❶ Ignore comment and blank lines

❷ Turn the line into a list of strings

❸ Display the users and IDs, sorted by username

Discussion

Once again, we’re opening a text file and iterating over its lines, one at a time. Here, we assume that we know the file’s format, and that we can extract fields from within each record.

In this particular case, we’re splitting each line into fields across the “:” character, using the str.split method. str.split always returns a list, although the length of that list depends on the number of times that : occurs in the string. In the case of /etc/passwd, we can assume that any line containing : is a legitimate user record, and it has all of the necessary fields.

The file might contain comment lines beginning with #. If we were to invoke str.split (https://docs.python.org/3/library/stdtypes.html?highlight=str%20split#str.split) on those lines, we’d get back a list, but one containing only a single element, leading to an IndexError exception if we tried to retrieve user_info[2]. It’s important that we ignore those lines that begin with #. Fortunately, there’s a str.startswith (https://docs.python.org/3/library/stdtypes.html?highlight=str%20split#str.startswith) method, which returns True if the line starts with the string passed as an argument. By negating this (if not line.startswith("#")), we can be sure that we’re only splitting line for legitimate lines.
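To make the difference concrete, here is a small sketch that splits one user record and one comment line. The record is taken from the sample above; the comment text is an assumption chosen for illustration.

```python
record = "root:*:0:0::0:0:System Administrator:/var/root:/bin/sh\n"
comment = "# User Database\n"   # hypothetical comment line

fields = record.split(":")
print(len(fields))               # number of fields in a user record
print(fields[0], fields[2])      # username and user ID

comment_fields = comment.split(":")
print(comment_fields)            # a single-element list: no colon to split on

try:
    comment_fields[2]
    comment_has_id = True
except IndexError:
    comment_has_id = False       # this is the error the startswith check avoids

print(comment_has_id)
```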

I also ignore any line containing only whitespace. I use str.strip to remove all whitespace from the start and end of the string—and if we get an empty string back, then we know that the line was blank, and ignore it.

Assuming that it has found a user record, our program then adds a new key-value pair to users. The key is user_info[0], and the value is user_info[2]. Notice how we can use user_info[0] as the name of a key; as long as the value of that variable contains a string, we may use it as a dictionary key.

I use with (https://docs.python.org/3/reference/compound_stmts.html#with) here to open the file, ensuring that it’s closed when the block ends. I often don’t use with when I’m reading from files, because I don’t worry much about the resources consumed by a single open file, when it comes to reading. When I’m writing to files, by contrast, I do worry and always use with.

You might have noticed a slight stylistic issue with the above program: I used a for loop in order to iterate over the file and extract fields. I could have used a dictionary comprehension, which would have compacted my code greatly by using the following:

users = {line.split(':')[0]: line.split(':')[2]
         for line in open('/etc/passwd')
         if not line.startswith('#') and line.strip()}

I love dictionary comprehensions, but recognize that it’s sometimes hard to put all of the functionality inside of them. Furthermore, their syntax can be a bit off-putting to less experienced Python developers. This dict comprehension takes advantage of the fact that we can iterate over the lines of /etc/passwd. Each line is checked to make sure that it doesn’t begin with # and that it contains non-whitespace; if it passes these tests, then we split the line across :, and use the first and third fields as a name-ID pair in the resulting dictionary.

Once we finish creating our dictionary, we iterate over it and print each key-value pair.

We could, in theory, iterate over the dict and get the keys. But I generally prefer to invoke dict.items, and get a tuple containing a key-value pair with each iteration. Then I can capture the keys and values in variables (via unpacking), giving me clearer names with which to work. I can also pass the output from dict.items to sorted, which returns the pairs sorted by key.
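The following sketch shows sorted applied to dict.items on a small, made-up users dictionary. Because str.split returns strings, the IDs here are strings too; sorting by ID numerically is an extension beyond the exercise, and uses int in a key function.

```python
# Sample data standing in for the dict built from /etc/passwd
users = {'root': '0', 'nobody': '-2', 'daemon': '1'}

# Tuples compare element by element, so the pairs sort by username
by_name = sorted(users.items())
for username, user_id in by_name:
    print(f"{username}: {user_id}")

# Sorting by numeric ID requires converting the string values
by_id = sorted(users.items(), key=lambda pair: int(pair[1]))
print(by_id)
```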

Beyond the exercise

At a certain point in your Python career, you’ll stop seeing files as sequences of characters on a disk, and start seeing them as raw material you can transform into Python data structures. I’m a particular fan of turning files into dictionaries, in no small part because many files lend themselves to this sort of translation — but you can also use more complex data structures. Here are some additional exercises you can try to help you see that connection, and make the transformation in your code:

  1. Read through /etc/passwd, creating a dict in which user login shells (the final field on each line) are the keys. Each value is a list of the users for whom that shell is defined as their login shell.
  2. Ask the user to enter integers, separated by spaces. From this input, create a dictionary whose keys are the factors for each number, and the values are lists containing those of the users’ integers which are multiples of those factors.
  3. From /etc/passwd, create a dict in which the keys are the usernames (as in the main exercise), and the values are dicts with keys (and appropriate values) for user ID, home directory, and shell.

with and context managers

It’s common to use with when opening files in Python. As we’ve seen, you can say:

with open('myfile.txt', 'w') as f:
    f.write('abc\n')
    f.write('def\n')

Most people believe, correctly, that using with ensures that the file f is flushed and closed at the end of the block.

Because with is overwhelmingly used with files for this reason, many developers believe that there’s some inherent connection between with and files, but the truth is that with is a much more general construct, one which is known as a “context manager” in Python.

The basic idea is as follows:

  1. You use with, along with an object and a variable to which you want to assign the object.
  2. The object should know how to behave inside of the context manager.
  3. When the block starts, with turns to the object and runs its __enter__ method. In the case of files, the method is defined, but does little more than return the file object itself.
  4. When the block ends, with once again turns to the object, executing its __exit__ method. This method gives the object a chance to change or restore whatever state it was using.

It’s pretty obvious, then, how with works with files: Perhaps the __enter__ method isn’t important and doesn’t do much, but the __exit__ method is important, and does a lot, specifically flushing and closing the file.

If you pass two or more objects to with, then the __enter__ methods are invoked on each of them in turn, and the __exit__ methods are invoked in the reverse order when the block ends.

Other objects can and do adhere to this protocol. Indeed, if you want, you can write your own classes to dictate how to behave inside of a with statement. (Details of how to do this are in the “resources” table at the end of the article.)
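As a rough sketch of what such a class can look like, the following toy context manager records when its block starts and ends. The class name Announcer and its events list are invented for this illustration; only the special methods __enter__ and __exit__ are part of Python's protocol.

```python
class Announcer:
    """A toy context manager (hypothetical) that records block entry and exit."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def __enter__(self):
        self.events.append(f'enter {self.name}')
        return self            # this return value is bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append(f'exit {self.name}')
        return False           # a false value means exceptions are not suppressed

with Announcer('demo') as a:
    a.events.append('inside block')

print(a.events)
```

The three arguments to __exit__ describe any exception raised inside the block; all three are None when the block finishes normally.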

Are context managers only used in the case of files? No, but this is the most common case, by far. Two other cases I can think of offhand are (a) when processing database transactions and (b) when locking certain sections in multi-threaded code. In both cases, the idea is that you want to have a section of code which is executed within a certain context, and Python’s context management, via with, comes to the rescue.

Word count

Unix systems contain many utility programs. One of the most useful to me (in my writing) is wc (http://www.computerhope.com/unix/uwc.htm), the “word count” program. If you run wc against a text file, it’ll count the characters, words, and lines that the file contains.

The challenge for this exercise is to write a version of wc in Python. Your version of wc returns four different types of information about the files:

  1. Number of characters (including whitespace)
  2. Number of words (separated by whitespace)
  3. Number of lines
  4. Number of unique words (case sensitive — “NO” is different from “no”)

The program should ask the user for the name of an input file, and then produce output for that file.

I placed a test file (wcfile.txt) at https://gist.github.com/reuven/5660728. You may download and use that file in order to test your implementation of wc. Any file will do, but if you use this one, your results will match up with mine.

Solution

filename = input("Enter a filename: ")

counts = {'characters': 0,
          'words': 0,
          'lines': 0}
unique_words = set()  # ❶

for one_line in open(filename):
    counts['lines'] += 1
    counts['characters'] += len(one_line)
    counts['words'] += len(one_line.split())

    unique_words.update(one_line.split())  # ❷

counts['unique words'] = len(unique_words)  # ❸
for key, value in counts.items():
    print(f"{key}: {value}")

❶ You can create sets with curly braces, but not if they’re empty! Use set() to create a new, empty set.

❷ set.update adds all of the elements of an iterable to a set.

❸ Stick the set’s length into counts for a combined report

Discussion

This program demonstrates a number of aspects of Python that many programmers use on a daily basis. First and foremost: Many people who are new to Python believe that by measuring four aspects of a file, they should read through the file four times. That might mean opening the file once and reading it four times, using the seek method to move the file pointer to the start. Or they might even open the file four separate times. But it’s more standard in Python to write this sort of program to loop over the file once, iterating over each line and accumulating whatever data it can find from that line.

How do we accumulate this data? We can use separate variables, and there’s nothing wrong with that, but I prefer to use a dictionary because the counts are closely related, and because it also reduces the code I need to produce a report.

Once we’re iterating over the lines of the file, how can we count the various elements?

Counting lines is the easiest part: Each iteration goes over one line, and we can add 1 to counts['lines'] at the top of the loop. Next, we want to count the number of characters in the file. Because we’re already iterating over the file, there’s not that much work to do: We get the number of characters in the current line by calculating len(one_line), and then add that to counts['characters'].

Many people are surprised that this includes whitespace characters, such as space and tab, as well as newline. Yes, even an “empty” line contains a single newline character. But if we didn’t have newline characters, it wouldn’t be obvious to the computer when it should start a new line. Such characters are necessary, and take up some space.

Next, we want to count the number of words. In order to get this count, we turn one_line into a list of words, invoking one_line.split. I invoke split without any arguments, which causes it to use all whitespace (spaces, tabs, and newlines) as delimiters. The length of the resulting list is added to counts['words'].

The final item to count is unique words. We could, in theory, use a list to store new words. But it’s much easier to let Python do the hard work for us, using a set to guarantee the uniqueness. We create the unique_words set at the start of the program, and then use unique_words.update (https://docs.python.org/3/library/stdtypes.html?#frozenset.update) to add all of the words in the current line into the set. In order for the report to work on our dictionary, we add a new key-value pair to counts, using len(unique_words) in order to count the number of words in the set.
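Here is a brief sketch of how set.update discards duplicates while preserving case differences. The three sample lines are invented for the demonstration and stand in for the lines of a file.

```python
unique_words = set()

# Sample lines standing in for the contents of a file
sample_lines = ["the quick brown fox\n",
                "jumps over the lazy dog\n",
                "The end\n"]

for one_line in sample_lines:
    # update adds every element of the list; repeated words are ignored
    unique_words.update(one_line.split())

print(len(unique_words))
print('the' in unique_words, 'The' in unique_words)
```

The word "the" appears twice in lowercase but is stored once, while "The" counts as a separate word because the comparison is case sensitive.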

Beyond the exercise

Creating reports based on files is a common use for Python, and using dictionaries to accumulate information from those files is also common. Here are some additional things you can try to do, similar to and beyond this exercise:

  1. Ask the user to enter (on one line, separated by spaces) specific words whose frequencies should be counted in a file. Then read through the file, counting how many times each word appears. Store the counts in a dictionary, in which the user-entered words are the keys, and the counts are the values.
  2. Create a dictionary in which the keys are the names of files on your system, and the values are the sizes of those files. To calculate the size, you can use os.stat (https://docs.python.org/3/library/os.html?highlight=os%20stat#os.stat).
  3. Given a directory, read through each file and count the frequency of each letter. (Force letters to be lowercase, and ignore non-letter characters.) Use a dictionary to keep track of the letter frequencies. What are the five most common letters?

Longest word

Ask the user for the name of a directory. For each regular file in that directory, find the longest word. Return a dictionary of filenames and longest words.

Solution

import os
dirname = input("Enter a directory name: ")

def longest_word(filename):
    longest_word = ''
    for one_line in open(filename):
        for one_word in one_line.split():
            if len(one_word) > len(longest_word):  # compare lengths, not an int with a str
                longest_word = one_word
    return longest_word

print({filename: longest_word(os.path.join(dirname, filename))  # ❶
       for filename in os.listdir(dirname)  # ❷
       if os.path.isfile(os.path.join(dirname, filename))})  # ❸

❶ Map each filename to the longest word in the file at its full path

❷ Iterate over all of the files in dirname

❸ We’re only interested in files, not directories or special files

Discussion

In this case, we’re asked to take a list of filenames, and turn them into a dictionary of filenames and strings. The strings, which are the dictionary’s values, should contain the longest word from the filename in question.

Whenever you hear that you need to transform a collection of inputs into a collection of outputs, you should immediately think about comprehensions: most commonly, list comprehensions, but set comprehensions and dictionary comprehensions are also quite useful. In this case, we’ll use a dictionary comprehension. The keys in the resulting dict are the filenames. Each value is a string, the longest word that appeared in the file.

How can we get the longest word in each file? By using a combination of common Python techniques: Open the file, read its contents, split the contents into a list of strings, sort the resulting list, and take the longest word. But wait a second: This means reading the entire file into a string, and then turning that potentially long string into an even longer list. We then sort the list, using even more memory to get a single word. This works well for short files, but it’s going to use a lot of memory for even medium-sized files.

My solution is to iterate over every line of a file, and then over every word in the line. If we find a word which is longer than the current longest_word, then we replace the old word with the new one.

Also note the use of os.path.join (https://docs.python.org/3/library/os.path.html?highlight=os%20path%20join#os.path.join) to combine directory and filenames. This is shorter, cleaner, and more platform independent than combining strings together.

I should add that the max function takes a key argument, like sorted. We could shorten the implementation of longest_word in this way. I’ve found that many people don’t know about this functionality, and for them, using key is less clear. If you’re curious, though, the implementation looks like this:

import os
dirname = input("Enter a directory name: ")

def longest_word(filename):
    longest_word = ''
    for one_line in open(filename):
        # Include the current longest word so earlier lines still count
        # and blank lines don't hand max an empty list
        longest_word = max([longest_word] + one_line.split(), key=len)
    return longest_word

print({filename: longest_word(os.path.join(dirname, filename))
       for filename in os.listdir(dirname)
       if os.path.isfile(os.path.join(dirname, filename))})
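For readers who haven't used the key argument before, this short sketch contrasts max with and without it on a made-up sentence. It also shows the default argument, which max accepts for empty iterables.

```python
words = "This is a sentence with several words".split()

# Without key, strings are compared alphabetically (by code point)
alphabetical_max = max(words)

# With key=len, each word is compared by its length instead
longest = max(words, key=len)

# An empty list would raise ValueError unless a default is supplied
longest_of_nothing = max([], key=len, default='')

print(alphabetical_max, longest, repr(longest_of_nothing))
```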

Beyond the exercise

Producing reports about files and file contents using dicts and other basic data structures is commonly done in Python. Here are a few possible exercises to practice these ideas further:

  1. Use the hashlib module, and the md5 function within it, to calculate the MD5 hash for the contents of every file in a user-specified directory. Then print all of the filenames and the MD5 hashes, each one in a row.
  2. Ask the user for a directory name. Show all of the files in the directory, as well as how long ago the directory was modified. You’ll probably want to use a combination of os.stat and the “Arrow” package on PyPI to do this easily.
  3. Open an HTTP server’s logfile. (If you lack one, then you can read one from me, at https://gist.github.com/reuven/5875971.) Summarize how many requests resulted in each numeric response code (202, 304, etc.).

Directory listings

For a language that claims “there’s one way to do it,” Python has too many ways to list files in a directory. The two most common are os.listdir and glob.glob, both of which I mentioned in this article. A third way is to use pathlib, a relatively new addition to Python which provides us with an object-oriented API to the filesystem.

The easiest and most standard of these is os.listdir, a function in the os module. It returns a list of strings, the names of files in the directory. For example:

filenames = os.listdir('/etc/')

The good news is that it’s easy to understand and work with os.listdir. The bad news is that it returns a list of filenames without the directory name, which means that in order to open or work with the files, you’ll need to add the directory name at the beginning—-ideally with os.path.join, which works cross-platform.

The other problem with os.listdir is that you can’t filter the filenames by a pattern. You get everything, including subdirectories and hidden files. If you want all of the .txt files in a directory, os.listdir won’t be enough.

This is where the glob module comes in: It lets you use patterns, sometimes known as “globbing,” to describe the files that you want. Moreover, it returns a list of strings—with each string containing the complete path to the file. For example, I can get the full paths of the configuration files in /etc/ on my computer with

filenames = glob.glob('/etc/*.conf')

I don’t need to worry about other files or subdirectories in this case, which makes it much easier to work with. For a long time, glob.glob was my go-to function for finding files.
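The sketch below puts os.listdir and glob.glob side by side on a temporary directory, so that it runs the same way on any machine. The three filenames are invented for the demonstration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create three small files to list
    for name in ('a.conf', 'b.conf', 'notes.txt'):
        with open(os.path.join(tmpdir, name), 'w') as f:
            f.write('x\n')

    # os.listdir returns bare names, so the directory must be added back
    names_only = sorted(os.listdir(tmpdir))
    full_paths = [os.path.join(tmpdir, name) for name in names_only]

    # glob.glob filters by pattern and returns paths including the directory
    conf_paths = sorted(glob.glob(os.path.join(tmpdir, '*.conf')))

print(names_only)
print([os.path.basename(path) for path in conf_paths])
```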

But recently, I’ve discovered pathlib, a module that comes with the Python standard library and which makes things easier in many ways. You start by creating a pathlib.Path object, which represents a file or directory:

import pathlib
p = pathlib.Path('/etc/')

Once you have this Path object, you can do lots of things with it that previously required separate functions—-including the ones I’ve described above. For example, we can get an iterator that returns files in the directory with iterdir:

for one_filename in p.iterdir():
    print(one_filename)

In each iteration, we don’t get a string, but rather a Path object (or more specifically, on my Mac I get a PosixPath object). This allows us to do a lot more than print the filename; we can open it and inspect it with the Path object.

If you want to get a list of files matching a pattern, as I did with glob.glob, then you can use the glob method:

for one_filename in p.glob('*.conf'):
    print(one_filename)

pathlib is a great addition to recent Python versions. If you have a chance to use it, you should; I’ve found that it clarifies and shortens quite a bit of my code.
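To give a sense of what a Path object offers beyond printing, here is a small sketch that works in a temporary directory. The file names and contents are made up; the methods used (iterdir, is_file, read_text, write_text, glob) and the attributes name and suffix are part of pathlib.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    p = pathlib.Path(tmpdir)

    # The / operator joins path components
    (p / 'a.conf').write_text('alpha=1\n')
    (p / 'notes.txt').write_text('hello\nworld\n')
    (p / 'subdir').mkdir()

    # Collect each regular file's suffix and character count
    info = {}
    for one_path in p.iterdir():
        if one_path.is_file():
            info[one_path.name] = (one_path.suffix, len(one_path.read_text()))

    conf_names = sorted(path.name for path in p.glob('*.conf'))

print(info)
print(conf_names)
```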

Reading and writing CSV

In a CSV (“comma-separated values”) file, each record is stored on one line, and fields are separated by commas. CSV is commonly used for exchanging information, e.g. in the world of data science. For example, a CSV file might contain information about different vegetables:

lettuce,green,soft
carrot,orange,hard
pepper,green,hard
eggplant,purple,soft

Each line in this CSV file contains three fields, separated by commas. No headers describe the fields, although many CSV files have headers.

Sometimes, the comma is replaced by another character to avoid potential ambiguity; my personal favorite is to use a TAB character ('\t' in Python strings).

Python comes with a csv (https://docs.python.org/3/library/csv.html#module-csv) module that handles many of the tasks associated with writing to and reading from CSV files. For example, you can write to a CSV file with the following code:

import csv

with open('/tmp/stuff.csv', 'w') as f:
    o = csv.writer(f)  # ❶
    o.writerow(range(5))  # ❷
    o.writerow(['a', 'b', 'c', 'd', 'e'])

❶ Create a csv.writer object, wrapping our file f

❷ Write a full line of the CSV file, separated by commas
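Reading the data back works in much the same way, with csv.reader. In the sketch below, an in-memory io.StringIO buffer stands in for the file (an assumption made so that the example doesn't depend on /tmp existing); note that every value comes back as a string, including the numbers.

```python
import csv
import io

# An in-memory text buffer standing in for a real file
buffer = io.StringIO(newline='')

o = csv.writer(buffer)
o.writerow(range(5))
o.writerow(['a', 'b', 'c', 'd', 'e'])

# Rewind to the start, then read each row back as a list of strings
buffer.seek(0)
rows = list(csv.reader(buffer))

print(rows)
```

If you need the numbers as integers again, you would have to convert them yourself, for example with int.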

Not all CSV files necessarily look like CSV files. For example, the standard Unix /etc/passwd file, which contains information about users on a system (but no longer users’ passwords, despite its name), separates fields with : characters.

For this exercise, create a program that creates a subset of /etc/passwd in a new file. You should read from /etc/passwd and produce a new file whose contents are the username (index 0) and the user ID (index 2). Note that a record may contain a comment, in which case it won’t have anything at index 2; you should take this into consideration when writing the file. The output file should use TAB characters to separate the elements.

The input looks like:

root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
_ftp:*:98:-2::0:0:FTP Daemon:/var/empty:/usr/bin/false

and the output looks like:

root    0
daemon  1
_ftp    98

Notice that the comment line in the input file isn’t placed in the output file.

Solution

import csv

with open('/etc/passwd') as passwd, open('/tmp/output.csv', 'w') as output:
    r = csv.reader(passwd, delimiter=':')  # ❶
    w = csv.writer(output, delimiter='\t')  # ❷
    for record in r:
        if len(record) > 1:
            w.writerow((record[0], record[2]))

❶ Fields in the input file are separated by colons (:)

❷ Fields in the output file are separated by tabs (\t)

Discussion

The above program uses a number of aspects of Python which are useful when working with files.

We’ve already seen and discussed with earlier in this article. Here, you can see how with can be used to open two separate files, or generally to manage any number of objects for the duration of the with block. As soon as our block exits, both of the files are automatically closed.

We define two variables in the with statement, for the two files with which we’ll work: The passwd file is opened for reading from /etc/passwd. The output file is opened for writing, and writes to /tmp/output.csv. Our program acts as a go-between, translating from the input file and placing a reformatted subset into the output file.

We do this by creating one instance of csv.reader, which wraps passwd. Because /etc/passwd uses colons (:) to delimit fields, we must tell this to csv.reader. Otherwise, it tries to use commas, which is almost certainly not going to work well. Similarly, we define an instance of csv.writer, wrapping our output file, and indicating that we want to use TAB as the delimiter.

Now that our objects are in place for reading and writing CSV data, we can run through the input file, writing a row (line) to the output file for each of those inputs. We take the username (from index 0) and the user ID (from index 2), create a tuple, and pass that tuple to the writer’s writerow method. Because our csv.writer object knows how to write to a file, and knows what delimiter (TAB) we want to have between the elements, it takes care of these automatically.

Perhaps the trickiest thing here is to ensure that we don’t try to transform lines which contain comments, meaning those which begin with a hash (#) character. You can do this a number of ways, but the method I employed is to check the number of fields we got for the current input line. If there’s only one field, then it must be a comment line, or perhaps another type of malformed line. In such a case, we ignore the line altogether.
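One of the other ways is to filter out comment and blank lines before they ever reach csv.reader, which accepts any iterable of strings. The sketch below is an alternative under that assumption, not the solution above: the generator name data_lines is hypothetical, the sample text (including its comment line) is made up, and in-memory buffers stand in for the real files.

```python
import csv
import io

# Sample input standing in for /etc/passwd; the comment text is invented
passwd_text = """# User Database
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
"""

def data_lines(lines):
    """Yield only the lines that are neither blank nor comments (hypothetical helper)."""
    for line in lines:
        if line.strip() and not line.startswith('#'):
            yield line

output = io.StringIO()
r = csv.reader(data_lines(io.StringIO(passwd_text)), delimiter=':')
w = csv.writer(output, delimiter='\t', lineterminator='\n')

for record in r:
    w.writerow((record[0], record[2]))

result = output.getvalue()
print(result)
```

With this approach the loop body no longer needs to check the number of fields, at the cost of an extra function.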

Beyond the exercise

CSV files are extremely useful and common, and the csv module that comes with Python works with them well. If you need something more advanced for working with files, then you might want to look into pandas (http://pandas.pydata.org/), which reads and writes CSV files, as well as much more. Here are some more things you can do to practice working with CSV files:

  1. Extend this exercise by asking the user to enter a space-separated list of integers, indicating which fields should be written to the output CSV file. Also ask the user which character should be used as a delimiter. Then read from /etc/passwd, writing the user’s chosen fields, separated by the user’s chosen delimiter.
  2. Turn a dictionary into a CSV file. The fields in the CSV file should be: (1) the key, (2) the value, (3) the type of the value (e.g., str or int).
  3. Create a CSV file (using the csv module), in which each line contains ten random integers between 10 and 100. Now read the file back (again, using the csv module), and print the sum and mean of the numbers on each line.

JSON

JSON (“JavaScript Object Notation,” described at http://json.org/) is a popular format for data exchange. JSON-encoded data can be read into a large number of programming languages, including Python, using the json module (https://docs.python.org/3/library/json.html?#module-json). We can read a JSON file with json.load; the result is a combination of Python dicts, lists, ints, and strings.

In this exercise, you’re analyzing test data in a high school. A scores directory on the filesystem contains a number of files in JSON format. Each file represents the scores for one class. If we’re trying to analyze the scores from class 9a, the scores are in a file called 9a.json that looks like this:

[{"math" : 90, "literature" : 98, "science" : 97},
{"math" : 65, "literature" : 79, "science" : 85},
{"math" : 78, "literature" : 83, "science" : 75},
{"math" : 92, "literature" : 78, "science" : 85},
{"math" : 100, "literature" : 80, "science" : 90}
]

The directory may also contain files for tenth grade (10a.json, 10b.json, and 10c.json), and other grades and classes in the high school.

Note that valid JSON uses double quotes ("), not single quotes ('). This can be surprising and frustrating for Python developers to discover! Also notice that the file contains the JSON equivalent of a list of dicts.
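To see the quoting rule in action, the following sketch feeds two strings to json.loads. The variable names are only for illustration.

```python
import json

valid = '{"math": 90}'
invalid = "{'math': 90}"   # Python-style single quotes

print(json.loads(valid))

error_message = None
try:
    json.loads(invalid)
except json.JSONDecodeError as e:
    error_message = str(e)
    print(f"Not valid JSON: {error_message}")
```

The first string parses into a dict; the second raises json.JSONDecodeError, even though it would be a perfectly good dict literal in Python.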

For this exercise, you must summarize, for each class, the highest, lowest, and average test scores in each subject. Given two files (9a.json and 9b.json) in the scores directory, we see the following output:

scores/9a.json
    science: min 75, max 97, average 86.4
    literature: min 78, max 98, average 83.6
    math: min 65, max 100, average 85.0
scores/9b.json
    science: min 35, max 95, average 82.0
    literature: min 38, max 98, average 72.0
    math: min 38, max 100, average 77.0

Solution

import json
import glob

scores = {}

for filename in glob.glob("scores/*.json"):
    scores[filename] = {}

    with open(filename) as f:
        for result in json.load(f): ❶
            for subject, score in result.items():
                scores[filename].setdefault(subject, []) ❷
                scores[filename][subject].append(score)

for one_class in scores: ❸
    print(one_class)
    for subject, subject_scores in scores[one_class].items():
        min_score = min(subject_scores)
        max_score = max(subject_scores)
        average_score = sum(subject_scores) / len(subject_scores)

        print(f"\t{subject}: min {min_score}, max {max_score}, average {average_score}")

❶ Read from the file f, and turn it from JSON into Python objects

❷ Make sure that subject exists as a key in scores[filename]

❸ Summarize the scores

Discussion

In many languages, the first response to this kind of problem is: Let’s create our own class! But in Python, although we can (and often do) create our own classes, it’s often easier and faster to make use of built-in data structures — lists, tuples, and dicts.

In this particular case, we’re reading from a JSON file. JSON is a data representation, much like XML; it isn’t a data type, per se. If we want to create JSON, we must use the json module to turn our Python data into JSON-formatted strings. And if we want to read from a JSON file, we must read the contents of the file, as strings, into our program, and then turn it into Python data structures.
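The distinction between Python data and its JSON representation is easy to see with a round trip. This sketch uses a shortened, made-up version of the scores data, and the string-based functions json.dumps and json.loads.

```python
import json

scores = [{"math": 90, "literature": 98}, {"math": 65, "literature": 79}]

as_json = json.dumps(scores)       # Python data -> JSON-formatted string
print(type(as_json))
print(as_json)

round_trip = json.loads(as_json)   # JSON-formatted string -> Python data
print(round_trip == scores)
```

The value in as_json is an ordinary Python string; it only becomes a list of dicts again once json.loads has parsed it.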

There may be more than one file. We know that the files are all located under the scores subdirectory, and that they have a .json suffix. We could use os.listdir on the directory, filtering (perhaps with a list comprehension) for those filenames ending with .json. We can also use the glob (https://docs.python.org/3/library/glob.html#glob.glob) module, and more specifically the glob.glob function, which takes a Unix-style filename pattern with (among others) * and ? characters, and returns a list of those filenames matching the pattern. By invoking glob.glob('scores/*.json'), we get all of the files ending in .json within the scores directory. We can then iterate over that list, assigning the current filename (a string) to filename.
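Here is a small sketch comparing the two approaches. So that it runs anywhere, it creates a temporary directory with a few empty files; the filenames are invented for the demonstration. The results are sorted because neither os.listdir nor glob.glob promises a particular order.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Create a few empty files; the names are made up for this demo
    for name in ['9a.json', '9b.json', 'notes.txt']:
        open(os.path.join(dirname, name), 'w').close()

    # Option 1: os.listdir plus filtering in a comprehension
    via_listdir = sorted(os.path.join(dirname, one_filename)
                         for one_filename in os.listdir(dirname)
                         if one_filename.endswith('.json'))

    # Option 2: glob.glob with a filename pattern
    via_glob = sorted(glob.glob(os.path.join(dirname, '*.json')))

print(via_listdir == via_glob)
print([os.path.basename(one_filename) for one_filename in via_glob])
```

Note that glob.glob returns paths that include the directory, whereas os.listdir returns bare filenames, which is why the first option needs os.path.join.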

Next, we create a new entry in our scores dictionary, which is where we store the scores. This is a dictionary of dictionaries, in which the first-level keys are the names of the files (and thus of the classes) from which we read the data. The second-level keys are the subjects; each value is a list of scores, from which we can then calculate the statistics we need. Once we define filename, we immediately add the filename as a key to scores, with a new, empty dictionary as the value.

Sometimes, you need to read each line of a file into Python, and then invoke json.loads to turn that line into data. In our case, the file contains a single JSON array, so we use json.load to read from the file object f, which turns the contents of the file into a Python list of dictionaries. Because json.load returns a list of dicts here, we can iterate over it. Each test result is placed in the result variable, which is a dictionary, in which the keys are the subjects and the values are the scores. Our goal is to reveal some statistics for each of the subjects in the class, which means that although the input file reports scores on a per-student basis, our report ignores the students in favor of the subjects.
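The difference between the two styles of file can be sketched like this. Both “files” here are in-memory io.StringIO objects with made-up contents, standing in for real files on disk.

```python
import io
import json

# Case 1: the whole file is one JSON array, so json.load reads it in one go
array_file = io.StringIO('[{"math": 90}, {"math": 65}]')
from_array = json.load(array_file)

# Case 2: each line is its own JSON document, so we parse line by line
lines_file = io.StringIO('{"math": 90}\n{"math": 65}\n')
from_lines = [json.loads(one_line) for one_line in lines_file]

print(from_array == from_lines)
```

Either way, we end up with the same Python list of dicts; which function to use depends on how the file was written.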

Given that result is a dict, we can iterate over its key-value pairs with result.items(), using parallel assignment to iterate over the key and value (here called subject and score). Now, we don’t know in advance what subjects are in our file, nor do we know how many tests might happen. It’s easiest for us to store our scores in a list. This means that our scores dict has one top-level key for each filename, and one second-level key for each subject. The second-level value is a list, to which we append with each iteration through the JSON-parsed list.

We want to add our score to the list stored at:

scores[filename][subject]

Before we can do that, we need to make sure that the list exists. One easy way to do this is with dict.setdefault, which assigns a key-value pair to a dictionary, but only if the key doesn’t already exist. d.setdefault(k, v) is the same as saying:

if k not in d:
    d[k] = v

We use dict.setdefault (https://docs.python.org/3/library/stdtypes.html?#dict.setdefault) to create the list if it doesn’t yet exist, and then in the next line, we add the score to the list for this subject, in this class.
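Here is the setdefault pattern on its own, using a shortened, made-up list of results. For comparison, the sketch also shows collections.defaultdict, which is a common alternative and not what the solution above uses.

```python
from collections import defaultdict

results = [{"math": 90, "science": 97}, {"math": 65, "science": 85}]

# With dict.setdefault, as in the solution
by_subject = {}
for result in results:
    for subject, score in result.items():
        by_subject.setdefault(subject, [])
        by_subject[subject].append(score)

# With defaultdict(list), a missing key gets an empty list automatically
by_subject_dd = defaultdict(list)
for result in results:
    for subject, score in result.items():
        by_subject_dd[subject].append(score)

print(by_subject)
print(dict(by_subject_dd) == by_subject)
```

Both loops build the same structure. Which one you choose is largely a matter of taste; setdefault keeps everything in a plain dict, and defaultdict saves a line inside the loop.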

When we complete our initial for loop, we have all of the scores for each class. We can now iterate over each class, printing the name of the class.

Then, we iterate over each subject for the class. We once again use the method dict.items to return key-value pairs, in this case calling them subject (for the name of the subject) and subject_scores (for the list of scores for that subject).

Now, we use an f-string to produce some output, using the built-in min (https://docs.python.org/3/library/functions.html?#min) and max (https://docs.python.org/3/library/functions.html?#max) functions, and then combining sum (https://docs.python.org/3/library/functions.html?#sum) and len to get the average score.
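For a single subject, the calculation looks like the following sketch, which uses the science scores from the 9a.json sample shown earlier. The call to statistics.mean at the end is an alternative from the standard library, not part of the solution.

```python
import statistics

subject = 'science'
subject_scores = [97, 85, 75, 85, 90]   # science scores from 9a.json

min_score = min(subject_scores)
max_score = max(subject_scores)
average_score = sum(subject_scores) / len(subject_scores)

print(f"\t{subject}: min {min_score}, max {max_score}, average {average_score}")

# statistics.mean is another way to compute the average
print(statistics.mean(subject_scores))
```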

Although this program read from a file containing JSON and then produced output on the user’s screen, it could as easily have read from a network connection containing JSON, and/or written to a file or socket in JSON format. As long as we use built-in and standard Python data structures, the json module can take our data and turn it into JSON.
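To illustrate the point about built-in data structures, this sketch turns a small, made-up summary dict into JSON, and then shows what happens with a type that has no JSON equivalent. The summary layout is only an example, not the format used by the exercise.

```python
import json

summary = {"scores/9a.json": {"math": {"min": 65, "max": 100, "average": 85.0}}}

# dicts, strings, ints, and floats all have JSON equivalents
print(json.dumps(summary, indent=2))

# A set has no JSON equivalent, so json.dumps raises TypeError
try:
    json.dumps({"subjects": {"math", "science"}})
    set_is_serializable = True
except TypeError as e:
    set_is_serializable = False
    print(f"Cannot serialize: {e}")
```

Converting the set to a list before calling json.dumps would avoid the error.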

Beyond the exercise

Here are some more tasks you can try that use JSON:

  1. Convert /etc/passwd from a CSV-style file into a JSON-formatted file. The JSON contains the equivalent of a list of Python tuples, with each tuple representing one line from the file.
  2. For a slightly different challenge, turn each line in the file into a Python dict. This requires identifying each field with a unique column or key name; if you’re not sure what each field in /etc/passwd does, then you can give it an arbitrary name.
  3. Ask the user for the name of a directory. Iterate through each file in that directory (ignoring subdirectories), getting (via os.stat) the size of the file and when it was last modified. Create a JSON-formatted file on disk listing each filename, size, and modification timestamp. Then read the file back in, and identify which files were modified most and least recently in that directory, and which files are largest and smallest in that directory.

Reverse lines

Given a text file, create a new file whose contents are identical, except that the characters on each line have been reversed. If a file looks like:

abc def
ghi jkl

then the output file is:

fed cba
lkj ihg

Notice that the newline remains at the end of the string, but the rest of the characters are all reversed.

Solution

with open(infilename) as infile, open(outfilename, 'w') as outfile:
    for one_line in infile:
        outfile.write(f"{one_line.rstrip()[::-1]}\n") ❶

❶ str.rstrip removes all whitespace from the right side of a string

Discussion

This solution depends not only on the fact that we can iterate over a file one line at a time, but also that we can work with more than one object in a with statement. Remember that with takes one or more objects, and allows us to assign variables to those objects. I particularly like the fact that when I want to read from one file and write to another, I can use with to open one for reading, open a second for writing, and then do what I’ve shown here.

I read through each line of the input file, then reverse the line using Python’s slice syntax. Remember that s[::-1] means that we want all of the elements of s, from the start to the end, but with a step size of -1, which returns a reversed version of the string. But before we can reverse the string, we want to remove the newline character which is the final character in the string. We run str.rstrip() on the current line, and then we reverse it. We then write it to the output file, adding a newline character so that the next line starts on a new line.
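One detail worth seeing for yourself: str.rstrip with no argument removes all trailing whitespace, not only the newline. The sketch below uses a made-up line with two trailing spaces to show the difference between rstrip() and rstrip('\n').

```python
one_line = "abc def  \n"   # note the two trailing spaces before the newline

# rstrip() with no argument strips the newline *and* the trailing spaces
print(repr(one_line.rstrip()[::-1]))

# rstrip('\n') strips only the newline, so the spaces survive the reversal
print(repr(one_line.rstrip('\n')[::-1]))
```

If the lines in your file might end with spaces or tabs that you want to preserve, then rstrip('\n') may be the better choice.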

The use of with guarantees that both files are closed when the block ends. When we close a file that we opened for writing, it’s automatically flushed, which means that we don’t need to worry about whether the data has been saved to disk.
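Here is a self-contained way to try the solution without touching any of your own files. It assumes nothing beyond the standard library: the input file is created in a temporary directory, and the names input.txt and output.txt are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    # Hypothetical filenames, used only for this demonstration
    infilename = os.path.join(dirname, 'input.txt')
    outfilename = os.path.join(dirname, 'output.txt')

    with open(infilename, 'w') as f:
        f.write("abc def\nghi jkl\n")

    with open(infilename) as infile, open(outfilename, 'w') as outfile:
        for one_line in infile:
            outfile.write(f"{one_line.rstrip()[::-1]}\n")

    # Both files are closed (and the output flushed) at this point
    print(infile.closed, outfile.closed)

    with open(outfilename) as f:
        reversed_text = f.read()

print(reversed_text, end='')
```

Because the output file has been closed by the time we reopen it for reading, we can be sure that its contents have been written.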

I should note that people often ask me how to read from and write to the same file. Python supports that, with the r+ mode, but I find that this opens the door to many potential problems, because of the chance that you’ll overwrite the wrong character, and mess up the format of the file you’re editing. I suggest that people use this sort of read-from-one, write-to-the-other code, which has roughly the same effect, without the potential danger of messing up the input file.

Beyond the exercise

Here are some more exercise ideas for translating files from one format to another using with and this kind of technique:

  1. “Encrypt” a text file by turning all of its characters into their numeric equivalents (with the built-in ord function), and writing that file to disk. Now “decrypt” the file (using the built-in chr function), turning the numbers back into their original characters.
  2. Given an existing text file, create two new text files. The new files each contain the same number of lines as the input file. In one, you’ll write all of the vowels (a, e, i, o, and u) from the input file. In the other output file, you’ll write all of the consonants. (You can ignore punctuation and whitespace.)
  3. The final field in /etc/passwd is the “shell,” the Unix command interpreter which is invoked when a user logs in. Create a file containing one line per shell, in which the shell’s name is written, followed by all of the usernames that use the shell. For example:
/bin/bash:root, jci, user, reuven, atara
/bin/sh:spamd, gitlab

Conclusion

It’s almost impossible to imagine writing programs without using files. And although there are many different types of files, Python is well-suited for working with text files, such as logfiles and configuration files, as well as those formatted in such standard ways as JSON and CSV.

It’s important to remember a few things when working with files:

  1. You typically open files for either reading or writing
  2. You can (and should) iterate over files one line at a time, rather than reading the whole thing into memory at once
  3. Using with when opening a file for writing ensures that the file is flushed and closed
  4. The csv module makes it easy to read from and write to CSV files
  5. The json module’s dump and load functions (and their string-based counterparts, dumps and loads) allow us to move between Python data structures and JSON-formatted data
  6. Reading from files into built-in Python data types is a common and powerful technique

That’s all for this article.

If you want to learn more about the book, check it out on our browser-based liveBook reader here.
