Basic Text Processing in Functional Style
From Haskell in Depth by Vitaly Bragilevsky
___________________________________________________________________
Take 37% off Haskell in Depth. Just enter fccbragilevsky into the discount code box at checkout at manning.com.
___________________________________________________________________
This article explores text processing in the functional programming style.
Suppose you want to write a program that analyzes the vocabulary of a given text. There exist many sophisticated methods for such an analysis, but you want to be quick. The task is quite basic. Namely, you should:
- Extract all words from the given text file and print them.
- Count the size of the vocabulary.
- Find the most frequently used words.
Such a program could be part of a larger social media text analyzer, which could be used to mine various pieces of information (ranging from level of education or social position to the risk of financial default) by analyzing the texts people post on their social media pages.
Or, more likely, you’ve gotten up in the middle of the night with a desire to explore the size of Shakespeare’s vocabulary. How many different words did Shakespeare use in Hamlet?
First attempt: Exploring the problem in a REPL with Strings
Let’s start by exploring Shakespeare’s vocabulary. Suppose you have a file, texts/hamlet.txt, with the full text of Hamlet. How many different words are there? You can use the GHCi REPL (read-eval-print loop) to find out without writing a program. You only need to import two modules for processing lists and characters, read the file into a String, and then manipulate its content:
ghci> :m + Data.List Data.Char
ghci> text <- readFile "texts/hamlet.txt"
ghci> ws = map head $ group $ sort $ map (map toLower) $ words text
ghci> take 7 ws
["&","'em?","'gainst","'tane","'tis","'tis,","'twas"]
The idea is to split the file’s content into a list of words, make all of them lowercase, sort them, and then remove repetitions by grouping the equal words and taking the first word in each group. The initial result shows that we’ve forgotten about leading (and trailing) punctuation, and we need to do some cleanup:
ghci> ws1 = map (takeWhile isLetter . dropWhile (not . isLetter)) $ words text
ghci> ws = map head $ group $ sort $ filter (not . null) $ map (map toLower) ws1
ghci> take 7 ws
["a","abhominably","abhorred","abilitie","aboord","aboue","about"] ghci> length ws
4633
This is better, although looking at the other words in the resulting list, ws, reveals that this isn’t a correct way to clean words: for example, it’s removed all hyphenated words (due to takeWhile isLetter).
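A quick fix, while we’re still working with Strings, is to trim non-letter characters only from both ends of each word instead of cutting the word at the first non-letter. Here’s a minimal sketch of that idea in GHCi, using dropWhileEnd from Data.List (this is my own illustration, not the book’s code):
ghci> cleanWord = dropWhileEnd (not . isLetter) . dropWhile (not . isLetter)
ghci> cleanWord "a-crosse,"
"a-crosse"
This keeps internal hyphens and apostrophes intact (the earlier pipeline would have cut "a-crosse," down to just "a"). We’ll meet the Text-based equivalent of this trimming, dropAround, shortly.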
NOTE
I’m using the ghci> prefix for code executed in the GHCi REPL. You can get the same prompt by executing the following command:
$ ghci
GHCi, version 8.2.2: http://www.haskell.org/ghc/ :? for help
Prelude> :set prompt "ghci> "
ghci>
Alternatively, you can tweak your .ghci file to make this change permanent. The GHC User Guide gives you more details about the location of the .ghci file: you can write any GHCi command there; for example, defining values, importing modules, or executing some Haskell code.
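For example, a .ghci file along these lines (the exact contents are up to you) sets the prompt and imports the modules used in this article:
:set prompt "ghci> "
import Data.List
import Data.Char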
Moving away from REPL and Strings
The REPL approach to solving problems quickly gets clumsy; let’s try to write a complete program to solve the same problem — extracting a vocabulary (a list of used words) from the given text file and computing the size of the vocabulary.
In this attempt we’ll also switch to the Data.Text datatype, which is much more suitable for handling textual data in terms of both performance and convenience. The resulting program is presented in the following listing.
Listing 1. Extracting a vocabulary: the second attempt (vocab1.hs)
import Data.Char
import Data.List
import qualified Data.Text as T ①
import qualified Data.Text.IO as TIO ①
import System.Environment ②
main = do
  [fname] <- getArgs ③
  text <- TIO.readFile fname ④
  let ws = map head $ group $ sort $ map T.toCaseFold $ filter (not . T.null)
             $ map (T.dropAround $ not . isLetter) $ T.words text ⑤
  TIO.putStrLn $ T.unwords ws ⑥
  print $ length ws
① Import the modules for working with Text
② Import the module for reading command-line arguments
③ Read the command-line arguments into a list of Strings
④ Read the file content into a Text value
⑤ Process the text into a list of words
⑥ Print all the words, delimited by spaces
Now you can read the filename from the command line without worrying too much about incorrect user input. This is done in the first line of the main function. The modules Data.Text and Data.Text.IO are usually imported with qualifiers to avoid name clashes with Prelude (the module which is imported by default). They come with the text package, which should be installed prior to compiling this program (although it may be installed already, depending on your GHC distribution).
The module Data.Text provides many functions analogous to the list functions from Prelude and Data.List, but it also adds specific text-processing functions that were used in listing 1. For example:
toCaseFold :: Text -> Text
dropAround :: (Char -> Bool) -> Text -> Text
The function toCaseFold converts the whole Text value to folded case, and does it significantly faster (and in a more correct way) than mapping with toLower over every character. The function dropAround removes leading and trailing characters that satisfy the given predicate (a function of type Char → Bool).
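For example, continuing the GHCi session (with isLetter from Data.Char in scope):
ghci> import qualified Data.Text as T
ghci> T.toCaseFold (T.pack "Straße")
"strasse"
ghci> T.dropAround (not . isLetter) (T.pack "'tis,")
"tis"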
TIP
You can use the website https://hoogle.haskell.org to find functions by name or, even more usefully, by type annotation: for example, searching for “(Char → Bool) → Text → Text” gives you the dropAround function I’ve described, together with other functions for cleaning texts. You can also Google what you need, but adding the “hackage” or “stackoverflow haskell” keywords to your search query usually leads to much more relevant results.
Give it a try!
Running the program from listing 1 on the text of Shakespeare’s Hamlet results in something like this (with the part in the middle stripped away):
$ vocab1 texts/hamlet.txt
a a'th a-crosse a-downe a-downe-a a-dreames a-foot a-sleepe a-while a-worke
...
4827
We won’t attempt to make the cleaning results better, because the main goal of this chapter is to discuss structuring functional programs. In fact, breaking text into words can’t be done reliably without diving deep into the Unicode rules on text boundary positions. If you’re interested, look at the documentation for the module Data.Text.ICU from the text-icu package; you’ll find fascinating details there on what a word is. Even then, you need to add some sort of semantic analysis to come up with a bulletproof solution that works beyond the English language. Don’t forget to check your own solution with several text files.
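To see one way the naive splitting falls short: T.words only splits on white space, and dropAround only trims the ends, so a token joined by internal punctuation survives as a single “word”:
ghci> T.words (T.pack "this and/or that")
["this","and/or","that"]
ghci> T.dropAround (not . isLetter) (T.pack "and/or")
"and/or"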
Designing a functional program as a set of IO actions
The sort of programming presented in the previous subsections resembles scripting more than actual programming. In Haskell we tend to express our ideas in terms of types and functions first and then proceed with implementing them. Remember that our task is to explore the vocabulary of a given text file, but let’s be more specific here: from now on we’ll regard a vocabulary as consisting of entries with words and numbers of occurrences (they can be used later for determining the most frequent words, for example). Now that we’ve agreed on the types of an entry (a word and its number of occurrences) and a vocabulary (a list of entries), we are ready to write a type-based outline for the program:
type Entry = (T.Text, Int) -- one entry
type Vocabulary = [Entry] -- a list of entries
extractVocab :: T.Text -> Vocabulary
printAllWords :: Vocabulary -> IO ()
processTextFile :: FilePath -> IO ()
main :: IO ()
This outline clearly shows that our plan is to read and process the text file (processTextFile) by:
- Extracting a vocabulary from the file’s content
- Using the vocabulary to print all words
But there’s more than that in this outline: we plan to read and process command-line arguments (the name of the file, which is a variable of type FilePath) in the main function, and, if this is done correctly, proceed with processing the text file (using processTextFile). This function reads the content of the given file (this is the second component of the user input in this program, after the command-line arguments) into a variable of type Text, extracts the vocabulary (with the extractVocab function), and finally prints it (with printAllWords).
Note the function extractVocab: it’s the only pure function in this program. We can see that from its type, T.Text → Vocabulary—there’s no IO there.
You can see this scenario visualized in the flowchart depicted in figure 1. I’ve tried to present all the components of the program there: user input, actions in the I/O part of the program, and their relations with the pure functions. Arrows in this flowchart represent moving data between the user and the program, and between functions within the program.
The function extractVocab here does exactly what we did before in the let expression inside main: it takes a Text and returns a Vocabulary:
extractVocab :: T.Text -> Vocabulary
extractVocab t = map buildEntry $ group $ sort ws
  where
    ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t
    buildEntry ws@(w:_) = (w, length ws)
    cleanWord = T.dropAround (not . isLetter)
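For example, loading these definitions into GHCi and running extractVocab on a tiny input shows how repeated words turn into counts:
ghci> extractVocab (T.pack "The cat and the dog")
[("and",1),("cat",1),("dog",1),("the",2)]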
Once we have a Vocabulary, we can print it as follows:
printAllWords :: Vocabulary -> IO ()
printAllWords vocab = do
  putStrLn "All words: "
  TIO.putStrLn $ T.unlines $ map fst vocab
I prefer to have a separate function for file processing to avoid many lines of code in the main function, where I’m going to read command-line arguments and check them for correctness:
processTextFile :: FilePath -> IO ()
processTextFile fname = do
  text <- TIO.readFile fname
  let vocab = extractVocab text
  printAllWords vocab

main = do
  args <- getArgs
  case args of
    [fname] -> processTextFile fname
    _ -> putStrLn "Usage: vocab-builder filename"
This program structure is flexible enough to accommodate several task changes. For example, it’s easy to print the total number of words in the text file and find the most frequently used words. These new goals can be expressed in types as follows:
printWordsCount :: Vocabulary -> IO ()
printFrequentWords :: Vocabulary -> Int -> IO ()
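For example, printWordsCount could be implemented in this style as follows (a sketch of one possible implementation: summing the per-entry counts gives the total number of word occurrences in the text):
printWordsCount :: Vocabulary -> IO ()
printWordsCount vocab = do
  putStrLn "Total number of words:"
  print $ sum $ map snd vocab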
Unfortunately, there’s a problem with this approach: we tend to stick with IO, and almost every function in the program is an I/O action. Next, I’ll show you how to do the same task in a completely different way.
Embracing pure functions
The problem with functions like printAllWords, printWordsCount, or printFrequentWords is that they are too tightly and unnecessarily coupled with I/O. Even their own names suggest that the same functionality can be achieved by combining the impure print with pure computations.
It’s generally agreed within the Haskell community that we can get the most out of the advantages of functional programming when we use pure functions as much as possible: they’re easier to combine with other functions, they can’t break something in other parts of the program, and we can even reason about their correctness.
Let’s try to follow the path of purity as we solve our main problem in this section — extracting a vocabulary — to see how pure functions allow us to extend the functionality of the program easily.
Putting I/O apart with pure functions
Remember that your initial goal was to write a program that was able to:
- Extract the vocabulary of a text file.
- Print all words used in the text.
- Count and print the number of different words.
- Find the specified number of the most frequently used words.
For example, when running it on Shakespeare’s Hamlet you’d like to get something like the following output:
All words:
a
...
zone

Total number of words: 29575

Frequent words:
the - 993
and - 862
to - 683
of - 610
i - 547
The task is more ambitious than what we’ve programmed so far, but we can handle it easily thanks to the clear separation between the pure and impure parts of the program. This is the new outline, now with a richer set of pure functions:
extractVocab :: T.Text -> Vocabulary
allWordsReport :: Vocabulary -> T.Text
wordsCountReport :: Vocabulary -> T.Text
frequentWordsReport :: Vocabulary -> Int -> T.Text
processTextFile :: FilePath -> IO ()
main :: IO ()
Every *Report function here is a function that takes a Vocabulary (with additional arguments if needed) and returns the resulting text in the form of T.Text, ready for printing. We’ll use these functions in processTextFile. Preparing these reports can be done with two auxiliary pure functions:
wordsCount :: Vocabulary -> Int
wordsByFrequency :: Vocabulary -> Vocabulary
The functions in the I/O part of the program (IO actions) now become almost trivial: the goal of processTextFile is reading the text file, calling extractVocab to extract the vocabulary, and printing the results of all the *Report functions. The complete flowchart is depicted in figure 2.
Nothing new happens when preparing the reports on the list of words used and their counts, but dealing with frequently occurring words is more interesting. Let’s discuss that now.
Computing the most frequent words by sorting them
Note that I’ve deliberately split the job between two functions, wordsByFrequency and frequentWordsReport:
wordsByFrequency :: Vocabulary -> Vocabulary
frequentWordsReport :: Vocabulary -> Int -> T.Text
Is this useful in practice? The function wordsByFrequency can be used later—its goal is to sort the vocabulary by number of occurrences of each word—but the goal of frequentWordsReport is only to prepare the report for printing. It isn’t a good idea to add the Int argument to wordsByFrequency, because this greatly limits its potential for reuse.
Let’s move on to the implementation of computing the most frequent words. First, you need to sort the vocabulary entries by descending number of occurrences. This can be achieved using the sortBy function from the Data.List module, with the following type:
sortBy :: (a -> a -> Ordering) -> [a] -> [a]
This function requires a comparison function returning the Ordering datatype (whose values can be LT, EQ, or GT, meaning less than, equal, or greater than, respectively). The function sortBy sorts elements of the list in ascending order by default. To get the required descending order, you can construct a comparison function by using the comparing function from the Data.Ord module over the values, wrapped with the type Down. Wrapping with Down reverses the results of comparing two values of the given type, giving you a sort in descending order instead of the default ascending sortBy behavior:
wordsByFrequency :: Vocabulary -> Vocabulary
wordsByFrequency = sortBy (comparing $ Down . snd)
Note also that applying the snd function to each vocabulary entry enables you to compare numbers of occurrences instead of words.
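A quick GHCi check (with Data.List and Data.Ord imported) illustrates the descending order on a made-up vocabulary fragment; sortOn (Down . snd) would give the same result here:
ghci> sortBy (comparing $ Down . snd) [("the",993),("to",683),("and",862)]
[("the",993),("and",862),("to",683)]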
I recommend reading Roman Cheplyaka’s blog post on descending sort in Haskell to see some of the difficulties behind this seemingly easy task: https://ro-che.info/articles/2016-04-02-descending-sort-haskell. This post explores the behavior of sortBy and compares it with other sorting functions, such as sortOn and sortWith.
Now that you can sort the vocabulary entries as required, you can move on to preparing the corresponding report. One problem here is how to combine string literals with T.Text values. Instead of looking for specific functions to do that, you can teach the compiler to regard string literals as values of type T.Text, using the OverloadedStrings GHC extension. To enable this extension in the source code, start your file with the LANGUAGE pragma:
{-# LANGUAGE OverloadedStrings #-}
Now you can use string literals directly as values of type Text without converting them:
frequentWordsReport :: Vocabulary -> Int -> T.Text
frequentWordsReport vocab n = T.append "\nFrequent words:\n"
  $ T.unlines $ map showEntry $ take n
  $ wordsByFrequency vocab
  where
    showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
The function T.pack is used here to convert a value (not a literal) of type String to type T.Text.
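In GHCi, the conversion looks like this (T.unpack goes in the opposite direction, from T.Text back to String):
ghci> T.pack (" - " ++ show 993)
" - 993"
ghci> T.unpack (T.pack "hello")
"hello"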
The use of compiler extensions in other programming languages is somewhat controversial. In C or C++, for example, there’s the standard language definition, which is implemented by many different compilers, and there are specific compiler extensions. In Python or Ruby there are no standards, and that means that there are no extensions: there’s only one compiler, and you can legitimately use only what it implements.
In Haskell the situation is different: there’s a standard definition (the Haskell 2010 Report is the current version) and only one widely used compiler (the Glasgow Haskell Compiler), which implements many extensions beyond what’s defined in the Haskell Report. It’s generally considered good practice to use these extensions, because this gives the community the experience needed to decide whether to include them in the next standard or, in some cases, to exclude them from the compiler itself.
Exploring vocabulary: The overall solution
The overall solution is presented in the following listing. It features a LANGUAGE pragma in the first line that enables the GHC extension OverloadedStrings in the source code; every other aspect of this code has already been discussed.
Listing 2. Analyzing words in a text file with a clearly divided pure part and I/O part (vocab3.hs)
{-# LANGUAGE OverloadedStrings #-} ①
import Data.Char ②
import Data.List
import Data.Ord
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Environment
type Entry = (T.Text, Int) ③
type Vocabulary = [Entry]
extractVocab :: T.Text -> Vocabulary ④
extractVocab t = map buildEntry $ group $ sort ws ④
  where
    ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t ④
    buildEntry ws@(w:_) = (w, length ws)
    cleanWord = T.dropAround (not . isLetter)
allWordsReport :: Vocabulary -> T.Text ④
allWordsReport vocab = T.append "\nAll words:\n" ④
  $ T.unlines $ map fst vocab
wordsCount :: Vocabulary -> Int ④
wordsCount vocab = sum $ map snd vocab ④
wordsCountReport :: Vocabulary -> T.Text ④
wordsCountReport vocab = T.append "\nTotal number of words: " ④
  $ T.pack $ show $ wordsCount vocab
wordsByFrequency :: Vocabulary -> Vocabulary ④
wordsByFrequency = sortBy (comparing $ Down . snd) ④
frequentWordsReport :: Vocabulary -> Int -> T.Text ④
frequentWordsReport vocab n = T.append "\nFrequent words:\n"
  $ T.unlines $ map showEntry $ take n
  $ wordsByFrequency vocab
  where
    showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
processTextFile :: FilePath -> Int -> IO () ⑤
processTextFile fname n = do
  text <- TIO.readFile fname
  let vocab = extractVocab text
  TIO.putStrLn $ allWordsReport vocab
  TIO.putStrLn $ wordsCountReport vocab
  TIO.putStrLn $ frequentWordsReport vocab n
main :: IO ()
main = do
  args <- getArgs
  case args of
    [fname, n] -> processTextFile fname (read n)
    _ -> putStrLn "Usage: vocab-builder filename number_of_frequent_words"
① Enable the GHC extension OverloadedStrings, which allows treating string literals as Text values
② Import the modules required in this program
③ Define type synonyms to make type annotations for functions easier to read and understand
④ The group of pure functions that do almost all the work in the program
⑤ The I/O action that brings everything together
Preparing reports with this program isn’t particularly impressive: we build them trivially by appending strings.
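To try the complete program, compile it and run it on a text file, passing the number of frequent words to report (the file names here are just an example):
$ ghc vocab3.hs
$ vocab3 texts/hamlet.txt 5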
That’s all for this article. If you want to learn more about the book, check it out on Manning’s liveBook platform.
About the author:
Since 2008, Vitaly Bragilevsky has been teaching Haskell and functional programming to undergraduate students at the Southern Federal University located in Rostov-on-Don, Russia. He is a member of the Haskell 2020 Committee, and has worked on the source code of the Glasgow Haskell Compiler (GHC) and the Idris compiler, both of which are implemented in Haskell.
Originally published at freecontent.manning.com.