Basic Text Processing in Functional Style
From Haskell in Depth by Vitaly Bragilevsky
___________________________________________________________________
Take 37% off Haskell in Depth. Just enter fccbragilevsky into the discount code box at checkout at manning.com.
___________________________________________________________________
This article explores text processing in the functional programming style.
Suppose you want to write a program that analyzes the vocabulary of a given text. There exist many sophisticated methods for such an analysis, but you want to be quick. The task is quite basic. Namely, you should:
- Extract all words from the given text file and print them.
- Count the size of the vocabulary.
- Find the most frequently used words.
Such a program could be part of a larger social media text analyzer, which could be used to mine various pieces of information (ranging from level of education or social position to the risk of financial default) by analyzing the texts people post on their social media pages.
Or, more likely, you’ve gotten up in the middle of the night with a desire to explore the size of Shakespeare’s vocabulary. How many different words did Shakespeare use in Hamlet?
First attempt: Exploring the problem in a REPL with Strings
Let’s start by exploring Shakespeare’s vocabulary. Suppose you have a file, texts/hamlet.txt, with the full text of Hamlet. How many different words are there? You can use the GHCi REPL (read-eval-print loop) to find out without writing a program. You only need to import two modules for processing lists and characters, read the file into a String, and then manipulate its content:
ghci> :m + Data.List Data.Char
ghci> text <- readFile "texts/hamlet.txt"
ghci> ws = map head $ group $ sort $ map (map toLower) $ words text
ghci> take 7 ws
["&","'em?","'gainst","'tane","'tis","'tis,","'twas"]
The idea is to split the file’s content into a list of words, make all of them lowercase, sort them, and then remove repetitions by grouping the equal words and taking the first word in each group. The initial result shows that we’ve forgotten about leading (and trailing) punctuation, and we need to do some cleanup:
ghci> ws1 = map (takeWhile isLetter . dropWhile (not . isLetter)) $ words text
ghci> ws = map head $ group $ sort $ filter (not . null) $ map (map toLower) ws1
ghci> take 7 ws
["a","abhominably","abhorred","abilitie","aboord","aboue","about"] ghci> length ws
4633
This is better, although looking at the other words in the resulting list, ws, reveals that this isn’t a correct way to clean words: for example, it’s removed all hyphenated words (due to takeWhile isLetter).
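A quick fix, while we’re still working with Strings, is to trim non-letter characters only from both ends of each word instead of cutting the word at the first non-letter. Here’s a minimal sketch of that idea in GHCi, using dropWhileEnd from Data.List (this is my own illustration, not the book’s code):
ghci> cleanWord = dropWhileEnd (not . isLetter) . dropWhile (not . isLetter)
ghci> cleanWord "a-crosse,"
"a-crosse"
This keeps internal hyphens and apostrophes intact (the earlier pipeline would have cut "a-crosse," down to just "a"). We’ll meet the Text-based equivalent of this trimming, dropAround, shortly.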
NOTE
I’m using the ghci> prefix for code executed in the GHCi REPL. You can get the same prompt by executing the following command:
$ ghci
GHCi, version 8.2.2: http://www.haskell.org/ghc/ :? for help
Prelude> :set prompt "ghci> "
ghci>
Alternatively, you can tweak your .ghci file to make this change permanent. The GHC User Guide gives you more details about the location of the .ghci file: you can write any GHCi command there; for example, defining values, importing modules, or executing some Haskell code.
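For example, a .ghci file along these lines (the exact contents are up to you) sets the prompt and imports the modules used in this article:
:set prompt "ghci> "
import Data.List
import Data.Char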
Moving away from REPL and Strings
The REPL approach to solving problems quickly gets clumsy; let’s try to write a complete program to solve the same problem — extracting a vocabulary (a list of used words) from the given text file and computing the size of the vocabulary.
In this attempt we’ll also switch to the Data.Text datatype, which is much more suitable for handling textual data in terms of both performance and convenience. The resulting program is presented in the following listing.
Listing 1. Extracting a vocabulary: the second attempt (vocab1.hs)
import Data.Char
import Data.List
import qualified Data.Text as T ①
import qualified Data.Text.IO as TIO ①
import System.Environment ②
main = do
  [fname] <- getArgs ③
  text <- TIO.readFile fname ④
  let ws = map head $ group $ sort $ map T.toCaseFold $ filter (not . T.null)
             $ map (T.dropAround $ not . isLetter) $ T.words text ⑤
  TIO.putStrLn $ T.unwords ws ⑥
  print $ length ws
① Import the modules for working with Text
② Import the module for reading command-line arguments
③ Read the command-line arguments into a list of Strings
④ Read the file content into a Text value
⑤ Process the text into a list of words
⑥ Print all the words, delimited by spaces
Now you can read the filename from the command line without worrying too much about incorrect user input. This is done in the first line of the main function. The modules Data.Text and Data.Text.IO are usually imported with qualifiers to avoid name clashes with Prelude (the module which is imported by default). They come with the text package, which should be installed prior to compiling this program (although it may be installed already, depending on your GHC distribution).
The module Data.Text provides many functions analogous to the list functions from Prelude and Data.List, but it also adds specific text-processing functions that were used in listing 1. For example:
toCaseFold :: Text -> Text
dropAround :: (Char -> Bool) -> Text -> Text
The function toCaseFold converts the whole Text value to folded case, and does it significantly faster (and in a more correct way) than mapping with toLower over every character. The function dropAround removes leading and trailing characters that satisfy the given predicate (a function of type Char → Bool).
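For example, continuing the GHCi session (with isLetter from Data.Char in scope):
ghci> import qualified Data.Text as T
ghci> T.toCaseFold (T.pack "Straße")
"strasse"
ghci> T.dropAround (not . isLetter) (T.pack "'tis,")
"tis"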
TIP
You can use the website https://hoogle.haskell.org to find functions by name or, even more usefully, by type annotation: for example, searching for “(Char → Bool) → Text → Text” gives you the dropAround function I’ve described, together with other functions for cleaning texts. You can also Google what you need, but adding the “hackage” or “stackoverflow haskell” keywords to your search query usually leads to much more relevant results.
Give it a try!
Running the program from listing 1 on the text of Shakespeare’s Hamlet results in something like this (with the part in the middle stripped away):
$ vocab1 texts/hamlet.txt
a a'th a-crosse a-downe a-downe-a a-dreames a-foot a-sleepe a-while a-worke
...
4827
We won’t attempt to make the cleaning results better, because the main goal of this chapter is to discuss structuring functional programs. In fact, breaking text into words can’t be done reliably without diving deep into the Unicode rules on text boundary positions. If you’re interested, look at the documentation for the module Data.Text.ICU from the text-icu package; you’ll find fascinating details there on what a word is. Even then, you need to add some sort of semantic analysis to come up with a bulletproof solution that works beyond the English language. Don’t forget to check your own solution with several text files.
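To see one way the naive splitting falls short: T.words only splits on white space, and dropAround only trims the ends, so a token joined by internal punctuation survives as a single “word”:
ghci> T.words (T.pack "this and/or that")
["this","and/or","that"]
ghci> T.dropAround (not . isLetter) (T.pack "and/or")
"and/or"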
Designing a functional program as a set of IO actions
The sort of programming presented in the previous subsections resembles scripting more than actual programming. In Haskell we tend to express our ideas in terms of types and functions first and then proceed with implementing them. Remember that our task is to explore the vocabulary of a given text file, but let’s be more specific here: from now on we’ll regard a vocabulary as consisting of entries with words and numbers of occurrences (they can be used later for determining the most frequent words, for example). Now that we’ve agreed on the types of an entry (a word and its number of occurrences) and a vocabulary (a list of entries), we are ready to write a type-based outline for the program:
type Entry = (T.Text, Int) -- one entry
type Vocabulary = [Entry] -- a list of entries
extractVocab :: T.Text -> Vocabulary
printAllWords :: Vocabulary -> IO ()
processTextFile :: FilePath -> IO ()
main :: IO ()
This outline clearly shows that our plan is to read and process the text file (processTextFile) by:
- Extracting a vocabulary from the file’s content
- Using the vocabulary to print all words
But there’s more than that in this outline: we plan to read and process command-line arguments (the name of the file, which is a variable of type FilePath) in the main function, and, if this is done correctly, proceed with processing the text file (using processTextFile). This function reads the content of the given file (this is the second component of the user input in this program, after the command-line arguments) into a variable of type Text, extracts the vocabulary (with the extractVocab function), and finally prints it (with printAllWords).
Note the function extractVocab: it’s the only pure function in this program. We can see that from its type, T.Text → Vocabulary—there’s no IO there.
You can see this scenario visualized in the flowchart depicted in figure 1. I’ve tried to present all the components of the program there: user input, actions in the I/O part of the program, and their relations with the pure functions. Arrows in this flowchart represent moving data between the user and the program, and between functions within the program.
The function extractVocab here does exactly what we did before in the let expression inside main: it takes a Text and returns a Vocabulary:
extractVocab :: T.Text -> Vocabulary
extractVocab t = map buildEntry $ group $ sort ws
  where
    ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t
    buildEntry ws@(w:_) = (w, length ws)
    cleanWord = T.dropAround (not . isLetter)
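For example, loading these definitions into GHCi and running extractVocab on a tiny input shows how repeated words turn into counts:
ghci> extractVocab (T.pack "The cat and the dog")
[("and",1),("cat",1),("dog",1),("the",2)]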
Once we have a Vocabulary, we can print it as follows:
printAllWords :: Vocabulary -> IO ()
printAllWords vocab = do
  putStrLn "All words: "
  TIO.putStrLn $ T.unlines $ map fst vocab
I prefer to have a separate function for file processing to avoid many lines of code in the main function, where I’m going to read command-line arguments and check them for correctness:
processTextFile :: FilePath -> IO ()
processTextFile fname = do
  text <- TIO.readFile fname
  let vocab = extractVocab text
  printAllWords vocab

main = do
  args <- getArgs
  case args of
    [fname] -> processTextFile fname
    _ -> putStrLn "Usage: vocab-builder filename"
This program structure is flexible enough to accommodate several task changes. For example, it’s easy to print the total number of words in the text file and find the most frequently used words. These new goals can be expressed in types as follows:
printWordsCount :: Vocabulary -> IO ()
printFrequentWords :: Vocabulary -> Int -> IO ()
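For example, printWordsCount could be implemented in this style as follows (a sketch of one possible implementation: summing the per-entry counts gives the total number of word occurrences in the text):
printWordsCount :: Vocabulary -> IO ()
printWordsCount vocab = do
  putStrLn "Total number of words:"
  print $ sum $ map snd vocab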
Unfortunately, there’s a problem with this approach: we tend to stick with IO, and almost every function in the program is an I/O action. Next, I’ll show you how to do the same task in a completely different way.
Embracing pure functions
The problem with functions like printAllWords, printWordsCount, or printFrequentWords is that they are too tightly and unnecessarily coupled with I/O. Even their own names suggest that the same functionality can be achieved by combining the impure print with pure computations.
It’s generally agreed within the Haskell community that we can get the most out of the advantages of functional programming when we use pure functions as much as possible: they’re easier to combine with other functions, they can’t break something in other parts of the program, and we can even reason about their correctness.
Let’s try to follow the path of purity as we solve our main problem in this section — extracting a vocabulary — to see how pure functions allow us to extend the functionality of the program easily.
Putting I/O apart with pure functions
Remember that your initial goal was to write a program that was able to:
- Extract the vocabulary of a text file.
- Print all words used in the text.
- Count and print the number of different words.
- Find the specified number of the most frequently used words.
For example, when running it on Shakespeare’s Hamlet you’d like to get something like the following output:
All words:
a
...
zone

Total number of words: 29575

Frequent words:
the - 993
and - 862
to - 683
of - 610
i - 547
The task is more ambitious than what we’ve programmed so far, but we can handle it easily thanks to the clear separation between the pure and impure parts of the program. This is the new outline, now with a richer set of pure functions:
extractVocab :: T.Text -> Vocabulary
allWordsReport :: Vocabulary -> T.Text
wordsCountReport :: Vocabulary -> T.Text
frequentWordsReport :: Vocabulary -> Int -> T.Text
processTextFile :: FilePath -> IO ()
main :: IO ()
Every *Report function here is a function that takes a Vocabulary (with additional arguments if needed) and returns the resulting text in the form of T.Text, ready for printing. We’ll use these functions in processTextFile. Preparing these reports can be done with two auxiliary pure functions:
wordsCount :: Vocabulary -> Int
wordsByFrequency :: Vocabulary -> Vocabulary
The functions in the I/O part of the program (IO actions) now become almost trivial: the goal of processTextFile is reading the text file, calling extractVocab to extract the vocabulary, and printing the results of all the *Report functions. The complete flowchart is depicted in figure 2.
Nothing new happens when preparing the reports on the list of words used and their counts, but dealing with frequently occurring words is more interesting. Let’s discuss that now.
Computing the most frequent words by sorting them
Note that I’ve deliberately split the job between two functions, wordsByFrequency and frequentWordsReport:
wordsByFrequency :: Vocabulary -> Vocabulary
frequentWordsReport :: Vocabulary -> Int -> T.Text
Is this useful in practice? The function wordsByFrequency can be used later—its goal is to sort the vocabulary by number of occurrences of each word—but the goal of frequentWordsReport is only to prepare the report for printing. It isn’t a good idea to add the Int argument to wordsByFrequency, because this greatly limits its potential for reuse.
Let’s move on to the implementation of computing the most frequent words. First, you need to sort the vocabulary entries by descending number of occurrences. This can be achieved using the sortBy function from the Data.List module, with the following type:
sortBy :: (a -> a -> Ordering) -> [a] -> [a]
This function requires a comparison function returning the Ordering datatype (whose values can be LT, EQ, or GT, meaning less than, equal, or greater than, respectively). The function sortBy sorts elements of the list in ascending order by default. To get the required descending order, you can construct a comparison function by using the comparing function from the Data.Ord module over the values, wrapped with the type Down. Wrapping with Down reverses the results of comparing two values of the given type, giving you a sort in descending order instead of the default ascending sortBy behavior:
wordsByFrequency :: Vocabulary -> Vocabulary
wordsByFrequency = sortBy (comparing $ Down . snd)
Note also that applying the snd function to each vocabulary entry enables you to compare numbers of occurrences instead of words.
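A quick GHCi check (with Data.List and Data.Ord imported) illustrates the descending order on a made-up vocabulary fragment; sortOn (Down . snd) would give the same result here:
ghci> sortBy (comparing $ Down . snd) [("the",993),("to",683),("and",862)]
[("the",993),("and",862),("to",683)]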
I recommend reading Roman Cheplyaka’s blog post on descending sort in Haskell to see some of the difficulties behind this seemingly easy task: https://ro-che.info/articles/2016-04-02-descending-sort-haskell. This post explores the behavior of sortBy and compares it with other sorting functions, such as sortOn and sortWith.
Now that you can sort the vocabulary entries as required, you can move on to preparing the corresponding report. One problem here is how to combine string literals with T.Text values. Instead of looking for specific functions to do that, you can teach the compiler to regard string literals as values of type T.Text, using the OverloadedStrings GHC extension. To enable this extension in the source code, start your file with the LANGUAGE pragma:
{-# LANGUAGE OverloadedStrings #-}
Now you can use string literals directly as values of type Text without converting them:
frequentWordsReport :: Vocabulary -> Int -> T.Text
frequentWordsReport vocab n = T.append "\nFrequent words:\n"
  $ T.unlines $ map showEntry $ take n
  $ wordsByFrequency vocab
  where
    showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
The function T.pack is used here to convert a value (not a literal) of type String to type T.Text.
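In GHCi, the conversion looks like this (T.unpack goes in the opposite direction, from T.Text back to String):
ghci> T.pack (" - " ++ show 993)
" - 993"
ghci> T.unpack (T.pack "hello")
"hello"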
The use of compiler extensions in other programming languages is somewhat controversial. In C or C++, for example, there’s the standard language definition, which is implemented by many different compilers, and there are specific compiler extensions. In Python or Ruby there are no standards, and that means that there are no extensions: there’s only one compiler, and you can legitimately use only what it implements.
In Haskell the situation is different: there’s a standard definition (the Haskell 2010 Report is the current version) and only one widely used compiler (the Glasgow Haskell Compiler), which implements many extensions beyond what’s defined in the Haskell Report. It’s generally considered good practice to use these extensions, because this gives the community the experience needed to decide whether to include them in the next standard or, in some cases, to exclude them from the compiler itself.
Exploring vocabulary: The overall solution
The overall solution is presented in the following listing. It features a LANGUAGE pragma in the first line that enables the GHC extension OverloadedStrings in the source code; every other aspect of this code has already been discussed.
Listing 2. Analyzing words in a text file with a clearly divided pure part and I/O part (vocab3.hs)
{-# LANGUAGE OverloadedStrings #-} ①
import Data.Char ②
import Data.List
import Data.Ord
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Environment
type Entry = (T.Text, Int) ③
type Vocabulary = [Entry]
extractVocab :: T.Text -> Vocabulary ④
extractVocab t = map buildEntry $ group $ sort ws ④
  where
    ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t ④
    buildEntry ws@(w:_) = (w, length ws)
    cleanWord = T.dropAround (not . isLetter)
allWordsReport :: Vocabulary -> T.Text ④
allWordsReport vocab = T.append "\nAll words:\n" ④
  $ T.unlines $ map fst vocab
wordsCount :: Vocabulary -> Int ④
wordsCount vocab = sum $ map snd vocab ④
wordsCountReport :: Vocabulary -> T.Text ④
wordsCountReport vocab = T.append "\nTotal number of words: " ④
  $ T.pack $ show $ wordsCount vocab
wordsByFrequency :: Vocabulary -> Vocabulary ④
wordsByFrequency = sortBy (comparing $ Down . snd) ④
frequentWordsReport :: Vocabulary -> Int -> T.Text ④
frequentWordsReport vocab n = T.append "\nFrequent words:\n"
  $ T.unlines $ map showEntry $ take n
  $ wordsByFrequency vocab
  where
    showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
processTextFile :: FilePath -> Int -> IO () ⑤
processTextFile fname n = do
  text <- TIO.readFile fname
  let vocab = extractVocab text
  TIO.putStrLn $ allWordsReport vocab
  TIO.putStrLn $ wordsCountReport vocab
  TIO.putStrLn $ frequentWordsReport vocab n
main :: IO ()
main = do
  args <- getArgs
  case args of
    [fname, n] -> processTextFile fname (read n)
    _ -> putStrLn "Usage: vocab-builder filename number_of_frequent_words"
① Enable the GHC extension OverloadedStrings, which allows treating string literals as Text values
② Import the modules required in this program
③ Define type synonyms to make type annotations for functions easier to read and understand
④ The group of pure functions that do almost all the work in the program
⑤ The I/O action that brings everything together
Preparing reports with this program isn’t particularly impressive: we build them trivially by appending strings.
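To try the complete program, compile it and run it on a text file, passing the number of frequent words to report (the file names here are just an example):
$ ghc vocab3.hs
$ vocab3 texts/hamlet.txt 5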
That’s all for this article. If you want to learn more about the book, check it out on Manning’s liveBook platform.
About the author:
Since 2008, Vitaly Bragilevsky has been teaching Haskell and functional programming to undergraduate students at the Southern Federal University located in Rostov-on-Don, Russia. He is a member of the Haskell 2020 Committee, and has worked on the source code of the Glasgow Haskell Compiler (GHC) and the Idris compiler, both of which are implemented in Haskell.
Originally published at freecontent.manning.com.