Ingesting Data from Files with Spark, Part 4: TXT

From Spark in Action, 2nd Ed. by Jean Georges Perrin



This is the last in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a TXT file.

So far, we have ingested CSV in part 1, JSON in part 2, and XML in part 3. In this final part, we’re going to ingest data from a TXT (text) file.

Ingesting a text file

Text files are still around and, although they’re less popular in enterprise applications, you still come across a few at times. The growing popularity of deep learning and artificial intelligence also drives more NLP (Natural Language Processing) activities. In this section, you won’t do any NLP, only ingest text files. To learn more about NLP, you can refer to Manning’s Natural Language Processing in Action.

The task is to ingest Shakespeare’s Romeo and Juliet. Project Gutenberg hosts numerous books and resources in digital format.

Each line of the book becomes a record of our dataframe. The text is not split by sentence or by word. Listing 1 shows an excerpt of the file you’re going to work on.

Getting the files

You can download Romeo and Juliet from Project Gutenberg. For this example, I used Spark v2.2.0 on macOS v10.12.6 with Java 8. The dataset was downloaded in January 2018.

Listing 1 Excerpt from Project Gutenberg’s version of Romeo and Juliet

Desired output

Listing 2 shows the first five rows of Romeo and Juliet after it has been ingested by Spark and transformed into a dataframe.

Listing 2 Romeo and Juliet in a dataframe


Listing 3 is the Java code needed to turn Romeo and Juliet into a dataframe.

Listing 3 —

❶ specify “text” when you want to ingest a text file
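To make the result concrete without a Spark installation, here is a minimal plain-Java sketch (not the book’s listing) of what the text reader conceptually produces: one record per line, with a single string column, which Spark’s text source names `value`. In Spark itself, the read typically boils down to `spark.read().format("text").load(path)`. The class `TextLinesSketch`, its `Row` type, and the sample lines are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: mimics the "one line = one record" model in plain Java.
// It does not use Spark and is not the code from the book's listing.
class TextLinesSketch {

    // A record with a single string field, mirroring the single `value` column.
    static final class Row {
        final String value;

        Row(String value) {
            this.value = value;
        }
    }

    // Wraps each line in a Row; lines are neither split nor filtered.
    static List<Row> toRows(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            rows.add(new Row(line));
        }
        return rows;
    }

    // Reads a UTF-8 text file and returns one Row per line.
    static List<Row> readLines(Path file) {
        try {
            return toRows(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content, not an excerpt of the Project Gutenberg file.
        Path sample = Files.createTempFile("sample", ".txt");
        try {
            Files.write(sample,
                    Arrays.asList("First sample line", "", "Third sample line"),
                    StandardCharsets.UTF_8);
            for (Row row : readLines(sample)) {
                System.out.println("[" + row.value + "]");
            }
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

Running `main` prints each sample line in brackets, including the empty one, which shows that this sketch neither drops nor splits any line.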

Unlike with other formats, there are no options to set with text. It’s that easy!

We’ve come to the end of our series, and we hope that you’ve found it both enjoyable and informative. If you want to learn more about the book, check it out on liveBook and see the slide deck.

About the author:
An experienced consultant and entrepreneur passionate about all things data, Jean Georges Perrin was the first IBM Champion in France, an honor he’s now held for ten consecutive years. Jean Georges has managed many teams of software and data engineers.

