ARTICLE

Ingesting Data from Files with Spark, Part 4: TXT

Apr 22, 2019 (3 min read)

From Spark in Action, 2nd Ed. by Jean Georges Perrin

___________________________________________________________________

Save 37% off Spark in Action, 2nd Ed. Just enter code fccperrin into the discount code box at checkout at manning.com.
___________________________________________________________________

This is the last in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a TXT file.

So far, we have ingested CSV in part 1, JSON in part 2, and XML in part 3. In this final part, we’re going to ingest data from a TXT (text) file.

Ingesting a text file

Text files are still used out there and, although they’re less popular in enterprise applications, you still come across a few. The growing popularity of deep learning and artificial intelligence also drives more natural language processing (NLP) activities. In this section, you won’t do any NLP; you’ll only ingest text files. To learn more about NLP, you can refer to Manning’s Natural Language Processing in Action.

The task is to ingest Shakespeare’s Romeo and Juliet. Project Gutenberg (http://www.gutenberg.org) hosts numerous books and resources in digital format.

Each line of the book becomes a record of our dataframe. The text is not split by sentence or word; one line in the file is one record. Listing 1 shows an excerpt of the file you’re going to work on.

Getting the files: You can download Romeo and Juliet from http://www.gutenberg.org/cache/epub/1777/pg1777.txt. For this example, I used Spark v2.2.0 on macOS v10.12.6 with Java 8. The dataset was downloaded in January 2018.

Listing 1 Excerpt from Project Gutenberg’s version of Romeo and Juliet

This Etext file is presented by Project Gutenberg, in
cooperation with World Library, Inc., from their Library of the
Future and Shakespeare CDROMS.  Project Gutenberg often releases
Etexts that are NOT placed in the Public Domain!!
…
ACT I. Scene I.
Verona. A public place.
 
Enter Sampson and Gregory (with swords and bucklers) of the house
of Capulet.
  
  Samp. Gregory, on my word, we'll not carry coals.
  Greg. No, for then we should be colliers.
  Samp. I mean, an we be in choler, we'll draw.
  Greg. Ay, while you live, draw your neck out of collar.
  Samp. I strike quickly, being moved.
  Greg. But thou art not quickly moved to strike.
  Samp. A dog of the house of Montague moves me.
…
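
To make the one-line-per-record mapping concrete, here is a minimal plain-Java sketch that does not use Spark at all. It only imitates what the ingestion produces: one string per line, blank lines included, with no splitting by sentence or word. The class name LinePerRecordSketch and the method toRecords are hypothetical names chosen for this illustration.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

public class LinePerRecordSketch {

  // Turns raw text into one record per line. Blank lines are kept as
  // empty records; nothing is split by sentence or word.
  public static List<String> toRecords(String text) {
    return new BufferedReader(new StringReader(text))
        .lines()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // A few lines taken from the excerpt in Listing 1.
    String excerpt = "ACT I. Scene I.\n"
        + "Verona. A public place.\n"
        + "\n"
        + "Enter Sampson and Gregory (with swords and bucklers) of the house\n"
        + "of Capulet.";

    List<String> records = toRecords(excerpt);
    for (String value : records) {
      System.out.println("[" + value + "]");
    }
    System.out.println(records.size() + " records");
  }
}
```

The stage direction that spans two lines in the file ends up as two separate records, which is the behavior to keep in mind when you later process the dataframe.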

Desired output

Listing 2 shows the first five rows of Romeo and Juliet after it has been ingested by Spark and transformed into a dataframe.

Listing 2 Romeo and Juliet in a dataframe

+--------------------+
|               value|
+--------------------+
|                    |
|This Etext file i...|
|cooperation with ...|
|Future and Shakes...|
|Etexts that are N...|
+--------------------+
only showing top 5 rows
  
root
 |-- value: string (nullable = true)

Code

Listing 3 is the Java code needed to turn Romeo and Juliet into a dataframe.

Listing 3 — TextToDataframeApp.java

package net.jgp.books.sparkWithJava.ch07.lab_400.text_ingestion;
  
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
  
public class TextToDataframeApp {
  
  public static void main(String[] args) {
    TextToDataframeApp app = new TextToDataframeApp();
    app.start();
  }
  
  private void start() {
    SparkSession spark = SparkSession.builder()
        .appName("Text to Dataframe")
        .master("local")
        .getOrCreate();
  
    Dataset<Row> df = spark.read().format("text") // ❶
        .load("data/romeo-juliet-pg1777.txt");
  
    df.show(5);
    df.printSchema();
  }
}

❶ Specify “text” when you want to ingest a text file.
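
The values in Listing 2 end in "..." because show() shortens long strings by default (its documentation gives a 20-character limit). The plain-Java sketch below imitates that shortening so you can see where the cut falls. The rule it applies (keep the first 17 characters, then append "...") is an assumption inferred from the output in Listing 2, and ShowTruncationSketch and truncate are hypothetical names, not part of Spark.

```java
public class ShowTruncationSketch {

  // Imitates the shortening seen in Listing 2: a value longer than the
  // limit keeps its first (limit - 3) characters and gets "..." appended.
  // Assumption: this mirrors show()'s default display, it is not Spark code.
  public static String truncate(String value, int limit) {
    if (limit <= 0 || value.length() <= limit) {
      return value; // short enough, shown as is
    }
    if (limit < 4) {
      return value.substring(0, limit); // no room for the ellipsis
    }
    return value.substring(0, limit - 3) + "...";
  }

  public static void main(String[] args) {
    String line = "This Etext file is presented by Project Gutenberg, in";
    System.out.println(truncate(line, 20)); // prints "This Etext file i..."
  }
}
```

If you want to see the complete lines in Spark itself, calling df.show(5, false) instead of df.show(5) turns the truncation off.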

Unlike with other formats, there are no options you need to set with text. It’s that easy! We’ve come to the end of our series and we hope that you’ve found it both enjoyable and informative. If you want to learn more about the book, check it out on liveBook and see the accompanying slide deck.

About the author:
An experienced consultant and entrepreneur passionate about all things data, Jean Georges Perrin was the first IBM Champion in France, an honor he’s now held for ten consecutive years. Jean Georges has managed many teams of software and data engineers.

Originally published at freecontent.manning.com.

Written by Manning Publications