Ingesting Data from Files with Spark, Part 2: JSON

From Spark in Action, 2nd Ed. by Jean Georges Perrin

This is the second in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a JSON file.

Save 37% off Spark in Action: With examples in Java. Just enter code fccperrin into the discount code box at checkout at

In part 1, we discussed ingesting data from a CSV file. In this part, we’re going to discuss ingesting data from a JSON file.

Ingesting a JSON file

Over the last few years, JSON (JavaScript Object Notation) has become the new cool kid in town for data exchange, mainly after REST (Representational State Transfer) supplanted SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language, written in XML) in web services architectures.

JSON is easier to read, less verbose, and imposes fewer constraints than XML. It supports nested constructs like arrays and objects. You can find out more about JSON at

A sub-format of JSON is called JSON Lines. JSON Lines stores each record on a single line, which eases parsing and improves readability. The JSON Lines website includes a small example that, among other things, shows that the format supports Unicode.

Before Spark v2.2.0, JSON Lines was the only JSON format that Spark could read.
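Since v2.2.0, Spark can also read multiline ("pretty-printed") JSON by setting the `multiLine` option on the reader. Here is a minimal Java sketch of the difference; the sample records and the local master are stand-ins for illustration, not part of the foreclosure dataset:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultilineJsonApp {
  public static void main(String[] args) throws Exception {
    // A single pretty-printed (multiline) JSON document -- not JSON Lines.
    Path file = Files.createTempFile("records", ".json");
    Files.write(file, List.of(
        "[",
        "  { \"name\": \"Ada\", \"year\": 2006 },",
        "  { \"name\": \"Grace\", \"year\": 2007 }",
        "]"));

    SparkSession spark = SparkSession.builder()
        .appName("Multiline JSON")
        .master("local[*]")
        .getOrCreate();

    // Without multiLine, Spark treats each physical line as one record and
    // would fail to parse this file; with it, the file is read as one document.
    Dataset<Row> df = spark.read().format("json")
        .option("multiLine", true)
        .load(file.toString());

    df.show();
    spark.stop();
  }
}
```

If your file is JSON Lines, you can simply drop the `multiLine` option: that is the default behavior.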

For your first JSON ingestion, you’re going to use the foreclosure dataset from the city of Durham, NC from 2006 to 2016. You can freely download their datasets from their portal at

Open Durham is the open data portal of the city and county of Durham, NC. They use OpenDataSoft’s solution, which provides data as JSON Lines. For this example, I used Spark v2.2.0 on macOS v10.13.2 with Java 8. The dataset was downloaded in January 2018.

Listing 1 shows three records from the dataset.

Listing 1 Foreclosure data: two first records and the last record

Listing 2 shows an indented (pretty-printed) version of the first record, in which you can see the structure: field names, arrays, and nested structures.

Listing 2 Foreclosure data: pretty print of the first record

Desired output

Listing 3 shows a dataframe’s data and schema after ingesting a JSON Lines document.

Listing 3 Displaying foreclosure records and schema

❶ The “fields” field is a structure with nested fields

❷ The dataframe can contain arrays

❸ When Spark can’t precisely identify a field’s data type, it falls back to a string.

When you see a piece of data like that, aren’t you tempted to group by year to see the evolution of foreclosures, or to display each event on a map to see whether some areas are more subject to foreclosures than others and compare them with average incomes in those areas? This is good: let your inner data scientist come out!
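As a sketch of that kind of exploration, the snippet below counts foreclosures per year. The sample records only loosely imitate the dataset’s nested "fields" structure, and the column path `fields.year` is an assumption; check `printSchema()` on the real data for the actual names:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class ForeclosuresByYearApp {
  public static void main(String[] args) throws Exception {
    // Stand-in data with a nested "fields" structure; the field names and
    // values are assumptions, not records from the Durham dataset.
    Path file = Files.createTempFile("foreclosures", ".json");
    Files.write(file, List.of(
        "{\"fields\":{\"year\":\"2006\",\"address\":\"123 Main St\"}}",
        "{\"fields\":{\"year\":\"2006\",\"address\":\"456 Oak Ave\"}}",
        "{\"fields\":{\"year\":\"2007\",\"address\":\"789 Elm St\"}}"));

    SparkSession spark = SparkSession.builder()
        .appName("Foreclosures by year")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.read().format("json").load(file.toString());

    // Group by the nested year field and count events per year.
    Dataset<Row> byYear = df
        .groupBy(col("fields.year").alias("year"))
        .count()
        .orderBy("year");
    byYear.show();

    spark.stop();
  }
}
```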


As you can imagine, reading JSON isn’t much more complex than ingesting a CSV file, as you’ll see in listing 4.

Listing 4 —

❶ That’s it! This is the only change you have to make to ingest JSON.
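A minimal, self-contained Java sketch of such an ingestion follows. The sample records (OpenDataSoft-style `datasetid`/`recordid` keys) are stand-ins; in practice you’d point `load()` at the downloaded foreclosure file:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonIngestionApp {
  public static void main(String[] args) throws Exception {
    // Throwaway JSON Lines sample standing in for the real dataset.
    Path file = Files.createTempFile("foreclosures", ".json");
    Files.write(file, List.of(
        "{\"datasetid\":\"foreclosure-2006-2016\",\"recordid\":\"rec-1\"}",
        "{\"datasetid\":\"foreclosure-2006-2016\",\"recordid\":\"rec-2\"}"));

    SparkSession spark = SparkSession.builder()
        .appName("JSON ingestion")
        .master("local[*]")
        .getOrCreate();

    // The only change from the CSV version of this code: the format is "json".
    Dataset<Row> df = spark.read().format("json")
        .load(file.toString());

    df.show(5);
    df.printSchema();
    spark.stop();
  }
}
```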

Easier than CSV for sure! Stay tuned for part 3. If you’re interested in some more general information about the book, check it out on liveBook here and see this slide deck.

Originally published at
