ARTICLE

Ingesting Data from Files with Spark, Part 3: XML

From Spark in Action, 2nd Ed. by Jean Georges Perrin

___________________________________________________________________

Save 37% off Spark in Action 2nd Ed. Just enter code fccperrin into the discount code box at checkout at manning.com.
___________________________________________________________________

This is the third in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a XML file.

In part 1 we dealt with ingesting data from a CSV file, and in part 2 we ingested from a JSON file. In this part we’re going to talk about ingesting data from an XML file.

Ingesting an XML file

In this section, we’ll ingest an XML document containing the NASA patents, display some patents and the dataframes’ schema. Note that, in this context, the schema isn’t an XML Schema (or XSD), but a dataframe schema. Quite a few years ago, when I discovered XML, I thought it could become some unified lingua franca of data exchange. XML is:

  • Structured.
  • Extensible.
  • Self-describing.
  • Embeds validation rules through DTD (Document Type Definition) and XSD (XML Schema Definition).
  • XML is a W3 standard. You can read more about XML at: https://www.w3.org/XML/.
  • XML looks like HTML and any other markup-language since SGML:
<rootElement>
<element attribute="attribute’s value">
Some payload in a text element
</element>
<element type="without sub nodes"/>
</rootElement>

Unfortunately, XML is verbose and it’s harder to read than JSON. Nevertheless, XML is still widely used and Apache Spark ingests it nicely.

Figure 1 shows a fragment of the XML file and illustrates the process.

For this XML example, you’re going to ingest the NASA patents. NASA offers various open datasets at https://data.nasa.gov. Listing 1 shows a record of this file.

Listing 1 — NASA patents (excerpt)

<response>                                                        ❶ 
<row ❷
_id="1" ❸
_uuid="BAC69188-84A6-4D28-951E-FC687ACB6D4A" ❸
_position="1" ❸
_address="https://data.nasa.gov/resource/nasa-patents/1"> ❸
<center>NASA Ames Research Center</center>
<status>Issued</status>
<case_number>ARC-14048-1</case_number>
<patent_number>5694939</patent_number>
<application_sn>08/543,093</application_sn>
<title>Autogenic-Feedback Training Exercise Method &amp; System</title>
<patent_expiration_date>2015-10-03T00:00:00</patent_expiration_date>
</row>

</response>

❶ The root element of your list of patents

❷ The element (or tag) designing our record.

❸ Attributes are prefixed by one underscore (_)

Desired output

Listing 2 shows the output of a dataframes’ data and schema after ingesting the NASA patents as an XML document. You can see that the attributes are prefixed by an underscore (_) (attributes had already an underscore as a prefix in the original document, and they have two now) and the element’s name is used as a column name.

Listing 2 — NASA patents in a dataframe

Code

As usual, our code starts by a main() method, which is calling a start() method which creates a Spark session. Listing 3 is the Java code needed to ingest the NASA XML file, then display five records and its schema.

Listing 3 — XmlToDataframeApp.java

package net.jgp.books.sparkWithJava.ch07.lab_300.xml_ingestion;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class XmlToDataframeApp {

public static void main(String[] args) {
XmlToDataframeApp app = new XmlToDataframeApp();
app.start();
}

private void start() {
SparkSession spark = SparkSession.builder()
.appName("XML to Dataframe")
.master("local")
.getOrCreate();

Dataset<Row> df = spark.read().format("xml") ❶
.option("rowTag", "row") ❷
.load("data/nasa-patents.xml");

df.show(5);
df.printSchema();
}
}

❶ Specify XML as the format. Case doesn’t matter.

❷ The element or tag that indicates a record in the XML file.

I had to modify the original NASA document as it contained an element with the same name wrapping the records. Unfortunately, as of now, Spark can’t do it for us. The original structure was:

<response>
<row>
<row _id="1" …>

</row>

</row>
</response>

If the first child of response was rows or anything other than row, I wouldn’t have had to remove it (another option is to rename it).

As the parser isn’t part of the standard Spark distribution, you need to add it to the pom.xml, as described in listing 4. To ingest XML, use a product called spark-xml_2.11 (the artifact), by a company called Databricks, in version 0.4.1.

Listing 4 — pom.xml to ingest XML (excerpt)


<properties>

<scala.version>2.11</scala.version> ❶
<spark-xml.version>0.4.1</spark-xml.version> ❷
</properties>

<dependencies>

<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_${scala.version}</artifactId> ❸
<version>${spark-xml.version}</version> ❹
<exclusions> ❺
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
</exclusion>
</exclusions>
</dependency>

</dependencies>

❶ Scala version on which the XML is built

❷ Version of the XML parser

❸ Equivalent to spark-xml_2.11

❹ Equivalent to 0.4.1

❺ Optional: I took a habit of excluding the logger from other packages to have a better control over the one I use

More details on Spark XML can be found at https://github.com/databricks/spark-xml. Not too complex! Stay tuned for part 4. If you want to learn more about the book, check it out liveBook here and see this slide deck.

About the author:
An experienced consultant and entrepreneur passionate about all things data, Jean Georges Perrin was the first IBM Champion in France, an honor he’s now held for ten consecutive years. Jean Georges has managed many teams of software and data engineers.

Originally published at freecontent.manning.com.

Written by

Follow Manning Publications on Medium for free content and exclusive discounts.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store