The Majestic Role of the Dataframe in Spark
From Spark in Action, Second Ed. by Jean Georges Perrin
In this article, you’ll learn what a dataframe is, how it’s organized, and about immutability.
The essential role of the dataframe in Spark
A dataframe is both a data structure and an API. In this article, we’re going to learn about what the dataframe is and how Spark uses it.
Spark’s dataframe API is used within Spark SQL, streaming, machine learning, and GraphX (Spark’s library for manipulating graph-based data structures), drastically simplifying access to those technologies via a unified API. You won’t need to learn a separate API for each sub-library.
It may seem odd to call a dataframe majestic, but the qualifier suits it perfectly. Just as a majestic artwork attracts curiosity, a majestic oak dominates the forest, and majestic walls protect a castle, the dataframe is majestic in the world of Spark.
Organization of a dataframe
In this section, you’ll learn how a dataframe organizes data. A dataframe is a set of records organized into named columns. It’s equivalent to a table in a relational database or to a ResultSet in Java.
Figure 2 illustrates a dataframe.
Dataframes can be constructed from a wide array of sources, such as files, databases, or custom data sources. The key concept of the dataframe is its API, which is available in Java, Python, Scala, and R. In Java, a dataframe is represented by a dataset of rows: Dataset&lt;Row&gt;.
Dataframes include the schema in the form of a StructType, which can be used for introspection. Dataframes also include a printSchema() method to help you debug your dataframes more quickly. Enough theory, let’s practice!
Immutability isn’t a swear word
Dataframes, as well as datasets and RDDs (resilient distributed datasets), are considered immutable storage.
Immutability is defined as unchangeable. When applied to an object, it means that its state can’t be modified after it’s created.
I think this is counterintuitive: when I first started working with Spark, I had a difficult time embracing the concept. “Let’s work with this brilliant piece of technology designed for data processing, but where the data are immutable. You expect me to process data, but they can’t be changed?”
Figure 3 gives an explanation: in its first state, the data are immutable; then you start modifying them, but Spark stores only the steps of your transformation, not every state of the transformed data.
Let me rephrase that: Spark stores the initial state of the data, in an immutable way, and then keeps the recipe (the list of transformations). The intermediate data aren’t stored.
The “why” is easy to understand: figure 3 illustrates a typical Spark flow with one node, but figure 4 illustrates the same flow with more nodes.
Immutability plays its most important role in a distributed context. There are two ways to handle data spread over several nodes:
- either you store the data and apply each modification immediately on every node, as a relational database does, or
- you keep the data in sync across the nodes and share only the transformation recipe with the different nodes.
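The recipe idea can be sketched in a few lines of plain Python. This is a toy illustration of the principle, not Spark’s actual implementation: the initial data are never mutated, each “transformation” only appends a step to the recipe, and nothing is computed until an “action” asks for a result.

```python
class ToyFrame:
    """Immutable data plus a recipe of transformations, evaluated lazily."""

    def __init__(self, data, recipe=()):
        self._data = tuple(data)      # initial state, never mutated
        self._recipe = tuple(recipe)  # the list of transformation steps

    def map(self, fn):
        # A "transformation": returns a NEW ToyFrame; self is untouched.
        return ToyFrame(self._data, self._recipe + (("map", fn),))

    def filter(self, pred):
        return ToyFrame(self._data, self._recipe + (("filter", pred),))

    def collect(self):
        # An "action": only now is the recipe replayed over the data.
        rows = list(self._data)
        for kind, fn in self._recipe:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:  # filter
                rows = [r for r in rows if fn(r)]
        return rows

base = ToyFrame([1, 2, 3, 4])
result = base.map(lambda x: x * 2).filter(lambda x: x > 4)
print(result.collect())  # → [6, 8]; base still holds [1, 2, 3, 4]
```

In a cluster, shipping this small recipe to each node is far cheaper than shipping every intermediate version of the data, which is exactly the trade-off Spark makes.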
Spark uses the second solution, as it’s faster to sync a recipe than the whole dataset across the nodes. Catalyst is the kid in charge of optimization in Spark processing. Immutability and recipes are cornerstones of this optimization engine.
Although immutability is brilliantly used by Spark as the foundation to optimize data processing, you won’t need to think too much about it as you develop applications.
And that’s where we will stop for now.
About the author:
An experienced consultant and entrepreneur passionate about all things data, Jean Georges Perrin was the first IBM Champion in France, an honor he’s now held for ten consecutive years. Jean Georges has managed many teams of software and data engineers.
Originally published at freecontent.manning.com.