From Data Pipelines with Apache Airflow by Bas P. Harenslak and Julian Rutger de Ruiter
This article gives an overview of the Apache Airflow architecture.
Airflow is a leading platform for developing data pipelines. Its responsibilities include scheduling, restarting, and backfilling batch pipelines, either partially or in full. Airflow isn’t used for streaming pipelines.
In conversations about Airflow we often discuss best practices, such as how to split work across the tasks of a pipeline: should we have a few large tasks that each do a lot of work, or many small tasks that each do a little? Let’s look at a few of Airflow’s primary features.
Directed Acyclic Graph
Most workflow managers require graphs to be acyclic, meaning they aren’t allowed to contain cycles. An acyclic workflow has a well-defined start and end, so it can be scheduled to run at certain times. Because of this acyclic property, workflows are modelled as Directed Acyclic Graphs (DAGs). Airflow allows you to create and schedule pipelines of tasks by creating DAGs. A DAG consists of operators and the dependencies between them, and an operator can wrap a task in virtually any technology, which is essential to Airflow’s flexibility.
DAGs are:
- directed (i.e., every edge has a direction, indicating which task depends on which).
- acyclic (i.e., they can’t contain cycles, so no task can, directly or indirectly, depend on itself).
- graphs (i.e., collections of vertices and edges — here, tasks and their dependencies).
DAGs form the core structure around which Airflow organizes its workflows. Most of the work we do in Airflow involves building workflows as DAGs, which Airflow then executes.
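These properties can be illustrated without Airflow itself, using only Python’s standard library. The task names below are hypothetical; the point is that a valid execution order only exists when the dependency graph is directed and acyclic — exactly what a scheduler like Airflow relies on:

```python
# Illustrating DAG properties with the standard library (no Airflow required).
# graphlib.TopologicalSorter takes a mapping of node -> set of predecessors.
from graphlib import TopologicalSorter, CycleError

# Hypothetical tasks: "clean" depends on "fetch", "aggregate" depends on "clean".
deps = {"clean": {"fetch"}, "aggregate": {"clean"}}

# Because the graph is directed and acyclic, a valid execution order exists.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['fetch', 'clean', 'aggregate']

# Introduce a cycle: now no task can run first, so scheduling is impossible.
deps["fetch"] = {"aggregate"}
try:
    list(TopologicalSorter(deps).static_order())
except CycleError:
    print("cycle detected -- not a valid DAG")
```

Airflow performs the same kind of dependency resolution when deciding which tasks in a DAG are ready to run.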
Airflow operates in the space of batch processes: a finite series of tasks, with clearly defined start and end, run at certain intervals or on certain triggers. Although the concept of workflows also exists in the streaming space, Airflow doesn’t operate there. A framework such as Apache Spark is often run as a single task in an Airflow workflow, triggered by Airflow to execute a given Spark job. When used in combination with Airflow, this is always a Spark batch job and not a Spark streaming job, because a batch job is finite while a streaming job can run forever.
Defined in Python Code
The whole DAG is defined in a Python script. Tasks in a DAG can be generated from various sources, such as a list of filenames, which keeps scripts small and dynamic. The platform is an open-source Python framework, which allows you to easily implement your own operators and hooks in addition to using the huge existing collection of operators and hooks. Such flexibility helped Airflow gain momentum.
Airflow distinguishes itself from other workflow systems by its “feature completeness”: it contains a long and growing list of operators and hooks, it can run distributed, it has an easy-to-use UI for managing workflows, and it allows for the creation of dynamic workflows with Python scripts.
Because Airflow workflows are defined in Python code, they can be flexible and dynamic. We can read a list of files and use a for loop to generate a task for each file using Airflow’s primitives. The result is a concise piece of code, to which you can apply the same programming principles as any other code in your project, such as formatting, linting and version control.
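The pattern looks like this. To keep the sketch self-contained it uses a minimal, hypothetical `Task` stand-in rather than a real Airflow operator (in an actual DAG file each `Task` would be an operator such as a `BashOperator` or `PythonOperator`); the for-loop idea is identical, and the stand-in even mimics Airflow’s `>>` dependency syntax:

```python
# Sketch of dynamic task generation, no Airflow install required.
# Task is a hypothetical stand-in for an Airflow operator.
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    downstream: list = field(default_factory=list)

    def __rshift__(self, other):
        # Mimic Airflow's `a >> b` syntax: b runs after a.
        self.downstream.append(other.task_id)
        return other

# Hypothetical input files; in practice this list could come from
# os.listdir() or a configuration file read at DAG-parse time.
filenames = ["sales_2021.csv", "sales_2022.csv", "sales_2023.csv"]

start = Task("start")
for name in filenames:
    # One task per input file, generated by a plain Python for loop.
    start >> Task(f"process_{name}")

print(start.downstream)
# ['process_sales_2021.csv', 'process_sales_2022.csv', 'process_sales_2023.csv']
```

Adding a file to the list (or to the directory being scanned) adds a task to the workflow without touching the pipeline logic — that is the flexibility the Python-based definition buys you.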
The opposite of configuration as code can be found in other workflow systems, such as Oozie, where workflows are configured in static XML files, or Azure Data Factory, where pipelines are configured in a web interface or an ARM template. Every way of defining workflows has pros and cons, but the philosophy of doing anything “as code” is a software engineering approach valued by many developers around the world. The ability to keep your configuration in the same repository, style and conventions as your application logic is often much appreciated. Tracking changes in version control lets you roll back in case of issues and makes your workflows reproducible at any point in time. And developing workflows in your favorite IDE gives you features such as autocompletion and automatic detection of issues, which is helpful during development.
Scheduling and Backfilling
Airflow workflows can be started in various ways: by a manual action, by external triggers, or at scheduled intervals. In all cases, a series of tasks runs every time the workflow is started. This might work fine as long as the workflow’s logic stays the same, but at some point you might want to change it, for example to make the daily workflow aggregate differently in the “aggregate three input tables” task.
You can change your workflow to produce the new aggregates and run the new version from now on. It might also be desirable to apply the new logic to all previously completed runs, i.e., to all historical data. This is possible in Airflow with a mechanism called backfilling: running workflows back in time. Backfilling is only possible if external dependencies, such as the availability of the input data, can still be met. What makes it especially useful is the ability to rerun partial workflows: if fetching historical data isn’t possible, or is a lengthy process you’d like to avoid, you can rerun only part of the workflow. Backfilling preserves all task constraints and provides each task with a runtime context as if time were reverted to that point. Besides backfilling, Airflow provides several other constructs to manage the lifecycle and execution of tasks and workflows.
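Conceptually, backfilling starts from a schedule and a historical window and enumerates the runs that need to be (re)executed, each with its own historical runtime context. The dates below are hypothetical, and Airflow computes this for you (via its `airflow dags backfill` command); this is just a minimal sketch of the enumeration:

```python
# Sketch of what a backfill computes: the scheduled run dates in a
# historical window that must be re-executed with the new logic.
from datetime import date, timedelta

def backfill_dates(start: date, end: date, every: timedelta) -> list[date]:
    """Return each scheduled run date in [start, end], oldest first."""
    runs, current = [], start
    while current <= end:
        runs.append(current)
        current += every
    return runs

# Hypothetical window: rerun the daily workflow for the first 5 days of 2021.
runs = backfill_dates(date(2021, 1, 1), date(2021, 1, 5), timedelta(days=1))
print(len(runs))  # 5 daily runs, each executed as if on its original date
```

Each historical run receives the runtime context (such as the execution date) of its original slot, which is what lets tasks reprocess exactly the data of that day.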
That’s all for this article.