ARTICLE

Batch queries with Azure Data Lake Analytics

From Azure Storage, Streaming, and Batch Analytics by Richard Nuckolls

This article discusses using Azure Data Lake Analytics (ADLA) for batch queries.

_____________________________________________________________

Take 37% off Azure Storage, Streaming, and Batch Analytics. Just enter fccnuckolls into the discount code box at checkout at manning.com.
_____________________________________________________________

Azure Data Lake Analytics (ADLA) fills the role of batch processor in the Lambda batch layer. ADLA uses the unbounded fast storage available in Azure, along with readily available processing nodes, to make analyzing file-based data sets as approachable as analyzing database data sets. An ADLA job reads data files, filters the data, executes calculations, and outputs new data sets to files. You define a list of files to read, which makes up a file set. The list can be a single file, an array of files, or a wildcard declaration to select multiple files and folders. The ADLA job defines the read, write, filter, and calculation operations with the U-SQL language, a unique coding language that combines aspects of SQL and C#.

Figure 1. Data flow through the ADLA cluster

U-SQL manages loading, filtering, and calculations on data with rowsets. A rowset describes the output of a data operation, either a load or a calculation. Rowsets are generated from loads of data within the file sets. The calculations are defined as expressions describing the shape of the data output, rather than the row-by-row operations to perform. This makes U-SQL a declarative language like T-SQL. All the U-SQL statements that define a batch process are collected into a U-SQL script, which is loaded and compiled in a U-SQL job. The compiler transforms the declarative U-SQL script into the row-by-row operations to perform in the job run. This compilation complicates debugging the intermediate results of the calculations, because the evaluation of the expressions isn’t complete until the job is complete. In practice, this means you get the output of the U-SQL job at the end, or not at all. Let’s see what this looks like by looking at an ADLA U-SQL job.

Elements of a Data Lake Analytics job

ADLA U-SQL jobs read data files, transform data, and write out data in a batch.

U-SQL jobs can read data files from an Azure Data Lake store (ADL) or Azure Blob storage. The U-SQL job writes files to ADL stores or Azure Blob storage. Figure 2 describes a job which reads multiple files from an ADL store and writes multiple files to an ADL store.

Figure 2. Reading and writing data files with ADLA

New files land in the Staging folder before being processed and saved in the Raw folder. Here’s how this process works, with U-SQL jobs driving the data processing and movement; a minimal sketch of the flow follows Figure 3.

  1. Unprocessed files are stored in the Staging folder.
  2. Staging files are read into U-SQL jobs.
  3. Simple and multi-step U-SQL jobs process the data.
  4. Processed data is written out to the Raw folder.
Figure 3. Reading and writing data files with ADLA
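
The rest of the article unpacks each piece of this flow. As a quick preview, a minimal sketch of steps 2 through 4 might look like the following; the file paths, field names, and types here are illustrative assumptions, not the book’s exact schema.

DECLARE @stagedFile string = "/Staging/pitcherstats/v1/2019-07-01-13-0.csv";   // illustrative input path
DECLARE @rawFile string = "/Raw/pitcherstats/v1/2019-07-01-13-0.csv";          // illustrative output path

@staged =
    EXTRACT Player string,       // assumed field names and types
            Node int,
            NodeValue decimal
    FROM @stagedFile
    USING Extractors.Csv();      // read the staged CSV file

OUTPUT @staged
TO @rawFile
USING Outputters.Csv();          // write the processed rows to the Raw folder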

To create this workflow, ADLA builds a unique application for each job, using a U-SQL script to define the work of the job. To build this application, the U-SQL script describes four elements.

  • A file reader, called an Extractor, matched to the type of files being read
  • A file writer, called an Outputter, matched to the type of file being written
  • A list of files to read and a list of files to write
  • A set of calculations to perform on the data

With these four elements, you can build complex file-data-transformation jobs. By combining file selectors, extractors, expressions, and outputters, you can define analysis jobs which run over small and large amounts of data. The U-SQL script is a file with a .usql extension that contains U-SQL commands. You upload the U-SQL script to a new ADLA U-SQL job via the Azure Portal, via Azure PowerShell, or using another tool like Azure Data Factory. Let’s look at each element before putting them together in a U-SQL script.

Extractors

Extractors do the work of reading files. ADLA includes three extractors by default: CSV, TSV, and Text. You define a read expression by using the EXTRACT command. The EXTRACT command has four parts: a variable assignment, a schema definition, a FROM clause, and a USING clause. Listing 1 shows a typical Extractor expression.

The U-SQL statements in Listing 1 contain six important parts.

  1. DECLARE begins a variable declaration.
  2. Variable names begin with the @ sign.
  3. Standard C# types are available for variables.
  4. String and DateTime types enclose their values in double-quotes, integers don’t.
  5. Row set names also begin with the @ sign.
  6. All U-SQL statements end with a semicolon.

Listing 1. CSV EXTRACT statement.

DECLARE @in string = "/Staging/pitcherstats/v1/2019-10-11-12-0.csv";❶
@input = ❷

EXTRACT
Id Guid, ❸
Player string,
Node int,
[Value] decimal,
EventTime DateTime,
PartitionId int,
EventEnqueuedUtcTime DateTime,
EventProcessedUtcTime DateTime
FROM @in
USING Extractors.Csv(); ❺

❶ Assign this expression statement to a variable.

❷ Call EXTRACT.

❸ Define the field names and types that make up the expression schema.

❹ Specify a string variable listing the file to read.

❺ Use the Extractors built-in CSV class.

Extractors implement schema-on-read functionality. You define the field order and field type to be read from the file in the EXTRACT expression. The Extractor reads the file and does a conversion for you. U-SQL relies on the underlying C# Type.Parse() function to do the conversions.
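
The built-in extractors also accept optional parameters that tune this schema-on-read behavior; for example, skipFirstNRows skips leading header rows and silent tolerates malformed rows instead of failing the job, and a nullable C# type (such as decimal?) lets a field accept empty values. Here’s a minimal sketch, assuming the staged CSV files carry a header row; the path and field list are illustrative.

DECLARE @in string = "/Staging/pitcherstats/v1/2019-10-11-12-0.csv";
@input =
    EXTRACT Player string,
            Node int,
            [Value] decimal?,            // nullable type accepts empty values during conversion
            EventTime DateTime
    FROM @in
    USING Extractors.Csv(
        skipFirstNRows: 1,               // assumed header row, skipped before conversion
        silent: true);                   // skip malformed rows rather than failing the job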

Outputters

Outputters write the U-SQL job rowsets to files in Azure Data Lake store. ADLA includes three outputters by default: CSV, TSV, and Text. You define an output expression by using the OUTPUT command. The OUTPUT command has three parts: a rowset reference, a TO clause, and a USING clause. Listing 2 shows a typical Outputter expression.

Listing 2. CSV OUTPUT statement.

DECLARE @out string = "/Staging/pitcherstats/v1/2019-10-11-12-0.csv"; ❶
OUTPUT @Players ❷
TO @out ❸
USING Outputters.Csv(); ❹

OUTPUT @Players TO "/Raw/pitcherstats/v1/pitcher/2019-10-11-12-0.tsv" ❺
USING Outputters.Tsv(); ❻

❶ Outputters write to a single file

❷ Reference a previously declared rowset

❸ TO using a file path variable

❹ Use the Outputters built-in CSV class

❺ A full path to Blob storage file

❻ Use the Outputters built-in TSV class

Outputters use the schema of the referenced rowset. Outputter file paths can use relative paths to the default Azure Data Lake store, or use direct paths to Data Lake stores or Blob storage. File selector variables work with Outputters too, but are limited to a single file. Only one wildcard is allowed in output file selectors.
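
The built-in outputters take optional parameters as well; for example, outputHeader writes the rowset’s column names as the first row of the output file. A minimal sketch, assuming a @Players rowset like the one in Listing 2 and an illustrative output path:

OUTPUT @Players
TO "/Raw/pitcherstats/v1/player-summary.csv"    // illustrative output path
USING Outputters.Csv(outputHeader: true);       // write column names as the first row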

File selectors

File selectors define the list of files read by the extractor. The simplest form specifies a single file by path and filename. The FROM clause of the Extractor function takes a string, string variable, or string array as an input, to select the files to read. The following statement creates a variable selecting a single file.

DECLARE @in string = "/Staging/pitcherstats/v1/2019-07-01-13-0.csv;

You build these variables as inputs to an Extractor function. The text gets parsed by the function before being used to read the files. Because of this parsing, you can instruct the Extractor with special syntax. For example, you can pass in a wildcard for a portion of the filename.

DECLARE @in string = "/Staging/pitcherstats/v1/2019-07-01-{*}.csv";

Multiple wildcards work in input file paths, but only one wildcard is allowed in output file paths. You can also extract values from the filename using the wildcard brackets format.

Dates are commonly used for folder and filename syntax, which makes constructing a date extraction a straightforward example. The extracted value becomes an extra field in the file schema, available in the read expression.

DECLARE @in string = "/Staging/pitcherstats/v1/
{filedate:yyyy}-{filedate:MM}-{filedate:dd}-{*}.csv";

In this example, a field called filedate is constructed during the file read. When the extractor uses this @in variable, it parses the filenames found in the folder path v1, collects the data for filedate, which you define as a DateTime, and adds a new filedate field to each row read from each file.

Table 1. Constructing a field from file name
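
Because filedate becomes a regular column, you can declare it in the EXTRACT schema, select it, and filter on it; ADLA then only reads files whose names satisfy the filter. Here’s a sketch of that pattern, with illustrative field names:

DECLARE @in string = "/Staging/pitcherstats/v1/{filedate:yyyy}-{filedate:MM}-{filedate:dd}-{*}.csv";

@input =
    EXTRACT Player string,
            Node int,
            [Value] decimal,
            filedate DateTime                      // virtual column built from the file name
    FROM @in
    USING Extractors.Csv();

@julyRows =
    SELECT Player, Node, [Value], filedate
    FROM @input
    WHERE filedate >= new DateTime(2019, 7, 1)
        AND filedate < new DateTime(2019, 8, 1);   // only files named with July 2019 dates are read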

The DECLARE command creates a new variable of the specified C# type. All variables are constants in U-SQL: you can’t change the value once declared, unless you use the EXTERNAL keyword. A DECLARE EXTERNAL @variable declaration allows the variable to be overwritten.

DECLARE EXTERNAL @Year int = 2019;
// Year = 2019

This allows tools to inject their own parameters into the U-SQL script. Without a variable parameter injected, the script uses the original value. The injected variable parameters are added to the beginning of the U-SQL job by the tool which submits the job.

DECLARE @Year int = 2020;
DECLARE EXTERNAL @Year int = 2019;
// Year = 2020

Using an injected variable parameter looks like Listing 3.

Listing 3. Using injected variables.

--Start of file--
DECLARE @Year int = 2020; ❶
DECLARE EXTERNAL @Year int = 2019; ❷
// Year = 2020 ...
--End of file--

❶ Injected variable

❷ Overridden variable


Injecting variable parameters into U-SQL jobs comes into play with Azure Data Factory when you configure automation and scheduling for your analytics system. The Azure Data Factory pipeline submits your job file to ADLA.

Expressions

Expressions create and transform rowsets with the rowset data read by the Extractors. An expression can create new calculated columns, join multiple rowsets, and filter and group rows, as you would with other SQL languages. An expression references a previously declared rowset or defines one with a row enumeration. Listing 4 shows an example of creating a rowset using enumeration. This example defines a rowset with three rows, and defines a second rowset based on the first rowset, including an aggregation column.

Listing 4. Building rowsets with expressions.

@PlayerList = ❶
    SELECT * FROM
        (VALUES
            ("abera101", 13, 15.0, new DateTime(2019,9,30)), ❷
            ("abera101", 13, 18.0, new DateTime(2019,9,30)), ❸
            ("jstro102", 4, 220.10, new DateTime(2019,10,10)) ❹
        ) AS D(Player, Node, NodeValue, EventTime); ❺

@Players = ❻
    SELECT Player, ❼
           Node,
           AVG(NodeValue) AS AvgValue, ❽
           EventTime
    FROM @PlayerList ❾
    GROUP BY Player, Node, EventTime;

❶ Rowset 1 name, preceded by @ sign

❷ ❸ ❹ C# format for types

❺ Column names for rowset

❻ Rowset 2 name

❼ SELECT fields to include in rowset

❽ Calculated columns must include “AS [columnName]”

❾ FROM rowset 1; the statement ends with a semicolon

These are just two examples of the expressions you can write. Expressions in U-SQL can become quite complicated, and benefit from advanced knowledge of SQL, but don’t let that stop you! The ADLA compiler helps optimize the job, and you can refactor and rerun jobs. Remember, maintaining the original data files is one of the key tenets of Lambda architecture. As long as you follow this tenet, you can execute batch jobs again and again without risk to the data.
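
To put the pieces together, here’s one more hedged sketch: it reuses the @Players rowset from Listing 4, adds a small hypothetical @Nodes lookup rowset, filters rows with WHERE, joins the two rowsets, and writes the result to an illustrative Raw path. Note that U-SQL join conditions use the C# equality operator (==).

@Nodes =
    SELECT * FROM
        (VALUES
            (13, "Elbow"),
            (4, "Shoulder")
        ) AS N(Node, NodeName);                           // hypothetical lookup of node names

@highAverages =
    SELECT p.Player,
           n.NodeName,
           p.AvgValue
    FROM @Players AS p
         INNER JOIN @Nodes AS n ON p.Node == n.Node       // C# equality operator in the join condition
    WHERE p.AvgValue > 15.0;                              // WHERE takes a C# boolean expression

OUTPUT @highAverages
TO "/Raw/pitcherstats/v1/high-averages.csv"               // illustrative output path
USING Outputters.Csv();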

If you want to learn more about the book, check it out on our browser-based liveBook reader here.
