From Azure Data Engineering by Richard Nuckolls
This article delves into Azure’s tools for data engineering and why you should consider using them.
What is data engineering?
Collecting data seems like a simple activity. Take reporting web site traffic as an example. A single user, during a site in a web browser, requests a page. A simple site could respond with an HTML file, a CSS file, and/or an image. This example could represent one event, three events, or four events. What if there’s a page redirect? This is another event. What if we want to log the time taken to query a database? What if we retrieve some items from cache? All of these pieces of data are common log data points today.
Now add more user interaction, like a comparison page with multiple sliders. Each move of the slide logs a value. Tracking user mouse movement returns hundreds of coordinates. Consider a connected sensor with a 100Hz sample rate. It can easily record over eight million measurements a day. When you start to scale to thousands and tens of thousands of simultaneous events every point in the pipeline must be optimized for speed until the data comes to rest. Once at rest the data must remain secure.
Data engineering is the practice of building data storage and processing systems. Robert Chang, in his book A Beginners Guide to Data Engineering, describes the work as designing, building, and maintaining data warehouses. Data engineering creates scalable systems which allow analysts and data scientists to extract meaningful information from the data.
What do data engineers do?
Most businesses have multiple sources generating data. Manufacturing companies track the output of the machines, the output of employees, the output of their shipping departments. Software companies track their user actions, their software bugs per release, their developer output per day. Service companies check number of sales calls, time to complete tasks, usage of parts stores, and cost per lead. Some of this is small scale; some of it is large scale.
Analysts and managers might operate on narrow data sets, but large enterprises increasingly want to find efficiencies across divisions, or find root causes in multifaceted systems failures. In order to extract value from these disparate sources of data, engineers have built large scale storage systems as a single repository of data. A software company may implement centralized error logging. The service company may integrate their CRM, billing, and finance systems. Engineers need to plan for the ingestion pipeline, the storage Systems, and reporting services across multiple groups of stakeholders.
As a first step data consolidation means a large relational database. Analysts review reports, CSV files, and even spreadsheets in Excel, in an attempt to get clean and consistent data. Often developers or database administrators prepare scripts to import the data into databases. In the best case experienced database administrators define common schema and plan partitioning and indexing. The database enters production. Data collection commences in earnest.
Typical data systems based on storing data in relational databases have problems with scale. A single database instance, which is the simplest implementation, always becomes a bottleneck given increased usage. A finite amount of CPU cores and drive space are available on a single database instance. Scaling up can only go part of the way before IO bottlenecks prevent meeting response time targets. Distributing the database tables across multiple servers, or sharding, can enable greater throughput and greater storage at the cost of greater complexity. Even with multiple shards, database queries under load display greater and greater latency. Eventually query latency grows too large to satisfy the requirements of the application.
The open source community answered the challenge of building web-scale data systems. Hadoop opens vast disk storage for access with ease of maintenance. Spark provides a fast and highly available logging endpoint. NoSQL gives users a way to access large stores of data quickly. Languages like Python and R make deep dives into huge flat files possible. Analysts and data scientists write algorithms and complex queries in order to draw conclusions from the data. But this new environment still requires system administrators to build and maintain servers in their data center.
How does Microsoft define data engineering?
The systems architecture using these new open source tools looks quite different from the traditional database-centric model. In his landmark book, Nathan Marz coined a new term: Lambda Architecture. He defined this new architecture as a “general-purpose approach to implementing an arbitrary function on an arbitrary dataset and having the function return its results with low latency” (Marz, 7). The goals of Lambda architecture cover many of the inherent weaknesses of the database centric model. Figure 1 shows a general view of the new approach to saving and querying data. Data flows into both the Speed layer and the Batch layer. The Speed layer prepares data views of the most recent period in real time. The Serving layer delivers data views over the entire period, updated at regular intervals. Queries get data from the Speed layer, the Serving layer, or both, depending on the time period queried.
Figure 1 describes an analytics system using a lambda architecture. Data flows through the system from acquisition to retrieval via two paths: batch and stream. All data lands in long term storage, and scheduled or ad-hoc queries generate refined data sets from the raw data. This is the batch process. Data with short time windows for data retrieval run through an immediate query process, generating refined data in near-real time. This is the stream process.
·Data is generated by applications, devices, or servers.
·Each new piece of data is saved to long-term file storage.
·Each new piece of data is also sent to a stream processor.
·A scheduled batch process reads the raw data.
·Both stream and batch processes save query output to a retrieval endpoint.
·Users query the retrieval endpoint.
Figure 2 shows the core principle of Lambda architecture: data flows one way. Only new data is added to the data store; raw data is never updated. Batch processes yield data sets by reading the raw data, and deposit the data sets in a retrieval layer. A retrieval layer handles queries.
Faults caused by human error account for the largest problems in operating an analytics system. Lambda architecture mitigates these errors by storing the original data immutably. An immutable data set, where data is written once, read repeatedly, and never modified, doesn’t suffer from corruption due to incorrect update logic; bad data can be excluded. Bad queries can also be corrected and run again. The output information remains one step removed from the source. In order to facilitate fast writes, new bits of data are only appended. Updates to existing data doesn’t happen. To facilitate fast reads, two separate mechanisms converge their outputs. The regularly scheduled batch process generates information as output from queries over the large data set. Between batch executions, incoming data undergoes a similar query to extract information. These two information sets together form the entire result set. An interface allows retrieving the combined result set. Because writes, reads, queries, and requests handling execute as distributed services across multiple servers, the Lambda architecture scales both horizontally and vertically. Engineers can add more servers that are also more powerful. Because all of the services operate as distributed nodes, hardware faults are simple to correct, and routine maintenance work has little impact on the overall system. Implementing a Lambda architecture achieves the goals of fault tolerance, low latency reads and writes, scalability, and easy maintenance.
Six functions make up the core of this architectural design pattern. Mike Wassen describes the architecture pattern for Microsoft in the Big Data Architecture Style Guide. Lambda architecture maps onto this style using the following functions.
Large scale data ingestion happens one of two ways: a continuous stream of discrete records, or a batch of records encapsulated in a package. Lambda architecture handles both methods with aplomb. Incoming data in packages is stored directly for later batch processing. Incoming data streams are processed immediately, and packaged for later batch processing. Eventually all data becomes input for query functions.
Distributed file systems decouple saving data from querying data. Data files are collected and served by multiple nodes. More storage is always available by adding more nodes. The Hadoop Distributed File System (HDFS) lies at the heart of most modern distributed storage systems designed for analytics.
A distributed query system handles partitioning a single query into multiple executable units and executing them over multiple files. In Hadoop analytics systems the MapReduce algorithm handles distributing a query job over multiple nodes as a two-step process. Each Hadoop cluster node maps (Map) requested data to a single file, and the query returns results from that file. The results from all the files are combined, and the resulting set of data is reduced (Reduce) to a set fulfilling the query. Multiple cluster nodes divide the Map tasks and Reduce tasks between them. The MapReduce algorithm enables efficient querying of large scale collections. New queries can be set for scheduled updates or submitted for a single result. Multiple query jobs can run simultaneously, each using multiple nodes.
A real time analysis engine monitors the incoming data stream and maintains a snapshot of the most recent data. This snapshot contains the new data from when the last scheduled query was executed. Queries update result sets in the data retrieval layer. Usually these queries duplicate the syntax or output of the batch queries over the same period.
A scheduling system runs queries using the distributed query system against the distributed file system. The output of these scheduled queries becomes the result set for analysis. More advanced systems include data transfers between disparate systems. The orchestration function typically moves result sets into the data retrieval layer.
Lastly, an interface for collating and retrieving results from the scheduled and snapshot data sets gives the end user a low latency endpoint for information. This layer often relies on ubiquitous SQL to return results to analysis tools. Together these functions fulfill the requirements of the data analysis system.
What tools does Azure provide for data engineering?
Cloud systems promise to solve challenges with processing large scale data sets.
·Processing power limitations of single-instance services
·Storage limitations and management of on-premises storage systems
·Technical management overhead of on-premises systems
By using Azure cloud technologies, many difficulties in building large scale data analytics systems on-premises are eliminated. By automating the setup and support of servers and applications, the expertise of system administrators can be used elsewhere. Ongoing expense of hardware can be minimized. Redundant systems are provisioned as easily as single instances. The packaged analytics system is easy to deploy.
Several cloud providers have abstracted the complexity of the Hadoop cluster and its associated services. Microsoft responded to the development of the Hadoop ecosystem with cloud-based systems hosted in Azure. The system is branded HDInsight. According to Jason Howell, HDInsight is “a fully managed, full spectrum, open source analytics service for enterprises.” The data engineer can build a complete data analytics system using HDInsight, and the common data tools associated with Hadoop. Many data engineers, like those familiar with Linux and Apache Software, choose HDInsight when building a new data warehouse in Azure. Configuration approaches, familiar tools, Linux-specific features, and Linux-specific training materials are some of the reasons engineers familiar with Linux choose HDInsight.
Microsoft also built a set of abstracted services in Azure which perform the functions required for a data analysis system, but without the distinct paradigm of Linux and Apache. Along with the services, Microsoft provides a reference architecture for building a big data system. The model guides engineers through some high-level technology choices when using the Microsoft tools.
“A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.”
— Mike Wasson, author of Big data architecture style
This model covers common elements of the Lambda architecture, including data storage, batch and stream processing, and variations on an analysis retrieval endpoint. The model describes additional elements that are necessary but undefined in the Lambda model. For robust and high performance ingestion, a message queue can pass data to both the stream process and the data store. A query tool for data scientists gives access to aggregate or processed information. An orchestration tool schedules data transfers and batch processing.
Microsoft lays out the skills and technologies used by data engineers using Azure services as part of its certification “Azure Data Engineer Associate”. Azure Data engineers are described as those who “design and implement the management, monitoring, security, and privacy of data using the full stack of Azure data services to satisfy business needs.”
That’s all for this article, but you can find out which technologies Microsoft developed for analytics in chapter 2 of Azure Data Engineering , and you can also check other chapters of the book out on liveBook here and see this slide deck for more general information about the book.
Originally published at https://freecontent.manning.com.