From Azure Data Engineering by Richard Nuckolls
This article discusses Azure Cloud Services and its components.
Microsoft Azure is a cloud services provider. This means Azure provides data center services and software that an enterprise traditionally hosted in its offices, in a data center building of its own, or in a hosting provider’s data center. Information technology resources that an enterprise hosts itself are referred to as “on-premises” resources. This distinguishes them from resources hosted “in the cloud”. IT engineers usually have physical access to on-premises resources, but not to cloud resources.
Cloud services providers, like Microsoft Azure and Amazon Web Services, provide three main types of services, classified by how much of the underlying operating system and software the end user manages.
The lowest level of abstraction is Infrastructure as a Service (IaaS). IaaS services provide resources like virtual machines, routers, and firewalls. The provider manages the hardware in its data center, and the end user manages the operating system and software. IaaS resources require technical and developer support to manage operating system and software installation, and to create code to run on the servers.
The next level of abstraction is Platform as a Service (PaaS). PaaS services provide server application hosting such as web servers, databases, and storage. The provider manages the hardware and operating system running in its data center, and manages the server applications running on the operating system. The end user configures and uses the applications. PaaS resources require developer support to create code to run on the server applications.
The highest level of abstraction is Software as a Service (SaaS). SaaS services provide user applications delivered over the internet. Typical SaaS applications include web-based email and web-based file sharing services, which charge a subscription. The SaaS provider manages all aspects of the hardware, operating system, and software. The end user configures and uses the application. Microsoft has transitioned many of its operating systems and desktop and server applications to IaaS, PaaS, or SaaS resources available in Azure.
Microsoft Azure offers both open-source and Microsoft technologies in its cloud services. Azure provides HDInsight for Hadoop engineers and data scientists. HDInsight manages containerized Hadoop processing nodes, with plenty of configuration access and overhead. Azure also provides Databricks, a SaaS abstraction of the Apache Spark analytics engine. Both provide viable options for operating large analytics systems in the cloud. A third option exists for the Microsoft technologist. By using the tight integrations provided by the Azure products, the Microsoft data engineer can build a sophisticated and flexible analytics system using familiar technologies like C#, SQL, and Git. This article discusses these services and how to use them to build a complete analytics system in the cloud.
Let’s look at each of these services.
Azure Event Hubs provides a PaaS scalable message queuing endpoint, including built-in integrations with Azure Storage and Stream Analytics. Our analytics system uses Event Hubs as the entry point to our data processing pipeline. Using Event Hubs provides our system with a scalable and reliable buffer to handle spikes in the volume of incoming events. Event Hubs accepts both HTTP and Advanced Message Queuing Protocol (AMQP) packets for event messages. Plenty of clients are available for these protocols in your language of choice. These message queues can be read by one or more subscribers.
Event Hubs scales in two ways. First, the endpoint processes incoming messages with throughput units, each a measure of maximum throughput at a fixed cost. Adding more throughput units allows a higher message rate, at a higher cost. Second, Event Hubs partitions the queue. Adding more partitions allows the Event Hub to buffer more messages and to support more parallel reads by subscribers.
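As a rough illustration of these two scaling dimensions, the Python sketch below estimates how many throughput units a given ingress load might need and shows how a partition key can spread events across partitions. The per-unit limits (about 1 MB per second or 1,000 events per second of ingress) are assumptions based on published standard-tier figures and should be checked against current Azure documentation. The hashing scheme and the function names are hypothetical and are not what Event Hubs uses internally.

```python
import math
import zlib

# Assumed standard-tier ingress limits per throughput unit; verify before relying on them.
INGRESS_BYTES_PER_TU = 1_000_000   # about 1 MB per second
INGRESS_EVENTS_PER_TU = 1_000      # events per second


def throughput_units_needed(events_per_second: int, avg_event_bytes: int) -> int:
    """Return the smallest number of throughput units covering both ingress limits."""
    by_count = math.ceil(events_per_second / INGRESS_EVENTS_PER_TU)
    by_size = math.ceil(events_per_second * avg_event_bytes / INGRESS_BYTES_PER_TU)
    return max(1, by_count, by_size)


def partition_for(partition_key: str, partition_count: int) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    return zlib.crc32(partition_key.encode("utf-8")) % partition_count


if __name__ == "__main__":
    print(throughput_units_needed(events_per_second=2_500, avg_event_bytes=600))
    for user in ("user-1", "user-2", "user-3"):
        print(user, "->", partition_for(user, partition_count=4))
```

Because the same key always maps to the same partition, events for one user stay in order within that partition while different partitions can be read in parallel.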
Azure Stream Analytics processes streaming data. Streaming data is ordered by a time element, which is why it’s often referred to as events or event data. Stream Analytics accepts streams of data from Event and IoT Hubs and Blob Storage, and outputs processed information to one or more Azure endpoints. It uses a structured query language to query the data stream. The data process can be thought of as fishing a river with a net made of particular shapes. The data flows by the net, and the net captures the matching bits. The fisherman hauls in the net regularly to review his catch. Similarly, the queries pull result sets out of the stream as it flows by.
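To make the net metaphor concrete, here is a minimal Python sketch of the idea, not the Stream Analytics query language itself. It filters a stream for one kind of event and counts matches in fixed, non-overlapping time windows. The `Event` fields and the function name are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: int   # seconds since some epoch (hypothetical field)
    kind: str        # for example "pageview" or "error" (hypothetical field)


def count_per_window(events: Iterable[Event], kind: str, window_seconds: int) -> dict:
    """Count events of one kind in fixed, non-overlapping time windows.

    The filter is the shape of the net; each window is one haul.
    Keys of the result are window start times.
    """
    counts = Counter()
    for event in events:
        if event.kind != kind:
            continue  # does not match the net, so it flows on by
        window_start = event.timestamp - (event.timestamp % window_seconds)
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    stream = [Event(1, "pageview"), Event(3, "error"), Event(8, "error"),
              Event(12, "error"), Event(14, "pageview"), Event(27, "error")]
    print(count_per_window(stream, kind="error", window_seconds=10))
```

A real streaming query works on an unbounded stream and emits each window's result as the window closes; this sketch processes a finite list only to show the filtering and grouping.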
Stream Analytics scales in two ways. First, each Stream Analytics job can use one or more streaming units, a synthetic metric describing CPU and memory allocation. Each step in the job uses between one and six streaming units. Second, planning parallelism in the stream queries allows Stream Analytics to take advantage of the available parallel processes. For example, writing data to a file in Azure Storage or Data Lake Store can use multiple connections in parallel. Writing data to a SQL Server table uses a single connection, for now. At most, a single query operation can use six streaming units. Each Stream Analytics job can have more than one query operation. Planning the streaming unit allocation along with the query structure allows for maximum throughput.
Data Lake Store
Azure Data Lake Store stores files. It provides a folder structure interface over an Apache Hadoop file system, which supports petabytes of data. Multiple open-source and native Azure cloud services integrate with Data Lake Store. Fine-grained access control via integration with Azure Active Directory makes securing files a familiar exercise.
Data Lake Analytics
Azure Data Lake Analytics (ADLA) brings scalable batch processing to the Data Lake Store and Blob Storage. ADLA jobs use familiar SQL syntax to read files, query the data, and output result files over data sets of any size. Because ADLA uses a distributed query processor over a distributed file system, batch jobs can be executed over multiple nodes at once. Running a job with parallel processing is as simple as moving a slider past one.
Azure Data Lake Analytics uses a new coding language called U-SQL. U-SQL is “not ANSI SQL.” (Rhys 1) For starters, WHERE clauses use C# syntax. Declarative statements can be extended with C# functions. Query data comes from tables or files.
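The shape of such a batch job can be sketched in Python. This is not U-SQL and does not call ADLA; it only illustrates the pattern described above: read several files, filter rows with a predicate written in a general-purpose language, aggregate each file in parallel, and merge the results. The file names, row layout, and function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input: each "file" is a list of (user, page) rows.
FILES = {
    "part-0.csv": [("ann", "/home"), ("bob", "/docs"), ("ann", "/docs")],
    "part-1.csv": [("cat", "/docs"), ("bob", "/home"), ("cat", "/blog")],
}


def count_pages(rows, predicate) -> Counter:
    """Filter one file's rows and count page views (the per-node work)."""
    return Counter(page for user, page in rows if predicate(user, page))


def run_batch(files: dict, predicate, parallelism: int = 2) -> dict:
    """Process files in parallel, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = pool.map(lambda rows: count_pages(rows, predicate), files.values())
        total = Counter()
        for partial in partials:
            total += partial
    return dict(total)


if __name__ == "__main__":
    # The predicate plays the role of a WHERE clause.
    print(run_batch(FILES, lambda user, page: page != "/blog"))
```

The `parallelism` argument stands in for the slider mentioned above: the result is the same at any setting, and only the amount of work done at once changes.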
SQL Data Warehouse
SQL Data Warehouse (SQLDW) bears a superficial resemblance to a standard SQL Server database, like Azure SQL databases. Most functionality matches: CRUD actions, views, stored procedures. The differences lie in table creation, indexing, and partitioning. The naive user could create a table, insert some data, and use the Data Warehouse like a SQL Server database.
Harnessing the power of SQL Data Warehouse comes from understanding the distributed nature of the underlying technology. Data resides in sixty shards, managed automatically. Queries are distributed across compute nodes, from one to sixty based on your configuration. Data imports from multiple files are spread across available compute nodes. The user controls scaling compute capacity, but storage relies on Azure Storage for scaling and redundancy.
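The relationship between the sixty shards and the compute nodes can be sketched in a few lines of Python. The hash function and the round-robin assignment of shards to nodes are assumptions made for illustration; the service manages the real placement automatically, and the function names here are hypothetical.

```python
import zlib
from collections import defaultdict

DISTRIBUTION_COUNT = 60  # the fixed number of shards described above


def distribution_for(key: str) -> int:
    """Pick a shard for a row from its distribution-column value (illustrative hash)."""
    return zlib.crc32(key.encode("utf-8")) % DISTRIBUTION_COUNT


def shards_per_node(compute_nodes: int) -> dict:
    """Assign the sixty shards to compute nodes round-robin (assumed layout)."""
    if not 1 <= compute_nodes <= DISTRIBUTION_COUNT:
        raise ValueError("compute_nodes must be between 1 and 60")
    layout = defaultdict(list)
    for shard in range(DISTRIBUTION_COUNT):
        layout[shard % compute_nodes].append(shard)
    return dict(layout)


if __name__ == "__main__":
    for nodes in (1, 6, 60):
        layout = shards_per_node(nodes)
        print(nodes, "node(s):", len(layout[0]), "shards on node 0")
    print("row with key 'user-42' lands in shard", distribution_for("user-42"))
```

Scaling compute up or down changes how many shards each node serves, not which shard a row lives in, which is why compute can be scaled independently of storage.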
Azure Data Factory automates the data movement between layers. With it, you can schedule an ADLA batch job for creating aggregate data files. You can import those files into SQLDW, and execute stored procedures too. Data Factory connects to many different endpoints for input and output, and can build structured workflows for moving data between them. Data Factory operationalizes these repeated activities.
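The kind of structured workflow described above can be modeled as activities with dependencies that run in order. The Python sketch below uses the standard library's `graphlib` to order three hypothetical activities. It does not use the Data Factory API, and the activity names are invented for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical activities; each maps to the set of activities it depends on.
PIPELINE = {
    "run_adla_aggregation": set(),
    "import_files_into_sqldw": {"run_adla_aggregation"},
    "execute_stored_procedure": {"import_files_into_sqldw"},
}


def run_pipeline(pipeline: dict, actions: dict) -> list:
    """Run each activity after its dependencies and return the execution order."""
    order = list(TopologicalSorter(pipeline).static_order())
    for name in order:
        actions[name]()
    return order


if __name__ == "__main__":
    log = []
    # Each placeholder action only records that it ran.
    actions = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
    print(run_pipeline(PIPELINE, actions))
```

A real pipeline would add scheduling, retries, and monitoring on top of this ordering, which is the operational work the service takes on.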
Azure offers Cloud Shell as an option for managing resources in Azure via the command line. You can access Cloud Shell from the Azure portal, or by connecting to https://shell.azure.com. With Azure Cloud Shell, you can run PowerShell or Bash commands to manage your Azure resources.
Azure analytics system architecture
Imagine your company wants to analyze user behavior on its main website to provide relevant suggestions for further reading, to promote user retention and higher page views. A solution could generate suggestions based on historical data, recent personalized actions, and machine learning algorithms. Further, the same system could also analyze error events in real time and provide alerts. You can build a system in Azure that does this analysis work. The rest of this article walks through use cases, technical tradeoffs, and design considerations when creating and operating each piece of a proposed analytics system that fulfills these functions. Before we dive deeply into each of the services, let’s take a look at the system as a whole. This architectural design uses the six Azure services discussed in this article.
- Event Hubs for real-time ingestion
- Stream Analytics for real-time query processing
- Data Lake Store for data retention and batch query processing support
- Data Lake Analytics for batch query processing
- Data Factory for batch scheduling and aggregate data movement
- SQL Data Warehouse for interactive queries
Walkthrough of processing a series of event data records
Figure 1 shows how all six services can be assembled into an analytics processing system to monitor error rates and provide “users also viewed” suggestions. In this system, incoming event data follows both a hot and a cold path into the user query engine. To illustrate how the event data flows, let’s trace a typical user action event through both paths. We can see how each path fulfills part of our imagined business requirements for this system.