The Modern Data Warehouse
The cloud requires to reconsider some of the choices made for on-premisses data handling. This module
introduces the concept of a data lake and the data lakehouse. It also introduces the different services in Azure
that can be used for data processing, and compares them to the
traditional on-premisses data stack. Finally, it provides a brief intro in Azure and the use of the Azure
- From traditional to modern data warehouse
- Comparing data warehouse with data lake
- Lambda architecture
- Overview of Big Data related Azure services
- Getting started with the Azure Portal
- LAB: Navigating the Azure Portal
Storing Data in Azure
This module discusses the different types of storage available in Azure Storage as well as data
lake storage. Also some of the tools to load and manage files in Azure storage and Data lake storage are
- Introduction Azure Blob Storage and Azure Data Lake Storage Gen2
- Comparison of Azure Blob Storage vs Azure Data Lake Storage Gen2
- Tools for uploading data
- Working with the Azure Storage Explorer and AzCopy
- LAB: Uploading data into Azure Storage
Introducing Azure Data Factory
When the data is stored and analyzed on-premises you typically use ETL tools such as SQL Server Integration Services for this.
But what if the data is stored in the Azure cloud? Then you can use Azure Data Factory, the cloud-based ETL service.
First we need to get used to the terminology, then we can start creating the proper objects in the portal.
- Data Factory V2 terminology
- Setup a Data Factory with GIT support
- Exploring the Data Factory portal
- Creating Linked Services and Datasets
- Copying data with the Data Factory wizard
- LAB: Migrating data with Data Factory Wizard
Authoring Pipelines in Azure Data Factory
This module dives into the process of building a Data Factory pipeline from scratch. The most common activities
are illustrated. The module also focusses on how
to work with variables and parameters to make the pipelines more dynamic.
- Adding activities to the pipeline
- Working with Expressions
- Variables and Parameters
- Debugging a pipeline
- LAB: Authoring and debugging an ADF pipeline
Creating Data Flows in Data Factory
With Data flows data can be transformed without the need to learn about another tool (such as Databricks or
Spark). Both Data flows as well as
the Power Query activity are covered.
- From ELT to ETL
- Creating Data Factory Data flows
- Exploring the Power Query activity
- LAB: Transforming data with a Data flow
Pipelines need an integration runtime to control where the code executes. This module provides an overview of the 3
types of Integration Runtimes: Azure, self-hosted runtimes and SSIS.
It also discusses the different type of Triggers that exist and how they can be used to schedule pipelines.
- Integration Runtime Overview
- The Azure Integration Runtime
- The Self-Hosted Integration Runtime
- Scheduling Pipelines using Triggers
- Monitoring pipeline executions
Azure SQL Database
An easy way to create a business intelligence solution in the cloud is by taking SQL Server -- familiar to
BI developers -- and run it in the cloud. Backup and high availability happen automatically, and we can use
all the skills and tools we used on a local SQL Server on this cloud based solution as well.
- Provisioning an Azure SQL Database
- Migrating an on-premises Data Warehouse to Azure SQL Database
- Ingesting Azure Blob Storage data
- Working with Columnstore Indexes
- LAB: Using Azure SQL Databases
Using the Serverless SQL pool in Azure Synapse Analytics
Once data has been loaded into the data lake, the next step is to cleanse the data, pre-aggregate the data and
perform other steps to make the data accessible to reporting and analytical tools.
Dependant on the transformations required and the skills of the data engineer, the SQL dialect common to the
Microsoft data stack (T-SQL) could play an important role.
This module first introduces the scenarios where the move from an Azure SQL Database into Synapse databases
could be useful, introduces briefly the two different types of SQL databases, and then focusses more deeply on
the Synapse Analytics Serverless databases.
- When Azure SQL Databases reach their limits
- Provisioned versus Serverless Synapse Analytics databases
- Creating and accessing Serverless databases
- Using OPENROWSET for data access
- Creating external tables
- LAB: Querying data via Azure Synapse Analytics Serverless databases
Using Azure Synapse Analytics Provisioned SQL Pools
Since Serverless SQL Pools don't store data in a proprietary format, they lack features such as indexes, update
statements etc. This is where
Provisioned SQL Pools in Azure Synapse Analytics (formerly known as Azure Data
Warehouse) can come to the rescue.
- Architecture of Provisioned SQL Pools
- Loading data via PolyBase
- CTAS and CETAS
- Setting up table distributions
- LAB: Loading and querying data in Provisioned SQL Pools
Getting Started with Spark
Although SQL is a very powerful language to access and manipulate data, it has its limitations. Complex data
wrangling, advanced statistics or machine learning are ill-suited tasks for SQL.
For this purpose Apache Spark is better suited. It's a divide-and-conquer framework for data access, transformation and
querying which relies on programming languages such as Python and Scala.
Spark can be used in Synapse Analytics as well as in Azure Databricks, a popular service which integrates with
Synapse Analytics pipelines as well.
- Introduction to Apache Spark
- Spark Cluster setup and configuration
- Getting started with Notebooks
- Introduction to Apache Spark on Databricks
- LAB: Getting started with Spark and Notebooks
Accessing Data in Synapse Analytics Spark
Apache Spark doesn't have a proprietary data storage option, but consumes and produces regular files stored in Azure
Storage. This module covers how to access and manipulate data stored in the Synapse Analytics data lake or other
Azure storage locations from Synapse Analytics Spark or DataBricks.
- The Spark Session and Context objects
- Connecting to Azure Blob Storage and Azure Data Lake Storage Gen2
- Connecting to relation databases through JDBC or ODBC
- Loading and saving data in Spark using DataFrames
- An introduction to transforming data using PySpark
- Using Spark SQL to query and transform data
- LAB: Processing data on a Spark cluster
Introducing Delta Lake
Delta Lake is an optimized storage layer that provides the foundation for storing data and tables in a Lakehouse Platform. Delta lake
is an open source platform that extends Parquet files with ACID transactions and metadata handling.
This chapter provides an introduction to Delta Lake and how it can be used to create a Lakehouse architecture.
- Introducing Delta Lake
- Creating Delta Tables
- Modifying data in a Delta Table
- Time Travel with Delta Tables
- Creating a Lake Database/Lakehouse in Azure Synapse Analytics
- Querying a Lake Database/Lakehouse using the Serverless SQL Pool
The Role of the Power BI Service
The Power BI Service (or the Analysis Services engine directly) plays an important role in the modern data
warehouse solution. This module describes briefly the Power BI Service architecture and how it integrates with
Azure Synapse Analytics.
- Overview of Power BI
- Power BI/Analysis Services in the modern data warehouse architecture
- Link Synapse Analytics with Power BI
Azure Data Explorer
In between large volumes of historical, long lived data stored in a data lake,
and streams of short living events processed with Azure Stream Analytics, lives the challenge of
working with large volumes of semi-structured telemetry and log data, where the analysis can have a
longer latency that with event processing, but requires more historical information than what event processing
technology can handle.
For this kind of data processing Azure Data Explorer is the ideal tool
- Data Explorer architecture
- Ingesting data in Data Explorer
- Querying and visualizing data with the Kusto query language
- Accessing Data Explorer from Data Factory and Power BI