Introduction to Big Data
Before we can really get started, we must first discuss what we mean by Big Data, and why we want to use Big Data technology. In this chapter we briefly introduce each of the Azure technologies covered in this course.
- What is Big Data?
- Overview of Microsoft Azure
- The Azure Management Portals
- Key Azure services
Storing your data in Azure Storage
Azure Storage is a kind of file share that can be used by many of the Azure services. Often the output of one Azure service is stored in Azure Storage before being consumed by another component. In this module you will learn about the different types of storage available in Azure Storage. You will also become familiar with some of the tools to load and manage files in Azure Storage; a short code sketch follows the topic list below.
- Microsoft Azure Storage Concepts: Storage accounts and Containers
- Azure blob storage
- Tools for storing data in Azure Storage
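As a first taste of the tooling, here is a minimal sketch of uploading a file with the azure-storage-blob Python package; the connection string, container name and file names are placeholders you would replace with your own.

```python
# Minimal sketch: uploading a local file to Azure Blob storage with the
# azure-storage-blob package (pip install azure-storage-blob).
# Connection string, container and file names are placeholders.
from azure.storage.blob import BlobServiceClient

conn_str = "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net"
service = BlobServiceClient.from_connection_string(conn_str)

# A container groups blobs inside a storage account.
container = service.get_container_client("sampledata")

# Upload a local CSV file as a block blob.
with open("sales.csv", "rb") as data:
    container.upload_blob(name="raw/sales.csv", data=data, overwrite=True)

# List what is in the container to verify the upload.
for blob in container.list_blobs():
    print(blob.name)
```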
Azure SQL Database
An easy way to create a business intelligence solution in the cloud is by taking SQL Server -- familiar to BI developers -- and running it in the cloud. Backups and high availability happen automatically, and we can use all the skills and tools we used on a local SQL Server on this cloud-based solution as well.
- Azure SQL Database feature set
- Basic, Standard, Premium and Premium RS tiers
- Comparing performance: DTUs, transaction rates and benchmarks
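To illustrate how familiar this is, here is a minimal sketch of querying an Azure SQL Database from Python with pyodbc, exactly as you would query an on-premises SQL Server; the server, database and credentials are placeholders.

```python
# Minimal sketch: connecting to an Azure SQL Database with pyodbc.
# Server, database, user and password are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=SalesDb;Uid=bi_reader;Pwd=...;"
    "Encrypt=yes;TrustServerCertificate=no;"
)
cursor = conn.cursor()

# Query the catalog views just like on a local SQL Server.
for row in cursor.execute("SELECT TOP 5 name, create_date FROM sys.tables"):
    print(row.name, row.create_date)

conn.close()
```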
Azure SQL Data Warehouse
Azure SQL Databases are limited in compute power, since they run on a single machine, and their size is limited to 4 TB per database. Azure SQL Data Warehouse is a service aimed at analytical workloads on data volumes hundreds of times larger than what Azure SQL Databases can handle. Yet at the same time we can keep on using the familiar T-SQL query language, and we can connect traditional applications such as Management Studio to interact with this service. Moreover, storage and compute can be scaled independently. A sketch of a distributed table definition follows the topic list below.
- What is Azure SQL Data Warehouse?
- Creating and distributing tables
- Loading data via external tables and PolyBase
- Elasticity versus Performance tier
- Monitoring and performance tuning
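As a taste of the table-distribution topic, here is a minimal sketch that creates a hash-distributed table over pyodbc. The connection details and table definition are placeholders; the DDL itself uses standard Azure SQL Data Warehouse syntax.

```python
# Minimal sketch: creating a hash-distributed table in Azure SQL Data
# Warehouse. Connection details and the table definition are placeholders.
import pyodbc

ddl = """
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT          NOT NULL,
    CustomerKey INT             NOT NULL,
    Amount      DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),  -- spread rows over the 60 distributions
    CLUSTERED COLUMNSTORE INDEX        -- default storage for analytical workloads
);
"""

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:mydwserver.database.windows.net,1433;"
    "Database=MyWarehouse;Uid=loader;Pwd=...;Encrypt=yes;"
)
conn.cursor().execute(ddl)
conn.commit()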
Azure Analysis Services
Analysis Services is Microsoft's OLAP (cube) technology. The latest version, Analysis Services Tabular, can also run as a database-as-a-service. This makes it ideal for loading the cleaned, pre-processed data produced by other Cortana Intelligence components and caching it, which leads to faster reporting. The data can also be enriched with KPIs, translations, derived measures, etc. In this module we take a brief look at how an Analysis Services model can be created and deployed to the cloud; for a more in-depth discussion we refer to the Analysis Services Tabular training.
- Creating a cloud based Analysis Server
- Deploying Power BI models
- Deploying from Visual Studio
Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA) are like bread and butter: you can use them separately, but they are often used together. Azure Data Lake Store is comparable to Azure Storage, but it has a few features which make it better suited for Big Data projects. This makes it ideal for setting up a data lake. But to turn the 'raw' data in our data lake into something 'pure' and consumable, we need to apply some cleansing and/or analytics to it. And that is where Azure Data Lake Analytics comes into play. Using a 'unified SQL' language (U-SQL), it allows us to mix the relational language SQL with the object-oriented C# language to convert raw data into analysis results.
- What is a data lake?
- Setup Azure Data Lake Storage
- Loading data
- Setup Azure Data Lake Analytics
- Getting started with U-SQL
- EXTRACT, SELECT, INSERT and OUTPUT
- U-SQL projects in Visual Studio
- Running U-SQL jobs locally
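Below is a minimal sketch of a U-SQL script, kept in a Python string so it can be written to a .usql file and submitted as a Data Lake Analytics job (from the portal or from Visual Studio, as covered in this module). The paths, schema and file names are placeholders.

```python
# Minimal sketch of a U-SQL script: EXTRACT raw data, transform it with
# SELECT, and OUTPUT the result back to the data lake.
usql_script = r"""
// Read a CSV file from the data lake into a rowset.
@sales =
    EXTRACT Region string,
            Amount decimal
    FROM "/raw/sales.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Aggregate with familiar SQL constructs; C# expressions are allowed too.
@totals =
    SELECT Region,
           SUM(Amount) AS TotalAmount
    FROM @sales
    GROUP BY Region;

// Write the result back to the data lake.
OUTPUT @totals
TO "/curated/sales_totals.csv"
USING Outputters.Csv(outputHeader: true);
"""

with open("sales_totals.usql", "w") as f:
    f.write(usql_script)
```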
Azure Cosmos DB
Cosmos DB is a NoSQL solution with a schema-on-read approach based on JSON. It is an evolution of the former DocumentDB database. It supports many APIs, such that you can treat it as a MongoDB, Cassandra or graph database, etc. Very flexible for application developers, and a great source of BI data!
- What is Cosmos DB
- Setting up a database
- Request Units
- Tools: emulator and data migration tool
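Here is a minimal sketch of storing and querying JSON documents through the SQL API with the azure-cosmos Python package; the endpoint, key, database, container and partition key are placeholders.

```python
# Minimal sketch: storing and querying JSON documents in Cosmos DB
# via the SQL API (pip install azure-cosmos).
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/",
                      credential="<key>")
db = client.create_database_if_not_exists("SalesDb")
container = db.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
)

# Schema-on-read: just store the JSON document as-is.
container.upsert_item({"id": "1", "customerId": "c42", "amount": 19.95})

# Query with a familiar SQL dialect.
for item in container.query_items(
    "SELECT c.id, c.amount FROM c WHERE c.customerId = 'c42'",
    enable_cross_partition_query=True,
):
    print(item)
```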
Azure Data Catalog
How do you find all the relevant data that your business stores, spread over the sometimes hundreds of databases and reports? To help you in this task, you need a database of databases, which stores only metadata such as table and column names, descriptions, etc. This is exactly what the Azure Data Catalog is all about, and in this module you learn how to create, fill and query this catalog.
- What is a data catalog
- Creating an Azure Data Catalog
- The Azure Data Catalog portal
- Collecting and uploading meta-data
Azure Data Factory
Not only do we want to store data and run analyses on it, we also need a scheduler to move our data to the proper services and then run the relevant analyses on top of it. When the data is stored and analyzed on premises, we use ETL tools such as SQL Server Integration Services for this. But what if the data is stored in the cloud? Then we need Azure Data Factory, the cloud-based ETL service. First we need to get used to the terminology; then we can start creating the proper objects in the portal, using the wizard, or in Visual Studio.
- Introducing Data Factories
- Creating linked services and data sets
- Combining activities into pipelines
- Build a complete flow with the wizard
- Using Visual Studio to create or modify data factories
- Monitoring and managing data factories
- Data Factory V2 improvements
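To give a feel for the terminology, here is a simplified sketch of the shape of a Data Factory (V2) copy pipeline, built as a Python dict and printed as JSON. The dataset and pipeline names are placeholders, and the JSON is intentionally reduced; in practice these objects are created via the portal wizard, Visual Studio or an SDK.

```python
# Simplified sketch of a Data Factory copy-pipeline definition.
# Names are placeholders; the real schema has more properties.
import json

pipeline = {
    "name": "CopySalesPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                # Datasets point to the data; linked services (referenced
                # by the datasets) point to the stores.
                "inputs": [{"referenceName": "BlobSalesDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SqlSalesDataset",
                             "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "BlobSource"},
                    "sink": {"type": "SqlSink"},
                },
            }
        ]
    },
}
print(json.dumps(pipeline, indent=2))
```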
Azure Event Hubs
All the topics covered so far mainly focus on analyzing data at rest. But what if you want to analyze a never-ending stream of incoming events, such as in Internet of Things (IoT) applications? In this module we focus on buffering and timestamping streams of incoming events; a small producer sketch follows the topic list below. The next module is on Azure Stream Analytics and shows how to process these streams of events in an easy way: Microsoft extended the T-SQL language with a few temporal concepts such as sliding windows, with which we can develop an event processing application in a matter of minutes.
- Collecting streams of events
- Setup Azure Service Bus and Event Hubs
- Managing Event Hubs
- Consumer groups
- Sending and consuming events
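A minimal sketch of sending a batch of JSON events with the azure-eventhub Python package (v5 API); the connection string, hub name and event payloads are placeholders.

```python
# Minimal sketch: sending a batch of JSON events to an event hub
# (pip install azure-eventhub). Connection details are placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "Endpoint=sb://mynamespace.servicebus.windows.net/;"
    "SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="sensor-readings",
)

with producer:
    batch = producer.create_batch()
    for reading in ({"deviceId": "d1", "temperature": 21.5},
                    {"deviceId": "d2", "temperature": 19.8}):
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```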
Azure Stream Analytics
- Real-time analytics and event handling
- Create Azure Stream Analytics jobs
- Configure security
- Connecting inputs and outputs
- Writing Stream Analytics queries
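Here is a minimal sketch of such a temporal query, held in a Python string for readability: familiar-looking T-SQL extended with a tumbling window. The input and output aliases and the field names are placeholders; in practice the query is pasted into the job in the portal.

```python
# Minimal sketch of a Stream Analytics query: average the temperature
# per device over 30-second tumbling windows. Aliases are placeholders.
asa_query = """
SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    System.Timestamp AS windowEnd
INTO [sqloutput]
FROM [eventhubinput] TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(second, 30)
"""
print(asa_query)
```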
Azure HDInsight
For many people Big Data processing is synonymous with Hadoop. This open-source big data ecosystem is very popular, and it is part of the Azure stack under the name HDInsight. In this module we mainly focus on how to set up HDInsight and the data storage options, and we illustrate the more popular Hadoop frameworks such as Hive, Pig and Spark. Since Hadoop is a big collection of complex tools, don't expect to become an expert in each of these. If you're new to Hadoop, this module gives you enough of an overview to know what is possible. If you are a data scientist with Hadoop experience, it shows you enough to know how to get started on the Azure stack. A small Spark sketch follows the topic list below.
- Setting up an HDInsight cluster
- Tools for loading data
- Map-Reduce and YARN
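As a small taste of Spark on HDInsight, here is the classic word count in PySpark, which you could run from a notebook on the cluster; the input path is a placeholder for a file in the cluster's storage account.

```python
# Minimal sketch: word count with PySpark on an HDInsight Spark cluster.
# The input path is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.textFile("wasb:///example/data/sample.txt")
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, n in counts.take(10):
    print(word, n)
```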
Azure Databricks
Databricks is an optimized and easier-to-use variant of Spark. In this chapter we take a brief look at the setup and use of Azure Databricks; a sample notebook cell follows the topic list below.
- Setup of Azure Databricks
- Working with notebooks
- Creating jobs
- Creating Databricks dashboards
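A minimal sketch of a Databricks notebook cell: read a CSV file from mounted storage and aggregate it. The path and column names are placeholders, and `spark` and `display` are predefined inside a Databricks notebook.

```python
# Minimal sketch of a Databricks notebook cell. In a notebook the
# SparkSession `spark` and the `display` helper already exist;
# the mount path and column names are placeholders.
df = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("/mnt/datalake/raw/sales.csv"))

totals = df.groupBy("region").sum("amount")

display(totals)  # render the result as a table or chart in the notebook
```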
Azure Cognitive Services
Analyzing nicely formatted tables is easy, but what if the data at hand are scanned invoices, security camera footage, etc.? With Azure Cognitive Services we can convert difficult data formats such as photo, audio or video into a structured representation, which can then be used further in the remainder of the analysis.
- What are cognitive services
- Overview of the cognitive services
- Customizable versus non-customizable cognitive services
- Configuring LUIS for language understanding
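As an illustration, here is a minimal sketch of calling the Computer Vision REST API to describe a photo; the region, API version, image URL and subscription key are placeholders.

```python
# Minimal sketch: asking the Computer Vision REST API to describe an image.
# Region, API version, key and image URL are placeholders.
import requests

endpoint = "https://westeurope.api.cognitive.microsoft.com/vision/v2.0/analyze"
params = {"visualFeatures": "Description,Tags"}
headers = {"Ocp-Apim-Subscription-Key": "<your-key>",
           "Content-Type": "application/json"}
body = {"url": "https://example.com/invoice-photo.jpg"}

response = requests.post(endpoint, params=params, headers=headers, json=body)
analysis = response.json()

# Print the most likely caption of the image.
print(analysis["description"]["captions"][0]["text"])
```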
Azure Machine Learning
Just remembering a bunch of things doesn't make somebody smart; it is the skill to learn from 'old knowledge' and apply it to unseen situations that makes somebody smart. That's exactly the purpose of machine learning. With Azure Machine Learning, Microsoft created a framework that is easy enough for non-programmers to build machine learning models in a GUI. But machine learning experts can use their Python or R skills as well to do very advanced things in Azure Machine Learning that go beyond the scope of the GUI. Another great feature of Azure Machine Learning is deployment: once you have trained the right model, with a few clicks (and zero coding!) you create a web service such that you can call your model from nearly any application! A sample call follows the topic list below.
- Getting started in ML Studio
- Accessing data sets
- Using R and Python scripts
- Exploring the different modeling techniques for classification, regression and clustering
- Model training
- Scoring datasets
- Evaluate the models
- Create a scoring experiment
- Create and configure web service
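Here is a minimal sketch of calling a deployed model from Python; the URL, API key and input schema are placeholders, since the exact JSON shape depends on the experiment you published.

```python
# Minimal sketch: calling a model published as an Azure ML web service.
# URL, key and the input schema are placeholders.
import requests

url = "https://<region>.services.azureml.net/workspaces/<ws>/services/<id>/execute"
headers = {"Authorization": "Bearer <api-key>",
           "Content-Type": "application/json"}
payload = {
    "Inputs": {
        "input1": [{"age": 42, "income": 55000}]  # one row to score
    }
}

response = requests.post(url, headers=headers, json=payload)
print(response.json())
```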
Course overview
Microsoft's Big Data solution is a collection of Azure services to load, store and analyze large volumes of data in the cloud. Although each of these services can be used independently from one another, they will often be used together to process data in the cloud.
First we investigate how data can be stored. We look into the traditional solutions using Azure Storage and Azure SQL Databases. But we also investigate newer technologies such as Azure SQL Data Warehouse and Data Lake Store for dealing with large and very large volumes of data. For data that is less structured, the NoSQL Cosmos DB can be used.
The next step is analyzing the data. We discuss HDInsight with its traditional Hadoop technologies such as Hive and Spark, but we also touch upon Azure Data Lake Analytics, which introduces U-SQL as its new data query language. Machine Learning is crucial for more advanced analysis on large volumes of data. Azure Stream Analytics is discussed as well, to analyze streams of events (together with Event Hubs to capture large volumes of incoming events). Databricks is introduced too, as an easier-to-use Spark alternative.
We must also pay attention to how data can be loaded into cloud storage in an automated fashion, using Azure Data Factory. Finally we take a brief look at how the results of these analyses can be used in Power BI as a reporting tool. Azure Analysis Services also comes into the picture, as we often need it as a fast and user-friendly cache of the data.
All these technologies are introduced and demonstrated, and participants will also do hands-on labs on each of these technologies.