Big Data Solutions on the Microsoft Azure Platform

5 days
Training code: UADATA

The modern data warehouse

The cloud requires us to reconsider some of the choices made for on-premises data handling. This module introduces the different Azure services that can be used for data processing and compares them to the traditional on-premises data stack.

  • From traditional to modern data warehouse
  • Lambda architecture
  • Overview of Big Data related Azure services
  • Getting started with Azure

Staging data in Azure

This module discusses the different types of storage available in Azure Storage as well as Data Lake Storage. It also covers some of the tools used to load and manage files in Azure Storage and Data Lake Storage.

  • Introduction to Azure Blob Storage
  • Compare Azure Data Lake Storage Gen 2 with traditional blob storage
  • Tools for uploading data
  • Storage Explorer, AzCopy, ADLCopy, PolyBase
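
As a small illustration, the following sketch uploads a local file to blob storage with the azure-storage-blob Python SDK (v12); the connection string, container and file names are placeholders.

    # Minimal sketch: uploading a local file to Azure Blob Storage with the
    # azure-storage-blob Python SDK. All names are placeholders.
    from azure.storage.blob import BlobServiceClient

    conn_str = "<storage-account-connection-string>"  # taken from the portal
    service = BlobServiceClient.from_connection_string(conn_str)
    blob = service.get_blob_client(container="staging", blob="sales/2019/sales.csv")

    with open("sales.csv", "rb") as data:
        blob.upload_blob(data, overwrite=True)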

Using Azure Data Factory for ETL

When data is stored and analyzed on-premises, we typically use ETL tools such as SQL Server Integration Services. But what if the data is stored in the cloud? Then we need Azure Data Factory, the cloud-based ETL service. First we need to get used to the terminology; then we can start creating the proper objects in the portal.

  • Data Factory V2 terminology
  • The Data Factory wizard
  • Developing Data Factory pipelines in the browser
  • Creating Data Factory Data flows
  • Setup of Integration Runtimes
  • Debugging, scheduling and monitoring DF pipelines
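
As a taste of automating Data Factory outside the portal, here is a hedged sketch that starts and monitors a pipeline run with the azure-mgmt-datafactory Python SDK; subscription, resource group, factory and pipeline names are placeholders, and a recent SDK version together with azure-identity is assumed.

    # Minimal sketch: starting and monitoring a Data Factory V2 pipeline run
    # from Python. All names below are placeholders for illustration.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Kick off a run of an existing pipeline, then poll its status
    run = adf.pipelines.create_run("<resource-group>", "<factory>", "CopySalesPipeline")
    status = adf.pipeline_runs.get("<resource-group>", "<factory>", run.run_id)
    print(status.status)   # e.g. InProgress, Succeeded, Failed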

Azure Data Warehouse

Azure SQL Databases have their limitations in compute power, since they run on a single machine, and their size is limited to the terabyte range. Azure Data Warehouse is a service aimed at analytical workloads on data volumes hundreds of times larger than what Azure SQL databases can handle. Yet we can keep using the familiar T-SQL query language, or connect traditional applications such as Excel and Management Studio to interact with this service. Storage and compute can be scaled independently.

  • Architecture of Azure Data Warehouse
  • Loading data via PolyBase
  • CTAS and CETAS
  • Setting up table distributions
  • Indexing
  • Partitioning
  • Performance monitoring and tuning
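
To give an idea of the CTAS pattern, the following sketch runs a CTAS statement from Python over pyodbc, creating a hash-distributed columnstore table from a PolyBase external table; server, credential and table names are placeholders.

    # Minimal sketch: running a CTAS statement against the data warehouse
    # with pyodbc. Connection details and table names are placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=<server>.database.windows.net;DATABASE=<dwh>;"
        "UID=<user>;PWD=<password>"
    )
    conn.autocommit = True  # run the DDL outside a user transaction

    conn.cursor().execute("""
        CREATE TABLE dbo.FactSales
        WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
        AS SELECT * FROM ext.StagedSales;  -- external (PolyBase) table as source
    """)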

Advanced data processing with Databricks

Azure Databricks allows us to use the power of Spark without the configuration hassle of Hadoop clusters. Using popular languages such as Python, SQL and R, data can be loaded, visualized, transformed and analyzed via interactive notebooks.

  • Introduction to Azure Databricks
  • Cluster setup
  • Databricks Notebooks
  • Connecting to Azure Storage and Data Warehouse
  • Processing Spark Dataframes in Python
  • Using Spark SQL
  • Scheduling Databricks jobs
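
A minimal notebook cell might look like the sketch below, which reads CSV files from mounted storage into a Spark DataFrame and aggregates them with PySpark; the mount point and column names are placeholders (spark and display are Databricks notebook built-ins).

    # Minimal sketch of a Databricks notebook cell: load, aggregate, store.
    df = spark.read.csv("/mnt/staging/sales/*.csv", header=True, inferSchema=True)

    by_country = (df.groupBy("Country")
                    .sum("Amount")
                    .withColumnRenamed("sum(Amount)", "TotalAmount"))

    display(by_country)   # built-in Databricks visualization
    by_country.write.mode("overwrite").parquet("/mnt/curated/sales_by_country")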

Modeling data with Azure Analysis Services

Analysis Services is Microsoft's OLAP (cube) technology. The latest version, Analysis Services Tabular, can also run as a database-as-a-service. This makes it ideal for loading and caching the cleaned, pre-processed data produced by other Azure services, which leads to faster reporting. The data can also be enriched with KPIs, translations, derived measures, etc. In this module we take a brief look at how an Analysis Services model can be created and deployed to the cloud; for a more in-depth discussion we refer to the Analysis Services Tabular training.

  • Online Analytical Processing
  • Analysis Services Tabular
  • Creating a model on top of Azure Storage or Azure Data Warehouse
  • Model deployment
  • Processing
  • Model management

Getting started with Python

This training has no Python prerequisites, so the first module of the data science part introduces the basics of Python.

  • Introducing the Python programming language
  • Python environments
  • Interactive development with Azure notebooks
  • Variables and objects
  • Common data structures: Lists, tuples, sets and dictionaries
  • Functions
  • Creating and using classes
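
A small example of the basics covered here:

    # A few of the Python basics covered in this module: common data
    # structures, a function with a default argument, and a small class.
    prices = [3.5, 7.25, 1.0]                    # list
    product = {"name": "coffee", "price": 3.5}   # dictionary

    def total(items, vat=0.21):
        """Sum a list of prices and add VAT."""
        return sum(items) * (1 + vat)

    class Product:
        def __init__(self, name, price):
            self.name = name
            self.price = price

        def with_vat(self, vat=0.21):
            return self.price * (1 + vat)

    print(total(prices))                      # 14.2175
    print(Product("coffee", 3.5).with_vat())  # 4.235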

Data processing with SciPy

In data science it's crucial to deal with tables: loading, manipulating, data quality checks, … DataFrames can help out with that, and in this module the two most important Python packages for data manipulation are inspected: NumPy and Pandas.

  • Numerical Python: NumPy
  • NumPy data structures
  • Pandas DataFrames
  • Loading data with Pandas
  • Data manipulations with Pandas
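
A minimal sketch of this workflow, with made-up file and column names:

    # Minimal Pandas sketch: load a CSV file, run a quick quality check,
    # derive a column and aggregate. File and column names are placeholders.
    import pandas as pd

    df = pd.read_csv("sales.csv")
    print(df.head())          # inspect the first rows
    print(df.isna().sum())    # data quality: missing values per column

    df["Amount"] = df["Quantity"] * df["UnitPrice"]   # derived column
    print(df.groupby("Country")["Amount"].sum())      # aggregate per group

    values = df["Amount"].to_numpy()                  # the underlying NumPy array
    print(values.mean(), values.std())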

Data inspection

A picture is worth more than a thousand words. This holds in data science as well, so visualizing data is a crucial data science skill. Matplotlib is the most popular library for this, but there are additional libraries which build further upon it.

  • Introducing the matplotlib package
  • Using pyplot
  • Enriching plots: Title, axis and legend
  • Visualizing images
  • Additional visualization packages
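
A minimal pyplot example with made-up data:

    # Minimal matplotlib sketch: a labeled line plot built with pyplot,
    # enriched with a title, axis labels and a legend.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 135, 110, 150]

    plt.plot(months, sales, marker="o", label="2019")
    plt.title("Monthly sales")
    plt.xlabel("Month")
    plt.ylabel("Units sold")
    plt.legend()
    plt.show()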

Machine learning introduction

Before machine learning can be applied, its key concepts need to be discussed.

  • Which questions can machine learning answer?
  • Machine learning methodology
  • Data preparation
  • Classes of machine learning algorithms
  • Model evaluation

Machine Learning with scikit-learn

Many business problems can be tackled by basic machine learning techniques. In this module the most common machine learning approaches, such as linear regression and random forests, are implemented, and the resulting models are inspected.

  • Machine learning specific data preprocessing
  • Overview of the scikit-learn library
  • Classification using decision trees, logistic regression and support vector machines
  • Model tuning: working with hyper-parameters
  • Building regression models with linear regression, SVMs and neural networks
  • Unsupervised learning: Clustering
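
A minimal sketch combining several of these topics, using the iris dataset that ships with scikit-learn:

    # Minimal scikit-learn sketch: train/test split, a decision tree
    # classifier and hyper-parameter tuning via grid search.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    grid = GridSearchCV(DecisionTreeClassifier(),
                        param_grid={"max_depth": [2, 3, 5, None]},  # hyper-parameters
                        cv=5)
    grid.fit(X_train, y_train)

    print(grid.best_params_)
    print("test accuracy:", grid.score(X_test, y_test))  # model evaluation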

Azure Machine Learning Services

Machine learning on a local machine with a small dataset is one thing; running it on larger datasets or with more CPU-hungry techniques can become a challenge. Another problem is deploying your model: how can we easily call the resulting model from within other applications? Azure Machine Learning Services helps answer these questions.

  • Azure ML service overview
  • Creating an ML service workspace
  • Setting up computes and datastores
  • Creating and querying experiments
  • Deploying and using models
  • Creating and registering images
  • Deploy images as web services
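
As a first taste, the sketch below connects to a workspace and logs a metric in an experiment with the azureml-core SDK; it assumes a config.json downloaded from the portal, and the experiment name and metric value are placeholders.

    # Minimal sketch with the azureml-core SDK: connect to a workspace,
    # create an experiment and log a metric. Names are placeholders.
    from azureml.core import Experiment, Workspace

    ws = Workspace.from_config()                  # reads config.json
    exp = Experiment(workspace=ws, name="demo-experiment")

    run = exp.start_logging()
    run.log("accuracy", 0.93)                     # placeholder metric
    run.complete()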

Getting started with Deep Learning

Of all the machine learning techniques, there is one that is gaining popularity for more challenging problems: multiple layers of neural networks, better known as deep learning. For problems such as image recognition, speech understanding, etc. this is currently the way to go. But from a mathematical point of view it is a very challenging technique. In this module the basics of deep learning are introduced.

  • From Neural networks to Deep learning
  • Overview of deep learning frameworks
  • Getting started with the Keras framework
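
A minimal Keras sketch of such a network, with arbitrary layer sizes chosen purely for illustration, could look like this:

    # Minimal Keras sketch: a small feed-forward network for a
    # ten-class classification problem. Layer sizes are placeholders.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(784,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),   # one output per class
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()
    # model.fit(X_train, y_train, epochs=5)  # train once data is available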

Microsoft's Big Data solution is a collection of scalable Azure services to load, store and analyze data in the cloud. Although each of these services can be used independently of one another, they will often be used together to process data in the cloud. This course consists of 2 parts: 3 days of Data Engineering and 2 days of Data Science.

In the Data Engineering part the focus is on preparing data for reporting and analysis: how can data be loaded from on-premises into the cloud, what are the different storage options in the cloud, and how can data be transformed to simplify reporting and analysis?

In the Data Science part machine learning plays a vital role. Python is introduced step by step as the data science language. Then Python is combined with the Azure Machine Learning Service for building, deploying and running machine learning models in the Azure cloud.

This course focuses on developers, administrators and project managers who are developing new data-centric applications in the Microsoft Azure cloud. Some familiarity with relational database systems such as SQL Server is handy. Prior knowledge of Azure or Python is not required.

© 2019 U2U. All rights reserved.