Big Data processing is being democratised. Tools such as Azure Synapse Analytics and Azure Databricks mean you no longer need to be a Java expert to be a Big Data Engineer. Microsoft and Databricks have made your life easier! But while it is easier, there is still a lot to learn, and knowing where to start can be daunting.
On the first day we will introduce the Spark engines in Azure Databricks and Synapse Analytics, then discuss how to develop in-memory, elastically scaled data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline in Python, then in Scala, orchestrated via Azure Data Factory, and then see it in the context of a full solution. We will also talk about Data Lakes – how to structure and manage them over time in order to maintain an effective data platform.
On the second day we will shift gears, taking the data we prepared and enriching it with additional data sources before modelling it in a classic relational warehouse. We will look at various patterns of data engineering that cater for scenarios such as real-time streaming, de-centralised reporting, rapidly evolving data science labs and huge data warehouses in specialised storage such as Azure SQL Data Warehouse. By the end of the day, you will understand how Spark sits at the core of data engineering workloads and is a key component of both the Modern Azure Warehouse and the Data Lakehouse.
- An understanding of ETL or ELT processing, either on-premises or in a big data environment
- A basic level of Python will help, but is not required
- A laptop and an Azure subscription
Module 1: Intro to Big Data processing
- Engineering vs Data Science
- Getting set up
- An introduction to the skills required
- Introduction to Spark
- Exploring Azure Databricks and Synapse Analytics
Module 2: The languages
- The languages (Scala/Python/R/Java)
- Introduction to Scala
- Introduction to PySpark
- PySpark deep dive
- Working with the additional Spark APIs
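Whichever language you choose, the Spark APIs share the same functional, chained-transformation style. As a plain-Python analogue of that style (a conceptual sketch with made-up ride data, no Spark required):

```python
# Plain-Python analogue of the chained-transformation style used by
# Spark's DataFrame API (hypothetical ride data, no Spark required).
rows = [
    {"city": "Leeds", "fare": 12.50},
    {"city": "London", "fare": 30.00},
    {"city": "Leeds", "fare": 7.25},
]

# filter -> project -> aggregate: the same shape as a
# df.filter(...).select(...).agg(...) chain in PySpark or Scala.
leeds_fares = [r["fare"] for r in rows if r["city"] == "Leeds"]
total = sum(leeds_fares)

print(total)
```

The difference in Spark is that each step is lazy and distributed across the cluster, but the mental model of composing small transformations is the same.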
Module 3: Managing the Spark Engine
- Orchestrating Pipelines
- Troubleshooting Query Performance
- Source Controlling Notebooks
- Cluster Sizing
- Installing packages on a single cluster / all clusters
Module 4: Data Engineering
- Cloud ETL Patterns
- Design patterns
- Loading Data
- Schema Management
- Transforming Data
- Storing Data
- Managing Lakes
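The load, schema management, transform and store steps above can be sketched end-to-end in plain Python. This is a conceptual toy only – in the labs we do this with PySpark against a lake, and the data here is hypothetical:

```python
import csv
import io
import json

# Hypothetical raw extract: in the labs this would come from blob storage.
raw = "id,amount\n1,10.0\n2,bad\n3,5.5\n"

expected_schema = {"id": int, "amount": float}

def load(text):
    """Load rows from a CSV extract (stand-in for reading from a lake)."""
    return list(csv.DictReader(io.StringIO(text)))

def conform(rows, schema):
    """Schema management: cast each field, quarantining rows that fail."""
    good, bad = [], []
    for row in rows:
        try:
            good.append({k: cast(row[k]) for k, cast in schema.items()})
        except ValueError:
            bad.append(row)
    return good, bad

def transform(rows):
    """A simple business transformation: add a derived column."""
    return [{**r, "amount_pence": int(r["amount"] * 100)} for r in rows]

good, bad = conform(load(raw), expected_schema)
clean = transform(good)

# Store: serialise conformed rows as JSON lines (stand-in for Parquet/Delta).
output = "\n".join(json.dumps(r) for r in clean)
print(len(clean), len(bad))
```

The quarantine pattern shown in `conform` is the important design choice: bad rows are kept for inspection rather than silently dropped or allowed to break the load.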
Module 5: Delta Lake
- Introduction to Delta: what it is and how it works
- Data Lake management
- Problems with Hadoop based lakes
- Creating a Delta Table
- The Transaction Log
- Managing Schema change
- Time travelling
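Delta's key idea – an append-only transaction log from which any past version of a table can be reconstructed – can be illustrated in a few lines of plain Python. This is a conceptual toy, not the real Delta protocol:

```python
# Toy model of a Delta-style transaction log: each commit appends an
# entry, and reading "as of" a version replays the log up to that point.
log = []  # the transaction log: an ordered list of commits

def commit(action, rows):
    """Append a commit (version number = position in the log)."""
    log.append({"version": len(log), "action": action, "rows": rows})

def read_as_of(version):
    """Time travel: rebuild the table by replaying commits 0..version."""
    table = []
    for entry in log[: version + 1]:
        if entry["action"] == "append":
            table.extend(entry["rows"])
        elif entry["action"] == "overwrite":
            table = list(entry["rows"])
    return table

commit("append", [{"id": 1}])
commit("append", [{"id": 2}])
commit("overwrite", [{"id": 99}])

print(read_as_of(1))  # the table before the overwrite
print(read_as_of(2))  # the current state
```

Because nothing is ever mutated in place, earlier versions remain readable – which is exactly what makes time travel and auditability possible on a Delta table.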
Module 6: Bring it all together
- How this all fits into a wider architecture
- Projects we have worked on
- Managing in production
- Deploying with Azure DevOps
This course is designed to take a data professional from zero to hero in just two days. You will leave with all the skills you need to get started on your Big Data journey. You will learn by experimentation; this is a lab-heavy training session:
- Getting set up (Building a new instance, getting connected, creating a cluster)
- Creating all the required assets
- Running a notebook
- An introduction to the key packages we will be working with
- Cleaning data
- Transforming data
- Creating a notebook to move data from blob storage and clean it up
- Scheduling a notebook to run with Azure Data Factory
- Creating a streaming application
- Delta Lake
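The streaming lab builds on the same idea Spark Structured Streaming uses: treating a stream as a series of micro-batches over which state is incrementally updated. A plain-Python sketch of that pattern (hypothetical event data, no Spark involved):

```python
from collections import Counter

def micro_batches():
    """Stand-in for a streaming source, yielding one micro-batch at a time."""
    yield [{"city": "Leeds"}, {"city": "London"}]
    yield [{"city": "Leeds"}]

# Incrementally maintained state, like a streaming aggregation's result table.
counts = Counter()
for batch in micro_batches():
    counts.update(event["city"] for event in batch)

print(dict(counts))
```

In the lab the source is a real stream and Spark maintains the state for you, but the incremental-aggregation mental model is the same.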