Details
Big Data processing is being democratised. Tools such as Azure Synapse Analytics and Azure Databricks mean you no longer need to be a Java expert to be a Big Data Engineer. Microsoft and Databricks have made your life easier! While it is easier, there is still a lot to learn, and knowing where to start can be quite daunting.
On the first day we will introduce the Spark engines in Azure Databricks and Synapse Analytics, then discuss how to develop in-memory, elastically scaling data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python, then with Scala via Azure Data Factory, and then we will put it into context within a full solution. We will also talk about Data Lakes: how to structure and manage them over time in order to maintain an effective data platform.
On the second day we will shift gears, taking the data we prepared and enriching it with additional data sources before modelling it in a classic relational warehouse. We will look at various data engineering patterns that cater for scenarios such as real-time streaming, decentralised reporting, rapidly evolving data science labs and huge data warehouses in specialised storage such as Azure SQL Data Warehouse. By the end of the day, you will understand how Spark sits at the core of data engineering workloads and is a key component of both the Modern Azure Warehouse and the Data Lakehouse.
Prerequisites:
- An understanding of data processing, whether ETL or ELT, either on-premises or in a big data environment
- A basic level of Python will help, but is not required
- A laptop and an Azure subscription
Module 1: Intro to Big Data & The Lakehouse
- A Cloud Primer
- IaaS & PaaS
- Warehouse vs. Lakehouse
- Introduction to Spark
- Engineering vs. Data Science
- An introduction to the skills required
Module 2: Intro to Spark
- Notebooks
- Spark in Action
- The key components
- Physical Sparkitecture
- The languages (Scala/Python/R/Java)
- Synapse vs Databricks
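As a taster of what this module builds towards, here is a minimal sketch of a first notebook cell; the DataFrame contents are made up for illustration, and in Databricks or Synapse notebooks the `spark` session already exists, so `getOrCreate()` simply returns it.

```python
# In Databricks and Synapse notebooks a SparkSession is already available as
# `spark`; getOrCreate() returns the existing session (or creates one locally).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-to-spark").getOrCreate()

# A tiny DataFrame and a transformation, showing Spark's lazy execution model:
# nothing runs until an action such as show() is called.
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
people.filter(people.age > 30).show()
```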
Module 3: Reading data with Spark
- Data frames
- Our friend the CSV
- Working with SQL
- Inferring Schema
- Reading files
- Common File Formats
- Working with unknown data
- Data validation
- Data Storage Types
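A minimal sketch of the kind of reading covered here: a CSV loaded first with an inferred schema, then with an explicit one, and queried through the SQL API. The file path and column names are illustrative assumptions.

```python
# Illustrative path and columns; assumes the notebook-provided `spark` session.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Let Spark infer the schema (convenient, but an extra pass over the data).
inferred = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/mnt/raw/sales/sales.csv"))

# Or supply the schema explicitly for predictable types and basic validation.
schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("quantity", IntegerType(), True),
])
sales = (spark.read
    .option("header", "true")
    .schema(schema)
    .csv("/mnt/raw/sales/sales.csv"))

# Register a temporary view so the same data can be queried with Spark SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT order_id, SUM(quantity) AS total_qty FROM sales GROUP BY order_id").show()
```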
Module 4: PySpark Deep Dive & Transforming Data
- Which language should you use?
- PySpark
- Spark Constructs
- Complex data structures
- SQL API – Spark SQL
- Dataframe Joins
- Working with Dates
- User defined functions (UDFs)
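A minimal sketch of the transformations this module covers: a DataFrame join, date handling and a simple UDF. The tables and column names are made up for illustration.

```python
# Assumes the notebook-provided `spark` session; data is made up for illustration.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

orders = spark.createDataFrame(
    [(1, "2023-01-15", 101), (2, "2023-02-03", 102)],
    ["order_id", "order_date", "customer_id"])
customers = spark.createDataFrame(
    [(101, "Contoso"), (102, "Fabrikam")],
    ["customer_id", "customer_name"])

# DataFrame join plus date handling with built-in functions.
enriched = (orders
    .join(customers, on="customer_id", how="left")
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("order_month", F.date_format("order_date", "yyyy-MM")))

# A user defined function; prefer built-ins where possible, as UDFs are slower.
@F.udf(StringType())
def tidy_name(name):
    return name.strip().upper() if name else None

enriched.withColumn("customer_name", tidy_name("customer_name")).show()
```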
Module 5: Building Data Lakes
- Designing a Lake Structure
- Lake Zones and Folder structure
- Lake File Layouts
- Writing to a Data Lake
- Optimizations
- Controlling output files
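A minimal sketch of writing curated data out to a lake; the storage account, container and zone layout below are assumptions rather than a prescribed structure.

```python
# Assumes the notebook-provided `spark` session; path and layout are illustrative.
df = spark.createDataFrame(
    [(1, "2023-01", 12.50), (2, "2023-02", 99.99)],
    ["order_id", "order_month", "amount"])

(df
    .repartition(4)                     # control how many output files are written
    .write
    .mode("overwrite")
    .partitionBy("order_month")         # one folder per month in the curated zone
    .parquet("abfss://lake@examplestorage.dfs.core.windows.net/curated/sales/"))
```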
Module 6: Spark Engineering
- Cluster Sizing
- What happens during Transformations?
- Spark UI
- Framework Patterns
- Parameterized Notebooks
- Orchestrating Pipelines
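A minimal sketch of a parameterised notebook using Databricks widgets; the parameter names and paths are illustrative, and in Synapse you would pass values to the notebook activity instead.

```python
# dbutils is available inside Databricks notebooks; names and paths are illustrative.
dbutils.widgets.text("source_path", "/mnt/raw/sales/")
dbutils.widgets.text("load_date", "2023-01-01")

source_path = dbutils.widgets.get("source_path")
load_date = dbutils.widgets.get("load_date")

# An orchestrator such as Azure Data Factory supplies these values at run time,
# so one notebook can be reused across datasets and dates.
df = spark.read.parquet(f"{source_path}{load_date}/")
```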
Module 7: Delta Lake
- Introduction to Delta: what it is and how it works
- Data Lake management
- Problems with Hadoop based lakes
- Creating a Delta Table
- The Transaction Log
- Managing Schema change
- Time travelling
- Delta Management
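A minimal sketch of creating a Delta table and reading an earlier version of it; the path is illustrative and assumes an environment such as Databricks or Synapse where Delta Lake is available.

```python
# Assumes Delta Lake is available (it is bundled with Databricks and Synapse Spark).
df = spark.createDataFrame([(1, "open"), (2, "closed")], ["ticket_id", "status"])

# Writing as Delta creates the _delta_log transaction log alongside the data files.
df.write.format("delta").mode("overwrite").save("/mnt/lake/delta/tickets")

# Read the current version, then "time travel" back to an earlier one.
spark.read.format("delta").load("/mnt/lake/delta/tickets").show()
spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/delta/tickets").show()
```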
Module 8: Spark Streaming
- Streaming Concepts
- Delta Table Streams
- ETL without Watermarks
- Streaming Delta Upsert
- Monitoring Streaming
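A minimal sketch of streaming from one Delta table into another; the paths and checkpoint location are illustrative.

```python
# Assumes the Delta table from the previous sketch exists; paths are illustrative.
stream = (spark.readStream
    .format("delta")
    .load("/mnt/lake/delta/tickets"))

query = (stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/lake/checkpoints/tickets_stream")
    .outputMode("append")
    .start("/mnt/lake/delta/tickets_stream"))

# query.lastProgress and the Spark UI's Structured Streaming tab help with monitoring.
```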
Module 9: Bring it all together
- How this all fits into a wider architecture
- Projects we have worked on
- Managing in production
- Source Controlling Notebooks
- Deploying with Azure DevOps
This course is designed to take a data professional from Zero to Hero in just 2 days. You will leave with all the skills you need to get started on your Big Data journey. You will learn by experimentation; this is a lab-heavy training session:
- Getting set up (Building a new instance, getting connected, creating a cluster)
- Creating all the required assets.
- Running a notebook
- An introduction to the key packages we will be working with.
- Cleaning data
- Transforming data
- Creating a notebook to move data from Blob Storage and clean it up.
- Scheduling a notebook to run with Azure Data Factory
- Creating a streaming application
- Delta Lake