This event has passed

Data Engineering with Spark

Start
28 November 2022, 08:30
End
29 November 2022, 16:30
Address
Kanalvej 7, 2800 Kongens Lyngby

11.000 DKK

Details

Big Data processing is being democratised. Tools such as Azure Synapse Analytics and Azure Databricks mean you no longer need to be a Java expert to be a Big Data Engineer. Microsoft and Databricks have made your life easier! While it is easier, there is still a lot to learn, and knowing where to start can be quite daunting.

On the first day we will introduce the Spark engines in Azure Databricks and Synapse Analytics and then discuss how to develop in-memory, elastically scaling data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python, then with Scala via Azure Data Factory, and then we will put it into context in a full solution. We will also talk about Data Lakes: how to structure and manage them over time in order to maintain an effective data platform.
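
To give a flavour of the first day, here is a minimal PySpark sketch of the kind of cleaning step the labs build up to; the storage paths and column names are illustrative assumptions, not the actual lab material:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cleaning-pipeline").getOrCreate()

    # Read raw CSV files landed in the lake (hypothetical path)
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/mnt/lake/raw/sales/*.csv"))

    # Shape and clean: remove duplicates, tidy column names, fix types
    clean = (raw.dropDuplicates()
                .withColumnRenamed("Order Date", "order_date")
                .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
                .filter(F.col("order_date").isNotNull()))

    # Write the cleaned output back to the lake as Parquet
    clean.write.mode("overwrite").parquet("/mnt/lake/clean/sales")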

On the second day we will shift gears, taking the data we prepared and enriching it with additional data sources before modelling it in a classic relational warehouse. We will look at various patterns of performing data engineering to cater for scenarios such as real-time streaming, de-centralised reporting, rapidly evolving data science labs and huge data warehouses in specialised storage such as Azure SQL Data Warehouse. By the end of the day, you will understand how Spark sits at the core of data engineering workloads and is a key component in both modern Azure warehousing and the Data Lakehouse.
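
As a hedged sketch of the second-day enrichment and warehouse-load pattern (the JDBC connection details, paths and table names are placeholders, not course code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("warehouse-load").getOrCreate()

    sales = spark.read.parquet("/mnt/lake/clean/sales")          # prepared on day one
    customers = spark.read.parquet("/mnt/lake/clean/customers")  # additional source

    # Enrich: join the fact data to a customer dimension
    enriched = sales.join(customers, on="customer_id", how="left")

    # Load into a classic relational warehouse over JDBC
    (enriched.write
        .format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
        .option("dbtable", "dbo.FactSales")
        .option("user", "<user>")
        .option("password", "<password>")
        .mode("append")
        .save())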

Prerequisites:

  • An understanding of data processing, either ETL or ELT, either on-premises or in a big data environment
  • A basic level of Python will help, but is not required
  • A laptop and an Azure subscription

Module 1: Intro to Big Data & The Lakehouse

  • A Cloud Primer
  • IaaS & PaaS
  • Warehouse vs. Lakehouse
  • Introduction to Spark
  • Engineering vs. Data Science
  • An introduction to the skills required

Module 2: Intro to Spark

  • Notebooks
  • Spark in Action
  • The key components
  • Physical Sparkitecture
  • The languages (Scala/Python/R/Java)
  • Synapse vs Databricks
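
As a taste of what "Spark in Action" looks like in a notebook, here is a minimal PySpark cell of the kind Module 2 runs through; the sample data is made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("intro").getOrCreate()

    df = spark.createDataFrame(
        [("Copenhagen", 644000), ("Aarhus", 285000)],
        ["city", "population"])

    # The same question asked through the DataFrame API and the SQL API
    df.filter(df.population > 300000).show()

    df.createOrReplaceTempView("cities")
    spark.sql("SELECT city FROM cities ORDER BY population DESC").show()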

Module 3: Reading data with Spark

  • Data frames
  • Our friend the CSV
  • Working with SQL
  • Inferring Schema
  • Reading files
  • Common File Formats
  • Working with unknown data
  • Data validation
  • Data Storage Types
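
A small sketch of the Module 3 reading patterns, contrasting an inferred schema with an explicit one (the file path and columns are assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

    spark = SparkSession.builder.getOrCreate()

    # Option 1: let Spark infer the schema (convenient, but costs an extra pass over the file)
    inferred = (spark.read
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/mnt/lake/raw/orders.csv"))

    # Option 2: declare the schema explicitly (predictable types, easier validation of unknown data)
    schema = StructType([
        StructField("order_id", IntegerType(), nullable=False),
        StructField("customer", StringType(), nullable=True),
        StructField("order_date", DateType(), nullable=True),
    ])
    explicit = (spark.read
                .option("header", "true")
                .schema(schema)
                .csv("/mnt/lake/raw/orders.csv"))

    explicit.printSchema()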

Module 4: PySpark Deep Dive & Transforming Data

  • Which language should you use?
  • PySpark
  • Spark Constructs
  • Complex data structures
  • SQL API – Spark SQL
  • Dataframe Joins
  • Working with Dates
  • User defined functions (UDFs)

Module 5: Building Data Lakes

  • Designing a Lake Structure
  • Lake Zones and Folder structure
  • Lake File Layouts
  • Writing to a Data Lake
  • Optimizations
  • Controlling output files
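
A minimal sketch of the Module 5 write pattern: partitioning output into a curated lake zone while controlling the number of files (the zone paths are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    clean = spark.read.parquet("/mnt/lake/clean/sales")

    # Partition by a query-friendly column; repartition(8) caps the parallel write
    # tasks, so each order_year folder ends up with at most 8 files
    (clean.repartition(8)
          .write
          .mode("overwrite")
          .partitionBy("order_year")
          .parquet("/mnt/lake/curated/sales"))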

Module 6: Spark Engineering

  • Cluster Sizing
  • What happens during Transformations?
  • Spark UI
  • Framework Patterns
  • Parameterized Notebooks
  • Orchestrating Pipelines
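
A sketch of a parameterized notebook cell of the kind Module 6 covers, using Databricks widgets so an orchestrator such as Azure Data Factory can pass values at run time (widget names, paths and the load_date column are assumptions; dbutils and spark are provided by the notebook environment):

    # Widgets surface notebook parameters that Azure Data Factory (or a Databricks
    # job) can set at run time
    dbutils.widgets.text("load_date", "2022-11-28")
    dbutils.widgets.text("source_path", "/mnt/lake/raw/sales")

    load_date = dbutils.widgets.get("load_date")
    source_path = dbutils.widgets.get("source_path")

    df = spark.read.parquet(source_path).filter(f"load_date = '{load_date}'")
    df.write.mode("append").parquet(f"/mnt/lake/clean/sales/load_date={load_date}")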

Module 7: Delta Lake

  • Introduction to Delta: what it is and how it works
  • Data Lake management
  • Problems with Hadoop based lakes
  • Creating a Delta Table
  • The Transaction Log
  • Managing Schema change
  • Time travelling
  • Delta Management
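
A short Delta Lake sketch for Module 7: creating a Delta table, appending with an additive schema change, and reading an earlier version with time travel (paths are placeholders; requires the delta-spark package):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/mnt/lake/clean/sales")

    # Every write to a Delta table is recorded in its transaction log
    df.write.format("delta").mode("overwrite").save("/mnt/lake/delta/sales")

    # Append new data, allowing an additive schema change
    new_rows = spark.read.parquet("/mnt/lake/raw/sales_new")
    (new_rows.write.format("delta")
             .mode("append")
             .option("mergeSchema", "true")
             .save("/mnt/lake/delta/sales"))

    # Time travel: read the table as it looked at version 0
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/delta/sales")
    v0.show(5)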

Module 8: Spark Streaming

  • Streaming Concepts
  • Delta Table Streams
  • ETL without the Watermarks
  • Streaming Delta Upsert
  • Monitoring Streaming
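
A hedged sketch of the Module 8 streaming upsert pattern: reading a Delta table as a stream and merging each micro-batch into a target table (the paths and join key are assumptions):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    target = DeltaTable.forPath(spark, "/mnt/lake/delta/sales_current")

    # Merge each micro-batch into the target table (insert new keys, update existing ones)
    def upsert_batch(batch_df, batch_id):
        (target.alias("t")
               .merge(batch_df.alias("s"), "t.order_id = s.order_id")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    # Read the source Delta table as a stream and drive the upsert with foreachBatch
    (spark.readStream
          .format("delta")
          .load("/mnt/lake/delta/sales")
          .writeStream
          .foreachBatch(upsert_batch)
          .option("checkpointLocation", "/mnt/lake/checkpoints/sales_upsert")
          .start())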

Module 9: Bring it all together

  • How this all fits into a wider architecture
  • Projects we have worked on
  • Managing in production
  • Source Controlling Notebooks
  • Deploying with Azure DevOps

This course is designed to take a data professional from zero to hero in just two days. You will leave this course with all the skills you need to get started on your Big Data journey. You will learn by experimentation; this is a lab-heavy training session:

  • Getting set up (building a new instance, getting connected, creating a cluster)
  • Creating all the required assets
  • Running a notebook
  • An introduction to the key packages we will be working with
  • Cleaning data
  • Transforming data
  • Creating a notebook to move data from blob storage and clean it up
  • Scheduling a notebook to run with Azure Data Factory
  • Creating a streaming application
  • Delta Lake

Additional information

Length

2 days