Details
Big Data processing is being democratised. Tools such as Azure Databricks mean you no longer need to be a Java expert to be a Big Data Engineer. Databricks has made your life easier! Even so, there is still a lot to learn, and knowing where to start can be quite daunting.
On the first day we will introduce Azure Databricks, then discuss how to develop in-memory, elastic-scale data engineering pipelines. We will talk about shaping and cleaning data, the languages, notebooks, ways of working, design patterns and how to get the best performance. You will build an engineering pipeline with Python, then with Scala via Azure Data Factory, then we’ll put it into context in a full solution. We will also talk about Data Lakes – how to structure and manage them over time in order to maintain an effective data platform.
On the second day we will shift gears, taking the data we prepared and enriching it with additional data sources before modelling it in a relational warehouse. We will look at various data engineering patterns that cater for scenarios such as real-time streaming, de-centralised reporting, rapidly evolving data science labs and huge data warehouses in specialised storage such as Azure SQL Data Warehouse. By the end of the day, you will understand how Azure Databricks sits at the core of data engineering workloads and is a key component in Modern Azure Warehousing.
Prerequisites:
- An understanding of ETL or ELT processing, either on-premises or in a big data environment.
- A basic level of Python will help, but is not required.
- A laptop with access to an Azure subscription.
Module 1: Intro to Big Data processing
- Engineering vs Data Science
- Getting set up
- Exploring Azure Databricks
- Introduction to the skills required
- Introduction to Spark
- Introduction to Azure Databricks
Module 2: The languages
- The languages (Scala/Python/R/Java)
- Introduction to Scala
- Introduction to PySpark
- PySpark deep dive
- Working with the additional Spark APIs
Module 3: Managing Databricks
- Managing Secrets
- Orchestrating Pipelines
- Troubleshooting Query Performance
- Source Controlling Notebooks
- Cluster Sizing
- Installing packages on a single cluster or on all clusters
Module 4: Data Engineering
- Cloud ETL Patterns
- Design patterns
- Loading Data
- Schema Management
- Transforming Data
- Storing Data
- Managing Lakes
Module 5: Databricks Delta
- Introduction to Delta: what it is and how it works
- Data Lake management
- Problems with Hadoop-based lakes
- Creating a Delta Table
- The Transaction Log
- Managing Schema change
- Time travelling
Module 6: Bringing it all together
- How this all fits into a wider architecture
- Projects we have worked on
- Managing Databricks in production
- Deploying with Azure DevOps