Course

Spark, Hadoop, and Snowflake for Data Engineering

Duke University

Spark, Hadoop, and Snowflake for Data Engineering is a comprehensive course offered by Duke University. This course is designed for first- and second-year undergraduates, high school students, and professionals interested in programming, engineering, or science. It aims to equip learners with the skills to build efficient and scalable data pipelines, optimize data engineering with clustering and scaling, and implement DataOps and DevOps practices for continuous integration and deployment of data-driven applications.

The course delves into essential data engineering platforms such as Hadoop, Spark, and Snowflake, providing learners with the knowledge to optimize and manage these platforms. Additionally, learners will explore Databricks for executing data analytics and machine learning tasks, honing their Python data science skills with PySpark. The course also covers the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and teaches learners how to integrate it with Databricks.

Throughout the course, learners will strengthen their project management and workflow skills for data engineering by applying Kaizen, DevOps, and DataOps methodologies and best practices. Quizzes accompany the content to test learners' knowledge and prepare them to work as data engineers.

Certificate Available ✔


Spark, Hadoop, and Snowflake for Data Engineering

This course consists of four modules covering essential data engineering platforms such as Hadoop, Spark, and Snowflake, as well as methodologies for project management and workflow skills. Learners will gain hands-on experience with Databricks, PySpark, and MLflow, preparing them to become proficient data engineers.

Overview and Introduction to PySpark

The Overview and Introduction to PySpark module introduces essential big data platforms such as Hadoop and Spark. Learners will study concepts such as Resilient Distributed Datasets (RDDs), Spark SQL, and the PySpark DataFrame, gaining practical experience through demos and practice sessions.

Snowflake

The Snowflake module covers the fundamentals of Snowflake, including its architecture, layers, and web interface. Learners will explore creating and accessing tables, working with warehouses, and detailed views inside Snowflake, as well as utilizing Python connectors and Snowsight.

Azure Databricks and MLflow

The Azure Databricks and MLflow module covers accessing and using Databricks, exploring its features, and working with PySpark. Additionally, learners will study MLOps, explore the MLflow framework, and run end-to-end MLflow projects on Databricks.

DataOps and Operations Methodologies

The DataOps and Operations Methodologies module introduces learners to Kaizen, DevOps, and DataOps methodologies, along with continuous integration and deployment practices. Learners will explore GitHub Codespaces and SageMaker Studio Lab, and gain practical experience building NLP applications and microservices.
