Spark, Hadoop, and Snowflake for Data Engineering is a comprehensive course offered by Duke University. This course is designed for first- and second-year undergraduates, high school students, and professionals interested in programming, engineering, or science. It aims to equip learners with the skills to build efficient and scalable data pipelines, optimize data engineering with clustering and scaling, and implement DataOps and DevOps practices for continuous integration and deployment of data-driven applications.
The course delves into essential data engineering platforms such as Hadoop, Spark, and Snowflake, providing learners with the knowledge to optimize and manage these platforms. Additionally, learners will explore Databricks for executing data analytics and machine learning tasks, honing their Python data science skills with PySpark. The course also covers the key concepts of MLflow, an open-source platform for managing the end-to-end machine learning lifecycle, and teaches learners how to integrate it with Databricks.
Throughout the course, learners will build project management and workflow skills for data engineering by applying Kaizen, DevOps, and DataOps methodologies and best practices. Quizzes test learners' knowledge along the way, preparing them to become proficient data engineers ready to tackle the challenges of today's data-driven world.
Certificate Available ✔
This course consists of four modules covering essential data engineering platforms such as Hadoop, Spark, and Snowflake, as well as methodologies for project management and workflow skills. Learners will gain hands-on experience with Databricks, PySpark, and MLflow, preparing them to become proficient data engineers.
The Overview and Introduction to PySpark module provides a comprehensive introduction to essential big data platforms such as Hadoop and Spark. Learners will delve into concepts such as Resilient Distributed Datasets (RDDs), Spark SQL, and PySpark DataFrames, gaining practical experience through demos and practice sessions.
The Snowflake module covers the fundamentals of Snowflake, including its architecture, layers, and web interface. Learners will explore creating and accessing tables, working with warehouses, and detailed views inside Snowflake, as well as utilizing Python connectors and Snowsight.
The Azure Databricks and MLflow module provides insights into accessing and using Databricks, exploring its features, and working with PySpark. Additionally, learners will understand MLOps, explore the MLflow framework, and run end-to-end MLflow projects on Databricks.
The DataOps and Operations Methodologies module introduces learners to Kaizen, DevOps, and DataOps methodologies, along with continuous integration and deployment practices. Learners will explore GitHub Codespaces and SageMaker Studio Lab, and gain practical experience in building NLP applications and microservices.