Introduction
- Knowledge and Experience
- What are the topics we will cover?
Chapter 1 - The Theory
- What Is a Data Pipeline?
- Data Pipelines built with Passion and Creativity
- Storage and File Types
- Access
- Repeatable
- Resilient
- Scalable
- In Summary
Chapter 2 - Data Pipeline Basics
- Project Structure
- Data Pipeline Code Structure
- Code Readability and Organization
- Tests.
- Documentation
- Containerization
- Architecture First
- Review
Chapter 3 - Pipeline Architecture
- Architecture Applied to Data
- Data Size and Velocity
- Calculating Compute Requirements
- Calculating Storage Requirements
- Understanding the End Result
- Understanding Cost
- Code Architecture
- Batch vs Streaming Architecture
- Puzzle Pieces
- Summary
Chapter 4 - Storage
- Access Patterns
- SQL/NoSQL Databases vs Files.
- File Types
- Row vs Columnar Storage.
- Common file types in data engineering.
- Parquet.
- Avro.
- ORC.
- CSV / Flat-file.
- JSON
- Compression.
- Storage location.
- Partitions.
Chapter 5 - Compute and Resources
- Overview
- RAM/Memory
- CPU/Cores
- Storage
- Cluster/Nodes
Chapter 6 - Mastering SQL
- Introduction To SQL
- Does the type of database matter?
- The fundamentals of SQL/Databases.
- OLTP vs. OLAP
- Table design/layout.
- Table Design in Real Life.
- Understanding Indexing Basics.
- How to write fast/tune queries.
- Where to look for common problems.
- SQL Fundamentals
- Python + SQL
- SQL Summary
Chapter 7 - Data Warehousing / Data Lakes
- Data Warehouse vs Data Lake vs Lake House
- Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.
- Facts and Dimensions.
- Constraints and Schema.
- Data Types.
- Column Names.
- The Role of IDs in a Data Warehouse or Data Lake.
- CDC / History Tracking.
- Summary
Chapter 8 - Data Modeling
- Data Types and Schema.
- Data Types.
- Example
- Data Size.
- Constraints.
- Data Definitions.
- Modeling Data Logically.
- Logical data models lead to physical relationships.
- Grain of Data.
- Uniqueness of Data.
- Access Patterns.
- Example
- Talking to the Business.
- Normal Forms.
- De-Duplication of Data.
- Join Integrity.
- Keys - Primary and Foreign.
- The Idea Behind Keys.
- Relational Databases (SQL) vs Data Lake (File Based) Modeling.
- The number of Fact tables and Dimensions and normalization.
- File size and table size matter in the new File-Based Data Lakes.
- Partitions vs Indexes.
- Walking the data model line between old and new.
Chapter 9 - Data Quality
- What is Data Quality.
- Reasoning about data.
- Double meanings.
- Data value quality.
- Measures of Data Quality.
- Correct Header or Column Names.
- Correct File Formatting.
- Correct data types.
- Value ranges and value integrity.
- Data Quality Applied
Chapter 10 - DevOps for Data Engineers
- DevOps applied to Data Engineering
- Dockerfiles and Docker-compose.
- Unit Testing.
- CI/CD.
- Automation is the name of the game.
- CI for Data Engineering