Welcome to the code repository for the upcoming book "Hudi In Action". This repository contains hands-on examples, tutorials, and code samples that demonstrate Apache Hudi's capabilities for building robust data lakes.
"Hudi In Action" is a comprehensive guide to Apache Hudi, covering everything from basic concepts to advanced production patterns. The book provides practical examples and real-world scenarios to help you master Hudi for your data engineering needs.
hudiinaction/
├── chapter02/ # Getting Started with Hudi
│   ├── hudi_pipeline_quickstart.scala # Comprehensive Hudi tutorial
│   ├── trips_0.gz # NYC Taxi dataset sample (~1M rows)
│   └── README.md # Chapter-specific instructions
└── README.md # This file
Each chapter contains its own README with specific learning objectives, setup instructions, and detailed guidance.
Before running the examples, ensure you have:
- Apache Spark 3.5+ with Scala 2.12
- Java 8 or 11
- Apache Hudi 1.0.2+ (included via Spark packages)
- At least 4GB RAM available for Spark
- 2+ CPU cores recommended
- ~2GB disk space for sample data and tables
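The requirements above can be sanity-checked from a Scala REPL or `spark-shell` before you start. The following is a minimal sketch, not part of the book's code: the `Preflight` object and its method names are hypothetical, and it treats the JVM's maximum heap as a rough stand-in for "RAM available for Spark", which is an assumption. It does not check the Spark or Hudi versions or free disk space.

```scala
// Hypothetical preflight check; thresholds mirror the prerequisites list above.
object Preflight {
  // Parse the major Java version from strings such as "1.8.0_392", "11.0.21", or "17".
  def javaMajor(version: String): Int = {
    val parts = version.split("[._+-]")
    val first = parts(0).toInt
    // Legacy scheme: "1.8.x" means Java 8.
    if (first == 1 && parts.length > 1) parts(1).toInt else first
  }

  // Build one human-readable line per check; nothing here fails hard.
  def report(): Seq[String] = {
    val rt = Runtime.getRuntime
    val java = javaMajor(System.getProperty("java.version"))
    val scalaVersion = scala.util.Properties.versionNumberString
    val cores = rt.availableProcessors
    // Assumption: max JVM heap approximates the memory Spark can use in local mode.
    val maxHeapGb = rt.maxMemory.toDouble / (1024L * 1024L * 1024L)

    def line(ok: Boolean, msg: String): String =
      (if (ok) "[ok]   " else "[warn] ") + msg

    Seq(
      line(java == 8 || java == 11, s"Java major version: $java (expected 8 or 11)"),
      line(scalaVersion.startsWith("2.12"), s"Scala version: $scalaVersion (expected 2.12.x)"),
      line(cores >= 2, s"CPU cores: $cores (2+ recommended)"),
      line(maxHeapGb >= 4.0, f"Max JVM heap: $maxHeapGb%.1f GB (4 GB or more suggested)")
    )
  }
}

Preflight.report().foreach(println)
```

A `[warn]` line only flags a mismatch with the list; the heap figure in particular depends on how the JVM was launched, so a warning there may just mean the driver memory setting needs raising.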
To get started:
- Clone this repository
- Navigate to the chapter you want to explore
- Follow the chapter-specific README for detailed setup instructions

Each chapter is self-contained, with its own dataset and examples.
Found an issue or want to improve the examples? Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes with clear commit messages
- Submit a pull request
This code is provided as supplementary material for "Hudi In Action". Please refer to the book's license terms for usage restrictions.
For questions about the book or code examples:
- Check the Issues page
- Refer to the Apache Hudi documentation
- Visit the Apache Hudi community
Happy learning with Apache Hudi!