Welcome to the AirLife ETL Pipeline workshop! This repository contains the skeleton code for building a simple Extract, Transform, Load (ETL) pipeline for aircraft and airport data.
By the end of this 3-hour workshop, you will have:
- Extracted airport data from a CSV file
- Fetched live flight data from the OpenSky Network API
- Cleaned and transformed the data using Python/pandas
- Loaded the data into a PostgreSQL database
- Verified your pipeline works end-to-end
ETL-AirLife/
├── README.md # This file
├── requirements.txt # Python dependencies
├── main.py # Main pipeline orchestrator
├── database_setup.sql # SQL script to create tables
├── data/
│ └── airports.csv # Sample airport data (50 airports)
└── src/
├── extract_data.py # Data extraction functions
├── transform_data.py # Data cleaning and transformation
└── load_data.py # Database loading functions
Make sure you have installed:
- Python 3.7 or higher
- PostgreSQL 12 or higher
- Git
-
Fork this repository to your GitHub account
-
Clone your fork locally:
git clone https://github.com/YOUR_USERNAME/ETL-AirLife.git cd ETL-AirLife -
Install Python dependencies:
pip install -r requirements.txt
-
Create PostgreSQL database:
# Connect to PostgreSQL psql -U your_username -d postgres # Create database CREATE DATABASE airlife_db; # Exit and reconnect to new database \q psql -U your_username -d airlife_db # Create tables \i database_setup.sql
Edit the database configuration in src/load_data.py:
DATABASE_CONFIG = {
'username': 'your_username', # Replace with your PostgreSQL username
'password': 'your_password', # Replace with your PostgreSQL password
'host': 'localhost',
'port': '5432',
'database': 'airlife_db'
}The repository contains skeleton code with TODO comments. Your job is to implement the missing functionality:
- Implement
extract_airports()to read CSV data - Implement
extract_flights()to fetch data from OpenSky Network API - Handle errors gracefully (network issues, API limits)
- Implement
clean_airports()to remove invalid data - Implement
clean_flights()to standardize API data - Convert units (altitude meters to feet)
- Handle missing values appropriately
- Implement
load_to_database()using pandas to_sql() - Implement
verify_data()to check data was loaded correctly - Update database connection configuration
- Uncomment the function calls once each component works
- Test the full pipeline end-to-end
- Add error handling for robustness
Each module can be tested independently:
# Test extraction
python src/extract_data.py
# Test transformation
python src/transform_data.py
# Test loading (after implementing database config)
python src/load_data.py
# Run full pipeline
python main.pyThe data/airports.csv file contains 50 airports including:
- Major European airports (CDG, LHR, FRA, etc.)
- Valid coordinates and IATA codes
- Some invalid data for testing your cleaning logic
The OpenSky Network API provides real-time flight data over Europe with:
- Aircraft identifiers and callsigns
- Current positions (latitude, longitude, altitude)
- Ground speed and heading information
API Rate Limits: The OpenSky Network has rate limits. If you get errors:
- Wait a few seconds between requests
- Test with smaller geographic areas first
- Use the
test_api_connection()function to debug
Database Connection: If you can't connect to PostgreSQL:
- Check that PostgreSQL service is running
- Verify your username/password
- Make sure the
airlife_dbdatabase exists - Ensure tables are created with
database_setup.sql
Import Errors: Make sure you're in the project root directory when running scripts
Your ETL pipeline is working when:
- ✅
python main.pyruns without errors - ✅ Airport data is loaded into the
airportstable - ✅ Flight data (if API accessible) is loaded into the
flightstable - ✅ You can run SQL queries on your loaded data
- ✅ Your code handles errors gracefully
Once your data is loaded, try these queries:
-- Count total airports
SELECT COUNT(*) FROM airports;
-- Show airports by country
SELECT country, COUNT(*) as airport_count
FROM airports
GROUP BY country
ORDER BY airport_count DESC;
-- Show current flights (if any)
SELECT callsign, origin_country, altitude
FROM flights
WHERE altitude > 10000
LIMIT 5;- OpenSky Network API Documentation
- Pandas Documentation
- SQLAlchemy to_sql() Guide
- PostgreSQL Documentation
If you're stuck:
- Read the TODO comments carefully - they contain hints
- Test each module individually before running the full pipeline
- Use the test functions provided (like
test_api_connection()) - Check the error messages - they usually point to the problem
- Ask your instructor or classmates
After completing this workshop, you'll be ready for the larger AirLife project where you'll design your own startup's complete data pipeline with more advanced features like:
- Multiple data sources
- Complex transformations
- Production-ready error handling
- Data quality monitoring
- Automated scheduling
Good luck building your first ETL pipeline! 🚀