Skip to content
View yih5025's full-sized avatar
💭
Study
💭
Study

Highlights

  • Pro

Organizations

@ubicompteam @Code-SCH @SCHU-CoinGame @PhotoMuze

Block or report yih5025

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Maximum 250 characters. Please don't include any personal information such as legal names or email addresses. Markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
yih5025/README.md

Ilhan Yu - Data Engineer

Projects

1. WeInvesting - Real-Time Investment Intelligence Platform

Description: A comprehensive real-time financial information platform that integrates multiple data sources to provide investors with market data, cryptocurrency prices, economic indicators, news, and social sentiment analysis—all in one unified dashboard.

Tech Stack:

Key Features:

  • Zero-cost data collection using 7 free-tier APIs (NewsAPI, Finnhub, Alpha Vantage, FRED, etc.)
  • Real-time data streaming via WebSocket with Redis Pub/Sub architecture
  • Comprehensive coverage: S&P 500 stocks, 415+ cryptocurrencies, ETFs, and economic indicators
  • 30 database tables across 4 domains (stocks, crypto, news, economic data)
  • AI-powered social sentiment analysis from X/Twitter and Truth Social
  • 5-node K3s cluster for distributed data pipeline execution
  • 13,115 lines of orchestrated DAG code managing automated data collection

📊 Data Collection (Apache Airflow)

Built a robust data pipeline using Apache Airflow to orchestrate automated collection from multiple financial APIs into a PostgreSQL database.

Infrastructure & Scale:

  • 5-node K3s cluster running Apache Airflow 2.8
  • 35 Python DAG files and 60 SQL scripts
  • 30 database tables organized across 4 primary domains
  • Daily throughput of 1,000+ data points

Data Domains:

  1. Stock Data (9 tables): Company overviews, balance sheets, earnings calendars, market news
  2. Cryptocurrency (9 tables): Real-time pricing from Bithumb and CoinGecko with cross-platform mapping
  3. News & Sentiment (8 tables): Market news, social media posts with sentiment scores
  4. Economic Indicators (4 tables): CPI, inflation rates, federal funds rate, treasury yields

API Rate Limit Optimization:

  • Implemented dual-token rotation with 15-minute delays for X API
  • Progressive batching strategy (20 symbols daily) with checkpoint-based resumption
  • Real-time rate limit detection with automatic 30-second backoff
  • Weekday-specialized queries increasing efficiency by 3x

Scheduling Model:

  • Hourly: 5 real-time data collection DAGs
  • Daily: 15 batch processing DAGs (03:00 UTC)
  • Weekly: 8 DAGs (Sundays, 05:00 UTC)
  • Monthly: 7 DAGs (1st of month)

Achievement: Built an enterprise-grade data pipeline with $0 infrastructure cost by leveraging free-tier APIs and optimized rate limit handling.

🖥️ Airflow Web UI

WeInvesting Airflow Web UI


⚙️ Backend (FastAPI)

Developed a high-performance FastAPI-based RESTful API server that aggregates and serves real-time financial data through 24+ endpoints.

Tech Stack:

  • FastAPI v0.104.1 with async/await architecture
  • PostgreSQL with 27+ ORM models and optimized indexing
  • Redis v5.0.1 for caching and Pub/Sub messaging
  • WebSockets v13.0.1 for bidirectional real-time streaming
  • Kubernetes/K3s deployment in investment-assistant namespace

Real-Time Streaming Architecture:

  • 5 WebSocket channels: crypto prices, S&P 500 updates, market data, ETF tracking
  • Redis Pub/Sub for server-initiated data broadcasting
  • Automatic dead connection cleanup with heartbeat monitoring

API Categories (All under /api/v1 prefix):

  • Market News: 6 endpoints with search, filtering, and statistics
  • S&P 500: Market overview, symbol details, interactive charts
  • Cryptocurrency: Real-time pricing for 415+ coins, investment analysis, ecosystem metrics
  • Economic Data: CPI, Federal Funds Rate, Treasury Yields with historical trends
  • Social Sentiment: X/Twitter and Truth Social posts with sentiment analysis

Performance Optimizations:

  • Two-layer Redis caching with 60-second TTL
  • Compound database indexes on symbols and timestamps
  • Batch query operations reducing database round-trips
  • Async I/O throughout the entire stack
  • Connection pooling with SQLAlchemy v2.0.23

Project Scale:


💻 Frontend (React + TypeScript)

Built a modern, responsive web application providing an intuitive dashboard for real-time investment intelligence.

Tech Stack:

  • React 18 with TypeScript and Vite build tool
  • Tailwind CSS and Radix UI for modern, accessible design
  • Recharts and Lightweight Charts for financial data visualization
  • React Query for server state and Zustand for client state management
  • Socket.IO Client for WebSocket real-time updates
  • Framer Motion for smooth animations

Five Primary Modules:

  1. Home Dashboard: Real-time market overview, earnings calendar, IPO tracker
  2. Market Intelligence: Live stock (S&P 500, NASDAQ), crypto (Bitcoin, Ethereum), and ETF data with technical charts
  3. SNS Analytics: X/Twitter post analysis with AI-powered investor sentiment detection
  4. Financial News: Real-time news feed with summaries and ticker connections
  5. Economic Indicators: Macroeconomic dashboard tracking rates, inflation, GDP with historical trends

Key Features:

  • Mobile-first responsive design (optimized for 320px–768px)
  • Dark mode support with system preference detection
  • Real-time data updates via WebSocket streams
  • Performance optimizations: lazy loading, memoization, code splitting
  • Browser history management with state restoration
  • XSS prevention and environment-based security

Live Platform: https://weinvesting.site

📱 App Screenshots

WeInvesting Home Dashboard WeInvesting Market Intelligence WeInvesting SNS Analytics
홈 페이지 화면 (실적발표 캘린더) 암호화폐 상세보기 페이지 화면 트럼프 sns 분석 페이지 화면

🔗 GitHub Repositories:

Data Collection (DAGs) | Backend API | Frontend

👨🏻‍🔧 Data Pipeline Architecture

WeInvesting Data Pipeline Architecture

🏗️ Infrastructure Architecture

WeInvesting Infrastructure Architecture


2. Cryptocurrency Mock Investment Game

Description: A web-based mock investment game that uses real-time cryptocurrency data for fun and educational investing.

Tech Stack:

Key Features:

  • Real-time data ingestion of up to 900,000 records per minute
  • Automated pipeline for data collection
  • Serverless backend to reduce costs and simplify management
  • To address cloud service cost issues, I replaced AWS EMR with an on-premises distributed computing environment, reducing server costs by 94% ($140 → $7 per month)
  • Collected clickstream data to derive user data analysis

🎥 Game Demo Video: Game Video

🏆 Award: Game Development Contest top prize

More Details: Tech Blog Post

👨🏻‍🔧 Designed Data Pipe Line Architecture

CoinGame DataPipeline Architecture

📈 In-Game Screenshot

CoinGame In-Game Screenshot

CoinGame In-Game Screenshot


2. Large-Scale IoT Data Analysis Paper

Title: "사물인터넷 환경 저비용 대용량 데이터 분석 시나리오 설계 및 성능 비교"

Conference: KCC2024 (한국컴퓨터종합학술대회)

🏆 Award: Best Student Paper Award at KCC2024

Tech Stack:

Key Highlights:

  • Built a low-cost Raspberry Pi cluster using Hadoop (HDFS, Yarn) and Spark (DataFrame)
  • Collected and analyzed 22-year particulate matter data from AirKorea
  • Compared performance of Spark and Pandas on Raspberry Pi cluster vs. a standard PC
  • Achieved Best Student Paper at the Korean Computer Congress 2024

Paper Link: DBpia

Low-Cost IoT Device Based Cluster Setup

Results:

I computed the time that took to analyze certain size of data in four different settings. The settings are:

  • Spark on single RPi
  • Spark on RPi cluster
  • Spark on a normal desktop PC
  • Pandas on a normal desktop PC

Single RPi could complete the task of analyzing big data, though pandas on a normal desktop PC failed due to OOM. Spark on a normal desktop PC showed the best performance, followed by Spark on RPi cluster, then Spark on single RPi.

So then I used less amount of data for experiment - Pandas on a normal desktop PC showed best performance, followed by Spark on a normal desktop PC, then Spark on a RPi cluster, and then Spark on single RPi.

I tried linear regression to predict PM using Scikit-learn and Spark ML. Similar to the experiment result above, when it comes to big data, Pandas and Scikit-learn failed due to OOM, and Spark on RPi cluster succesfully completed machine learning as well as data analysis.

Conclusion

  • Spark on RPi cluster showed fine performance based on relatively low cost
  • Resolved the necessity of costly servers in IoT environments for big data analysis and machine learning

KCC2024 Poster


3. The Korean air pollution data Visualization

Description: A web-based application that collects and visualizes air pollution data from Korea’s public API (such as AirKorea). The site displays monthly trends in pollutants—including PM10, PM2.5, SO₂, NO₂, CO, and O₃—and provides users with the latest air quality indices for various regions. This helps users easily compare regional air quality and understand historical pollution patterns.

Tech Stack: Java Spring InfluxDB Grafana HTML5 CSS3 JavaScript

Key Features

  • Visualization: Graphical display of monthly air pollution trends for various regions.
  • Real-Time Data Updates: Fetches and updates current PM10, PM2.5, and other pollutant data.
  • Regional Comparison: Compare air quality indices across multiple regions in Korea.
  • Historical Data Analytics: Analyze past pollution trends and patterns.
  • Seaching Regional Data: Loads data into MYSQL and displays only data for a specific region in a table format.

🔗 GitHub Repository: Korea Air Pollution Repo

💨 WebPage Screenshot

Air Pollution Image 1 Air Pollution Image 2


4. Assistive Device for the Visually Impaired Using NFC, OCR and Object Detection(Voice Memo.)

Description: Voice Memo. is an assistive technology platform designed to help visually impaired individuals easily access information about objects. By leveraging object detection, OCR, and voice recording, the application allows users to capture information about items (via NFC stickers) and listen to audio memos detailing object-specific information.

🎥 Demo Video: Demo Video

Tech Stack: Java Spring Flutter Python Google ML Fusion360

Key Features

  • Google ML API: for object detection and OCR
  • Raspberry Pi & NFC Modules: for IoT-based voice memo storage
  • TCP Socket Communication: for data transfer between devices

🔗 More Details: Tech Blog - VoiceMemo

Service Architecture

📱 App Feature Description


💬 About Me

  • I’m a Data Engineer with experience building real-time data pipelines using Kafka, Spark, and Hadoop.
  • I also developed backend services, including Spring-based systems and AWS Lambda/API Gateway, to ensure scalability and reliability.
  • On the frontend side, I’m comfortable with JavaScript, HTML, and CSS.
  • For database solutions, I’ve worked with DynamoDB, InfluxDB, and MySQL, focusing on efficient data modeling and queries.

Tech Focus:

  • Data Engineering: Kafka, Spark, Hadoop, Designed ELT/ETL Data Pipeline
  • Data Analysis: Spark DataFrame, Pandas
  • Backend: Spring, AWS Lambda, API Gateway
  • Frontend: JavaScript, HTML, CSS
  • Databases: DynamoDB, InfluxDB, MySQL

🌱 Mission

"I aim to use diverse data sources to build practical services that truly solve real-world problems and help people."

Currently, I want to develop a platform that makes it easy for anyone to collect and handle complex data, bridging the gap between raw information and real-life solutions. I hope to focus on accessible data engineering that benefits people’s daily needs without being overly complicated.


🏆 Achievements

  • Game Development Contest Winner: Developed a real-time cryptocurrency mock investment game, earning top prize.
  • App Development Contest Winner: Developed a "1 Dollor Breakfast" users monitoring app, earning top prize.
  • SW Idea Contest Winner: Suggested the idea of AI special agreement suggestion service for "Jeonse" contracts, earning second prize.
  • Best Student Paper Award at KCC2024: Authored a paper on low-cost IoT-based big data analysis with Raspberry Pi clusters, Spark & Hadoop.

🤝 Contact

Pinned Loading

  1. SCHU-CoinGame/BackEnd SCHU-CoinGame/BackEnd Public

    BackEnd Repository

    Python

  2. SCHU-CoinGame/DataEngineering SCHU-CoinGame/DataEngineering Public

    DataEngineering repository

    Python

  3. APM_Viz APM_Viz Public

    APM_Viz Repo

    Java

  4. Korea-Air-Pollution Korea-Air-Pollution Public

    Korea air pollution repo

    HTML

  5. js_game js_game Public

    using js make little game

    CSS

  6. InfluxDB_to_Mobius_Data InfluxDB_to_Mobius_Data Public

    In this repository, I had get in Mobius server data, using REST API get method, and upload InfluxDB

    Java