This project covers the process of cleaning, transforming, and organizing raw data into a format better suited to analysis, modeling, and visualization.
This repository documents a structured workflow for exploring, cleaning, and transforming a medical dataset (med_data). The project demonstrates step-by-step processes using both base R and the tidyverse for efficient, readable data analysis.
The workflow includes:
- Initial dataset inspection to understand structure and contents
- Slicing and subsetting data for focused analysis
- Renaming variables for clarity, conciseness, and unit specification
- Filtering based on demographic and clinical conditions
- Feature engineering to derive new, useful variables
- Factor handling and visualization for categorical data analysis

Examine row and column counts
View column names, first and last rows, and summary statistics
Establish an initial understanding of dataset structure and variable distributions
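
The inspection steps above can be sketched with base R alone. The small `med_data` data frame below is a hypothetical stand-in, and its column names (`id`, `age`, `gender`, `sbp`) are assumptions rather than the real dataset's schema.

```r
# Hypothetical stand-in for med_data; the real columns may differ.
med_data <- data.frame(
  id     = 1:6,
  age    = c(25, 34, 61, 47, 72, 38),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  sbp    = c(118, 125, 142, 130, 150, 121)
)

dim(med_data)       # row and column counts together
nrow(med_data)      # number of rows
ncol(med_data)      # number of columns
names(med_data)     # column names
head(med_data, 3)   # first three rows
tail(med_data, 3)   # last three rows
summary(med_data)   # per-column summary statistics
str(med_data)       # column types at a glance
```

If dplyr is loaded, `glimpse(med_data)` gives a view similar to `str()`.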
Select specific rows and columns using base R and tidyverse syntax
Create subsets for targeted analysis
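
As a rough illustration, the same subset can be taken with base R indexing or with dplyr verbs. The data frame and its column names are hypothetical; only the selection pattern is the point.

```r
library(dplyr)

# Hypothetical stand-in for med_data; the real columns may differ.
med_data <- data.frame(
  id     = 1:6,
  age    = c(25, 34, 61, 47, 72, 38),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
  sbp    = c(118, 125, 142, 130, 150, 121)
)

# Base R: rows 1 to 3, columns picked by name
base_subset <- med_data[1:3, c("id", "age")]

# tidyverse: the same rows and columns with slice() and select()
tidy_subset <- med_data %>%
  slice(1:3) %>%
  select(id, age)
```

Both objects hold the same three rows and two columns. The dplyr version tends to read more clearly once several steps are chained.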
Apply clear, short, and consistent names
Add unit labels where necessary
Use medically relevant abbreviations (e.g., sbp for systolic blood pressure)
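
A minimal sketch of this renaming convention with `dplyr::rename()` follows. The original column names and the new names (`sbp_mmhg`, `glu_mgdl`, `age_yr`) are illustrative assumptions, not the names used in the actual dataset.

```r
library(dplyr)

# Hypothetical raw columns with long, inconsistent names.
raw <- data.frame(
  `Systolic Blood Pressure` = c(118, 142),
  `Glucose`                 = c(90, 110),
  `Age in years`            = c(25, 61),
  check.names = FALSE  # keep the spaces so the rename is visible
)

# rename(new_name = old_name): short, lower-case, unit suffix where useful
med_data <- raw %>%
  rename(
    sbp_mmhg = `Systolic Blood Pressure`,
    glu_mgdl = Glucose,
    age_yr   = `Age in years`
  )

names(med_data)
```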
Create subsets for:
- Male participants
- Older adults (≥ 60 years)
- Young adults (20–40 years)

Enable targeted exploratory analysis for specific population groups
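
The three subsets could be built with `dplyr::filter()` along these lines. The column names and the coding of `gender` as `"Male"`/`"Female"` are assumptions; the real dataset may encode these differently.

```r
library(dplyr)

# Hypothetical stand-in for med_data; the real columns and codings may differ.
med_data <- data.frame(
  id     = 1:6,
  age    = c(25, 34, 61, 47, 72, 38),
  gender = c("Male", "Female", "Male", "Female", "Male", "Female")
)

males        <- med_data %>% filter(gender == "Male")
older_adults <- med_data %>% filter(age >= 60)
young_adults <- med_data %>% filter(between(age, 20, 40))  # both ends inclusive

# Base R equivalent of the first subset
males_base <- med_data[med_data$gender == "Male", ]
```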
Convert glucose from mg/dL to mmol/L using:
glu(mmol/L) = glu(mg/dL) ÷ 18
Remove redundant variables after conversion
Standardize units for consistency across the dataset
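
A minimal sketch of the conversion step, assuming the source column holds glucose in mg/dL (the unit for which dividing by 18 yields mmol/L). The column names `glu_mgdl` and `glu_mmol` are hypothetical.

```r
library(dplyr)

# Hypothetical glucose readings, assumed to be in mg/dL.
med_data <- data.frame(
  id       = 1:3,
  glu_mgdl = c(90, 126, 180)
)

med_data <- med_data %>%
  mutate(glu_mmol = glu_mgdl / 18) %>%  # derive the mmol/L column
  select(-glu_mgdl)                     # drop the now-redundant original

med_data
```

Dropping the original column right after the conversion keeps a single glucose variable in the data, so later steps cannot mix units by accident.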
Count category frequencies for race
Convert race to factor type and reorder levels (Asian, Black, White, Other)
Visualize age distribution by race using boxplots
These plots compare the dataset before and after converting race into a factor with reordered levels.
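
The factor steps might look like the sketch below, using dplyr and ggplot2. The toy data and column names are assumptions. Only the level order (Asian, Black, White, Other) comes from the description above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for med_data; the real values may differ.
med_data <- data.frame(
  age  = c(25, 34, 61, 47, 72, 38, 55, 29),
  race = c("White", "Black", "Asian", "Other",
           "White", "Black", "Asian", "White")
)

# Frequency of each race category
med_data %>% count(race)

# Convert to a factor with an explicit level order
med_data <- med_data %>%
  mutate(race = factor(race, levels = c("Asian", "Black", "White", "Other")))

levels(med_data$race)

# Boxplot of age by race; boxes follow the factor level order
p <- ggplot(med_data, aes(x = race, y = age)) +
  geom_boxplot() +
  labs(x = "Race", y = "Age (years)")

print(p)
```

While `race` is a character column, ggplot2 orders the boxes alphabetically. Setting the levels explicitly controls that order.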
Provide a learning reference for tidyverse-style data preprocessing
Produce publication-ready summaries and visualizations
Encourage consistent naming and unit standardization in medical datasets
R (≥ 4.0.0)
Suggested packages:
tidyverse

