This repository started as a way to apply what I learned about data science to a random dataset delivered to my inbox each week by Jeremy Singer-Vine at DataIsPlural.
I use the following template for each project. Each mini-project contains a Jupyter notebook, a README answering the questions below, and any other items I found useful while finding and solving puzzles in the data.
- Load the dataset.
- View the first few rows to understand the structure.
- Check the number of rows, columns, and data types.
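The loading and inspection steps above can be sketched with pandas. This is a minimal example, not the exact code from any notebook here: the inline CSV and its column names are made up so the snippet runs on its own, and in a real project the `read_csv` call would point at the dataset file instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real dataset file.
csv_text = """city,population,founded
Springfield,30000,1850
Shelbyville,25000,1855
Ogdenville,,1901
"""

# Load the dataset (here from a string buffer instead of a file on disk).
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first few rows, to understand the structure
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
```

Note that `population` is read as a float column because it contains a missing value.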
- For numerical data: mean, median, min, max, standard deviation.
- For categorical data: unique values, count of each value.
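As a rough sketch of these summary statistics, assuming a small made-up DataFrame with one numerical column (`price`) and one categorical column (`category`):

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "price": [10.0, 12.5, 9.0, 30.0],
    "category": ["a", "b", "a", "a"],
})

# Numerical column: mean, median, min, max, standard deviation.
numeric_summary = df["price"].agg(["mean", "median", "min", "max", "std"])

# Categorical column: number of unique values and the count of each value.
n_unique = df["category"].nunique()
category_counts = df["category"].value_counts()

print(numeric_summary)
print(n_unique)
print(category_counts)
```

`df.describe()` gives a similar numerical overview in one call if the individual statistics are not needed separately.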
- Check for any missing values.
- Note down columns with a high percentage of missing values.
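One way to run the missing-value check, sketched on made-up data. The 50% cutoff is an assumption chosen for the example; what counts as a "high" percentage depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, for illustration only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": ["x", "y", "z", "w"],
})

# Percentage of missing values per column.
missing_pct = df.isna().mean() * 100

# Assumed threshold for "high"; adjust per dataset.
THRESHOLD = 50
high_missing = missing_pct[missing_pct > THRESHOLD].index.tolist()

print(missing_pct)
print(high_missing)
```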
- Histograms for numerical data.
- Bar plots for categorical data.
- Correlation plots or heatmaps for numerical data.
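The three kinds of plots above could be drawn roughly like this with pandas and matplotlib. The data and column names are invented for the example. The `Agg` backend line is only there so the sketch also runs as a plain script without a display; in a Jupyter notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 69, 80, 92],
    "group": ["a", "b", "a", "c", "a"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram for a numerical column.
df["height"].plot.hist(ax=axes[0], bins=5, title="height")

# Bar plot of value counts for a categorical column.
df["group"].value_counts().plot.bar(ax=axes[1], title="group")

# Correlation heatmap for the numerical columns.
corr = df[["height", "weight"]].corr()
im = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns)
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("correlation")
fig.colorbar(im, ax=axes[2])

fig.tight_layout()
```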
- Is there a categorical column that can be predicted based on other features?
- Example: Predicting if a person will default on a loan.
- Is there a numerical column that can be predicted?
- Example: Predicting house prices based on features.
- Are there any interesting patterns or groups in the data?
- Example: Customer segmentation.
- Are there any outliers or anomalies in the data?
- Example: Detecting fraudulent transactions.
- Impute or drop missing values, depending on the nature of the dataset.
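A minimal sketch of both options on made-up data. Which columns to drop and which to impute, and the choice of the median as the fill value, are assumptions for the example rather than a general rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps, for illustration only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 50.0],
    "city": ["Oslo", "Rome", None, "Lima"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Drop a column that is mostly missing.
cleaned = df.drop(columns=["mostly_empty"])

# Impute a numerical column with its median.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())

# Drop the rows where a column that cannot be imputed sensibly is missing.
cleaned = cleaned.dropna(subset=["city"])

print(cleaned)
```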
- Create new features that might be helpful.
- Example: Extracting day, month, year from a date column.
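The date example can be sketched with the pandas `.dt` accessor. The `date` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"date": ["2023-01-15", "2024-12-31"]})

# Parse the strings into datetimes, then pull out the parts as new features.
df["date"] = pd.to_datetime(df["date"])
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

print(df)
```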
- Convert categorical data into numerical format using encoding techniques like One-Hot Encoding or Label Encoding.
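Both encodings can be sketched as follows, assuming pandas and scikit-learn are installed. The `color` column is made up. Note that scikit-learn documents `LabelEncoder` as intended for target labels, so using it on a feature column here is purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical data, for illustration only.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-Hot Encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label Encoding: one integer per category, assigned in sorted order.
labels = LabelEncoder().fit_transform(df["color"])

print(one_hot)
print(labels)
```

Label encoding implies an order between categories, which some models will treat as meaningful, so one-hot encoding is often the safer default for unordered categories.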
- Standardize or normalize data if necessary.
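A small sketch of the difference between the two, using scikit-learn's `StandardScaler` (zero mean, unit variance) and `MinMaxScaler` (rescale to the range 0 to 1) on a made-up single-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data, for illustration only.
X = np.array([[1.0], [2.0], [3.0]])

# Standardize: subtract the mean and divide by the standard deviation.
standardized = StandardScaler().fit_transform(X)

# Normalize: rescale so the minimum maps to 0 and the maximum to 1.
normalized = MinMaxScaler().fit_transform(X)

print(standardized.ravel())
print(normalized.ravel())
```

When a train/test split is involved, the scaler is usually fitted on the training data only and then applied to the test data.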
- Choose a model based on the problem type (classification, regression, clustering).
- Split the data into training and testing sets.
- Train the model on the training data.
- Evaluate the model using appropriate metrics.
- For classification: accuracy, precision, recall, F1 score, ROC curve.
- For regression: MSE, RMSE, MAE, R^2.
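The split, train, and evaluate steps above might look like this with scikit-learn. Everything specific here is an assumption for the sake of a runnable example: the synthetic data from `make_classification`, logistic regression as the model, and the 75/25 split. The regression metrics are shown on tiny hand-written arrays rather than a fitted model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_absolute_error,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train the model on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the held-out test data.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probabilities for the ROC curve
classification_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
print(classification_metrics)

# Regression metrics on small hand-written arrays (illustration only).
y_true_reg = [3.0, 5.0, 7.0]
y_pred_reg = [2.0, 5.0, 9.0]
mse = mean_squared_error(y_true_reg, y_pred_reg)
regression_metrics = {
    "mse": mse,
    "rmse": mse ** 0.5,
    "mae": mean_absolute_error(y_true_reg, y_pred_reg),
    "r2": r2_score(y_true_reg, y_pred_reg),
}
print(regression_metrics)
```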
- Use techniques like Grid Search or Random Search to find the best hyperparameters.
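A grid search can be sketched with scikit-learn's `GridSearchCV`. The synthetic data, the logistic regression model, and the candidate values for `C` are assumptions picked for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic classification data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Candidate hyperparameter values to try (assumed for illustration).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Try every combination in the grid with 5-fold cross-validation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, search.best_score_)
```

`RandomizedSearchCV` from the same module samples a fixed number of combinations instead of trying all of them, which tends to be more practical when the grid is large.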
- Briefly describe the dataset and the problem.
- Include visualizations and summary statistics.
- Describe steps taken and why.
- Describe the algorithm, hyperparameters, and evaluation metrics.
- Summarize the results and any potential improvements.