Skip to content

Projects in this repository contain analysis, exploration, visualization, and modeling of various datasets

Notifications You must be signed in to change notification settings

evan-anthony/mini-projects

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 

Repository files navigation

Data Science Mini-Projects

This repository started as a way to apply what I learned about data science to a random dataset delivered to my inbox each week by Jeremy Singer-Vine at DataIsPlural.

The following template is what I use for each project, and each mini-project contains a jupyter notebook, a README containing the answers to these questions, and other items I found useful when finding and solving puzzles in the data.

Data Science Project Template

1. Inspecting the Dataset:

a. Initial Glance:

  • Load the dataset.
  • View the first few rows to understand the structure.
  • Check the number of rows, columns, and data types.

b. Summary Statistics:

  • For numerical data: mean, median, min, max, standard deviation.
  • For categorical data: unique values, count of each value.

c. Missing Values:

  • Check for any missing values.
  • Note down columns with a high percentage of missing values.

d. Visualization:

  • Histograms for numerical data.
  • Bar plots for categorical data.
  • Correlation plots or heatmaps for numerical data.

2. Identify Potential Problems or Questions:

a. Classification:

  • Is there a categorical column that can be predicted based on other features?
  • Example: Predicting if a person will default on a loan.

b. Regression:

  • Is there a numerical column that can be predicted?
  • Example: Predicting house prices based on features.

c. Clustering:

  • Are there any interesting patterns or groups in the data?
  • Example: Customer segmentation.

d. Anomaly Detection:

  • Are there any outliers or anomalies in the data?
  • Example: Detecting fraudulent transactions.

3. Data Preprocessing:

a. Handle Missing Data:

  • Impute or drop based on the nature of the dataset.

b. Feature Engineering:

  • Create new features that might be helpful.
  • Example: Extracting day, month, year from a date column.

c. Categorical Data:

  • Convert categorical data into numerical format using encoding techniques like One-Hot Encoding or Label Encoding.

d. Scaling:

  • Standardize or normalize data if necessary.

4. Modeling:

a. Choose a Relevant Algorithm:

  • Based on the problem (classification, regression, clustering).

b. Train/Test Split:

  • Split the data into training and testing sets.

c. Model Training:

  • Train the model on the training data.

d. Model Evaluation:

  • Evaluate the model using appropriate metrics.
  • For classification: accuracy, precision, recall, F1 score, ROC curve.
  • For regression: MSE, RMSE, MAE, R^2.

e. Hyperparameter Tuning (Optional):

  • Use techniques like Grid Search or Random Search to find the best parameters.

5. Documenting Findings:

a. Introduction:

  • Briefly describe the dataset and the problem.

b. Data Exploration:

  • Include visualizations and summary statistics.

c. Data Preprocessing:

  • Describe steps taken and why.

d. Modeling:

  • Describe the algorithm, hyperparameters, and evaluation metrics.

e. Conclusion:

  • Summarize the results and any potential improvements.

About

Projects in this repository contain analysis, exploration, visualization, and modeling of various datasets

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published