Curriculum Classifier

Here’s a refined and more polished presentation of your document:

Curriculum Classifier

REST API for Classifying Curricula Using Machine Learning Algorithms

This project demonstrates a REST API that processes PDF resumes to return the field of occupation and seniority level, along with the accuracy of the classification. The machine learning model was trained with 26 occupation categories and over 3,000 resumes.

Features

PDF Processing: Extracts text from PDF resumes with high efficiency.
Machine Learning Classification: Classifies resumes into 26 occupation fields and multiple levels of seniority.
REST API: Provides a simple API for integration with other systems.
Model Persistence: Uses joblib to save and load trained models for deployment.
Comprehensive Documentation: Includes API documentation and performance benchmarks.

Getting Started

Prerequisites

Python 3.8 or higher
Pip (Python package manager)

Installation

Clone the repository:

git clone https://github.com/viniciusmecosta/cvClassifier.git
cd cvClassifier

Install dependencies:
```
pip install -r requirements.txt
```
Run the API:
```
uvicorn server:app --reload
```
Access the API:
- The API will be available at http://127.0.0.1:8000.
- Interactive API documentation (Swagger UI) is available at http://127.0.0.1:8000/docs.

Dataset

The model was trained on a custom dataset tailored for this use case. You can find the dataset on Kaggle:

Resume Occupation and Seniority Dataset

Benchmarks and Algorithms Comparison

PDF Extraction Libraries Benchmark

Benchmark: Time taken to process all PDFs in the dataset.
Libraries Evaluated: Tika, PyMuPDF, Textract, Pypdfium2.
Library Chosen: pdftotext for superior performance in processing time.

Classification of Seniority and Area of Expertise

Seniority Classification

The seniority classifier was trained using text and seniority fields from the dataset.

Area of Expertise Classification

The area of expertise classifier was trained using both the class number and text fields.

Preprocessing

The following preprocessing steps were implemented:

Spacy:
- Stopwords Removal
- Lemmatization
- Tokenization
- Large model used: en_core_web_lg
Re:
- Removal of hyperlinks
CSV:
- Mapping of area of expertise classifications to numerical values for training.

Model Evaluation and Selection

Models Evaluated

Logistic Regression
Support Vector Machine
Random Forest
k-Nearest Neighbors
Bernoulli Naive Bayes
Naive Bayes
CatBoost
XGBoost

Model Chosen

XGBoost was selected for its superior performance in both seniority and area of expertise classification.

Vectorizer and Model Persistence

Vectorizer: CountVectorizer from sklearn was used to vectorize the text data.
Model Persistence: The trained model was exported and loaded using joblib for deployment and inference.

Results

Accuracies

Seniority Accuracy

Area of Expertise Accuracy

Confusion Matrices

Confusion Matrix for Area of Expertise

Confusion Matrix for Seniority

Conclusion

This project demonstrates a solid proof of concept for a REST API that classifies resumes into specific fields of occupation and levels of seniority using machine learning algorithms. Trained on a custom dataset of over 3,000 resumes spanning 26 occupations, the model provides accurate classifications while offering valuable insights into PDF extraction libraries and machine learning models.

Key Takeaways

Effective PDF Processing: After evaluating multiple libraries for PDF extraction, pdftotext was chosen for its superior performance.
Comprehensive Preprocessing: Text processing using Spacy (stopwords removal, lemmatization, tokenization) and hyperlink removal with Re ensured clean, relevant data for training.
Model Evaluation and Selection: Among the evaluated models, XGBoost delivered the best results, offering high accuracy in both seniority and area of expertise classifications.
Data Vectorization and Persistence: The use of CountVectorizer for vectorization and joblib for model persistence streamlined deployment and inference, ensuring scalability and efficiency.
Accuracy and Performance: The accuracies and confusion matrices demonstrate the model's effectiveness and reliability in both seniority and area of expertise classifications.

This project highlights the potential of automated resume classification and provides a comprehensive learning experience in handling real-world data, evaluating multiple libraries, and implementing a complete machine learning pipeline.

Free Software, Hell Yeah!

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
app		app
metadata		metadata
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
server.py		server.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

Curriculum Classifier

REST API for Classifying Curricula Using Machine Learning Algorithms

Features

Getting Started

Prerequisites

Installation

Dataset

Benchmarks and Algorithms Comparison

PDF Extraction Libraries Benchmark

Classification of Seniority and Area of Expertise

Seniority Classification

Area of Expertise Classification

Preprocessing

Model Evaluation and Selection

Models Evaluated

Model Chosen

Vectorizer and Model Persistence

Results

Accuracies

Seniority Accuracy

Area of Expertise Accuracy

Confusion Matrices

Confusion Matrix for Area of Expertise

Confusion Matrix for Seniority

Conclusion

Key Takeaways

About

Uh oh!

Uh oh!

Languages

Uh oh!

Uh oh!

viniciusmecosta/CvClassifier

Folders and files

Latest commit

History

Repository files navigation

Curriculum Classifier

REST API for Classifying Curricula Using Machine Learning Algorithms

Features

Getting Started

Prerequisites

Installation

Dataset

Benchmarks and Algorithms Comparison

PDF Extraction Libraries Benchmark

Classification of Seniority and Area of Expertise

Seniority Classification

Area of Expertise Classification

Preprocessing

Model Evaluation and Selection

Models Evaluated

Model Chosen

Vectorizer and Model Persistence

Results

Accuracies

Seniority Accuracy

Area of Expertise Accuracy

Confusion Matrices

Confusion Matrix for Area of Expertise

Confusion Matrix for Seniority

Conclusion

Key Takeaways

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages