This project demonstrates a REST API that processes PDF resumes to return the field of occupation and seniority level, along with the accuracy of the classification. The machine learning model was trained with 26 occupation categories and over 3,000 resumes.
- PDF Processing: Extracts text from PDF resumes with high efficiency.
- Machine Learning Classification: Classifies resumes into 26 occupation fields and multiple levels of seniority.
- REST API: Provides a simple API for integration with other systems.
- Model Persistence: Uses joblib to save and load trained models for deployment.
- Comprehensive Documentation: Includes API documentation and performance benchmarks.
- Python 3.8 or higher
- Pip (Python package manager)
- Clone the repository:
  git clone https://github.com/viniciusmecosta/cvClassifier.git
  cd cvClassifier
- Install dependencies:
  pip install -r requirements.txt
- Run the API:
  uvicorn server:app --reload
- Access the API:
  - The API will be available at http://127.0.0.1:8000.
  - Interactive API documentation (Swagger UI) is available at http://127.0.0.1:8000/docs.
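As a rough illustration of calling the running service from Python, the sketch below builds a multipart upload request for a PDF using only the standard library. The route /classify and the form field name file are assumptions made for illustration; the actual route and field name are the ones listed in the Swagger UI at /docs.

```python
import urllib.request
import uuid

API_URL = "http://127.0.0.1:8000/classify"  # hypothetical route; check /docs


def build_upload_request(url, pdf_bytes, filename="resume.pdf", field="file"):
    """Build a multipart/form-data POST request carrying one PDF file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="{field}"; '
         f'filename="{filename}"\r\n').encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header(
        "Content-Type", f"multipart/form-data; boundary={boundary}"
    )
    return request


# Placeholder bytes stand in for a real resume; read a file in practice.
request = build_upload_request(API_URL, b"%PDF-1.4 placeholder")

# Sending requires the server to be running, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

The response format is not shown here because it depends on how the server defines its endpoint.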
The model was trained on a custom dataset tailored for this use case. You can find the dataset on Kaggle:
Resume Occupation and Seniority Dataset
- Benchmark: Time taken to process all PDFs in the dataset.
- Libraries Evaluated: Tika, PyMuPDF, Textract, Pypdfium2.
- Library Chosen: pdftotext, for superior performance in processing time.
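A benchmark of this kind can be sketched as a small timing harness that runs each extraction function over the same list of files and records the total time. This is a minimal sketch, not the project's benchmark script: the function benchmark_extractors and the two stand-in extractors are hypothetical, and in a real run each entry would wrap one PDF library's text extraction call.

```python
import time


def benchmark_extractors(extractors, pdf_paths):
    """Return the total seconds each extractor needs to process every path."""
    timings = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        for path in pdf_paths:
            extract(path)
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in extractors so the sketch runs without any PDF library installed.
def fast_extractor(path):
    return f"text from {path}"


def slow_extractor(path):
    time.sleep(0.01)  # simulate a slower library
    return f"text from {path}"


timings = benchmark_extractors(
    {"fast": fast_extractor, "slow": slow_extractor},
    ["a.pdf", "b.pdf", "c.pdf"],
)
fastest = min(timings, key=timings.get)
```

Timing the whole dataset per library, rather than a single file, smooths out per-file variation and matches the benchmark definition above.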
The seniority classifier was trained using text and seniority fields from the dataset.
The area of expertise classifier was trained using both the class number and text fields.
The following preprocessing steps were implemented:
- spaCy:
  - Stopword removal
  - Lemmatization
  - Tokenization
  - Large model used: en_core_web_lg
- re:
  - Removal of hyperlinks
- CSV:
  - Mapping of area of expertise classifications to numerical values for training.
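Two of the steps above, hyperlink removal and label mapping, can be illustrated with the standard library alone. This is a sketch under assumptions: the regular expression and the helper names remove_hyperlinks and build_label_mapping are illustrative, not taken from the project, and the spaCy steps are left out because they need the en_core_web_lg model to be downloaded.

```python
import re

# Assumed pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_hyperlinks(text):
    """Drop URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())


def build_label_mapping(labels):
    """Map each distinct category name to a stable integer id."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


cleaned = remove_hyperlinks("See https://example.com/cv and www.site.org now")
mapping = build_label_mapping(["Sales", "Accountant", "Sales"])
```

Sorting the category names before numbering keeps the mapping reproducible between training runs, which matters when the numeric prediction has to be translated back to a category name at inference time.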
- Logistic Regression
- Support Vector Machine
- Random Forest
- k-Nearest Neighbors
- Bernoulli Naive Bayes
- Naive Bayes
- CatBoost
- XGBoost
XGBoost was selected for its superior performance in both seniority and area of expertise classification.
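The comparison between candidate models can be pictured as scoring each model's predictions on the same held-out labels. The sketch below assumes accuracy as the selection metric and also builds a confusion matrix; the source does not state the exact selection procedure, and the model names and predictions are made-up toy values, not project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def pick_best(predictions_by_model, y_true):
    """Return the name of the model with the highest accuracy."""
    return max(
        predictions_by_model,
        key=lambda name: accuracy(y_true, predictions_by_model[name]),
    )


# Toy data for illustration only.
y_true = ["junior", "senior", "senior", "mid"]
predictions = {
    "model_a": ["junior", "senior", "mid", "mid"],
    "model_b": ["junior", "junior", "mid", "mid"],
}
best = pick_best(predictions, y_true)
matrix_a = confusion_matrix(
    y_true, predictions["model_a"], ["junior", "mid", "senior"]
)
```

Accuracy alone can hide weak classes when categories are imbalanced, which is why a confusion matrix is a useful companion to it.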
- Vectorizer: CountVectorizer from sklearn was used to vectorize the text data.
- Model Persistence: The trained model was exported and loaded using joblib for deployment and inference.
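The vectorize, train, save, and load cycle can be sketched as follows. Assumptions: LogisticRegression stands in for XGBoost so that the example needs only scikit-learn and joblib, the four training texts are toy data, and the file name classifier.joblib is hypothetical. Bundling the vectorizer and classifier in one pipeline is a design choice of this sketch, not necessarily how the project stores them.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data for illustration only.
texts = [
    "python developer backend services api",
    "java software engineer microservices",
    "accountant ledger tax audit",
    "financial accountant balance sheet audit",
]
labels = ["engineering", "engineering", "accounting", "accounting"]

# CountVectorizer turns text into token counts; the classifier learns on them.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# Persist the fitted pipeline and load it back, as a server would at startup.
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "classifier.joblib")
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

sample = ["python developer building backend services"]
predicted = restored.predict(sample)[0]
confidence = restored.predict_proba(sample).max()
```

Saving the vectorizer together with the classifier keeps the vocabulary used at inference identical to the one used in training.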
This project is a proof of concept for a REST API that classifies resumes by field of occupation and seniority level using machine learning. Trained on a custom dataset of over 3,000 resumes spanning 26 occupations, the model provides accurate classifications, and the work also offers insights into the choice of PDF extraction library and machine learning model.
- Effective PDF Processing: After evaluating multiple libraries for PDF extraction, pdftotext was chosen for its superior performance.
- Comprehensive Preprocessing: Text processing using spaCy (stopword removal, lemmatization, tokenization) and hyperlink removal with re ensured clean, relevant data for training.
- Model Evaluation and Selection: Among the evaluated models, XGBoost delivered the best results, offering high accuracy in both seniority and area of expertise classifications.
- Data Vectorization and Persistence: The use of CountVectorizer for vectorization and joblib for model persistence streamlined deployment and inference, ensuring scalability and efficiency.
- Accuracy and Performance: The accuracies and confusion matrices demonstrate the model's effectiveness and reliability in both seniority and area of expertise classifications.
This project highlights the potential of automated resume classification and provides a comprehensive learning experience in handling real-world data, evaluating multiple libraries, and implementing a complete machine learning pipeline.
Free Software, Hell Yeah!





