This project categorizes resumes using machine learning and deep learning techniques. Recruiters often have to sift through a large number of resumes for each job opening. This app automates that step by predicting the job category a given resume belongs to: a trained model suggests a category for each uploaded resume, saving recruiters time and resources.
- Python
- Scikit-learn (Machine Learning Models: KNN, SVC, Random Forest)
- TensorFlow/Keras (Deep Learning Models: MLP, RNN, LSTM, BI-LSTM)
- Streamlit (Web Application)
- NLTK/Regex (Text Preprocessing)
- Pandas/Numpy (Data Handling)
- PyPDF2/python-docx (File Parsing)
Resume-Categorization/
├── app/ # Streamlit application files
│ ├── main.py # Main Streamlit app script
│ ├── knn.pkl # Serialized KNN model
│ ├── svc.pkl # Serialized SVC model
│ ├── rf.pkl # Serialized Random Forest model
│ ├── mlp.pkl # Serialized MLP model
│ ├── ensemble.pkl # Serialized Ensemble model
│ ├── tfidf.pkl # Serialized TF-IDF vectorizer
│ └── encoder.pkl # Serialized Label Encoder
├── notebooks/ # Jupyter notebooks for development
│ └── Resume_Categorization.ipynb # Main project notebook
├── data/ # Dataset files
│ └── UpdatedResumeDataSet.csv # Resume dataset
├── requirements.txt # Python dependencies
└── README.md # This file
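The serialized files in `app/` suggest a simple inference path: vectorize the resume text with the saved TF-IDF vectorizer, run one of the saved models, and map the numeric prediction back to a category name with the label encoder. The sketch below illustrates that flow. It assumes the `.pkl` files were written with Python's `pickle` module (they may have been saved with `joblib` instead) and that the models were trained on encoded labels; `load_artifacts` and `predict_category` are hypothetical helper names, not functions from this repository.

```python
import pickle
from pathlib import Path


def load_artifacts(app_dir="app", model_file="svc.pkl"):
    """Load the TF-IDF vectorizer, label encoder, and one model from app_dir.

    Assumption: every file is a plain pickle of a fitted scikit-learn object.
    """
    app_dir = Path(app_dir)
    files = {"tfidf": "tfidf.pkl", "encoder": "encoder.pkl", "model": model_file}
    artifacts = {}
    for key, name in files.items():
        with open(app_dir / name, "rb") as fh:
            artifacts[key] = pickle.load(fh)
    return artifacts


def predict_category(text, tfidf, model, encoder):
    """Return the predicted job category name for one resume text."""
    features = tfidf.transform([text])          # sparse TF-IDF row vector
    label_ids = model.predict(features)         # encoded class id(s)
    return encoder.inverse_transform(label_ids)[0]
```

With the artifacts in place, usage would look like `art = load_artifacts("app", "knn.pkl")` followed by `predict_category(resume_text, art["tfidf"], art["model"], art["encoder"])`. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.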
- Clone the repository:

      git clone https://github.com/your-username/Resume-Categorization.git
      cd Resume-Categorization

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install dependencies:

      pip install -r requirements.txt

To run the Streamlit application:
    streamlit run app/main.py

- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVC)
- Random Forest
- Multilayer Perceptron (MLP)
- Recurrent Neural Network (RNN)
- Long Short-Term Memory (LSTM)
- Bidirectional LSTM (BI-LSTM)
- Voting Classifier (Combines KNN, SVC, and Random Forest)
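As an illustration of the last item, a voting ensemble over KNN, SVC, and Random Forest can be put together with scikit-learn's `VotingClassifier`. This is a minimal sketch, not the notebook's actual configuration: hard (majority) voting, the hyperparameters, the TF-IDF pipeline wrapper, and the toy training texts are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_voting_pipeline():
    """TF-IDF features followed by a majority vote of KNN, SVC, and Random Forest."""
    ensemble = VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(n_neighbors=3)),   # assumed k
            ("svc", SVC(kernel="linear")),                   # assumed kernel
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ],
        voting="hard",  # assumption: each model casts one vote for a class
    )
    return make_pipeline(TfidfVectorizer(), ensemble)


# Toy data for illustration only; the real project trains on the Kaggle dataset.
TOY_TEXTS = [
    "python machine learning pandas scikit-learn",
    "deep learning tensorflow python statistics",
    "data analysis python numpy visualization",
    "java spring boot rest microservices",
    "java hibernate maven backend services",
    "java servlets jsp j2ee backend",
]
TOY_LABELS = ["Data Science"] * 3 + ["Java Developer"] * 3

if __name__ == "__main__":
    pipeline = build_voting_pipeline().fit(TOY_TEXTS, TOY_LABELS)
    print(pipeline.predict(["python machine learning"]))
```

Hard voting only needs each model's predicted class. Soft voting would average class probabilities instead, which requires `SVC(probability=True)`.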
The project uses the Resume Dataset from Kaggle. This dataset consists of resumes categorized into 25 distinct job fields.
- Text Preprocessing Pipeline
  - URL removal
  - Special character removal
  - Whitespace normalization
  - Stopword removal
- Multiple Model Comparison
  - Compare predictions from different models
  - View accuracy metrics for each approach
- File Upload Support
  - Accepts PDF, DOCX, and TXT files
  - Automatic text extraction
- Interactive Web Interface
  - Easy-to-use file upload
  - Clear results display
  - Model performance comparison
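The text extraction and preprocessing steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `app/main.py`: `extract_text` and `clean_text` are hypothetical names, the regular expressions are one plausible choice for each step, and the small `STOPWORDS` set stands in for NLTK's English stopword list so the example runs without downloading corpora.

```python
import re
from pathlib import Path

# Assumption: a tiny illustrative stopword set. The project lists NLTK, whose
# English stopword list would normally be used here instead.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "with", "is", "are", "at"}


def extract_text(path):
    """Return the raw text of a PDF, DOCX, or TXT file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        from PyPDF2 import PdfReader  # imported lazily; only needed for PDFs
        reader = PdfReader(str(path))
        # extract_text() can return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        from docx import Document  # provided by the python-docx package
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="ignore")
    raise ValueError(f"Unsupported file type: {suffix}")


def clean_text(text):
    """Apply the four preprocessing steps in the order listed above."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URL removal
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)          # special character removal
    text = re.sub(r"\s+", " ", text).strip()             # whitespace normalization
    words = [w for w in text.split() if w.lower() not in STOPWORDS]  # stopword removal
    return " ".join(words)
```

A call such as `clean_text(extract_text("resume.pdf"))` would then produce the string passed to the TF-IDF vectorizer. Note that this special-character pattern also strips symbols inside skill names such as `C++` or `C#`, so a real pipeline may want to handle those tokens first.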
Contributions are welcome! Please open an issue or submit a pull request for any improvements.
🌟 Happy resume categorizing! 🌟