This repository contains the source code for our solution to the Upstage: Making AI Beneficial competition, which achieved a score of 0.966. The solution leverages a Transformer-based model (SAINT) for sequential learning, with pseudo-labeling and ensembling for improved performance.
- Requirements
- Directory Structure
- Training
- Inference
- Ensembling
- Weights and Reproducibility
- Acknowledgments
## Requirements

To run the code, ensure you have the following dependencies installed:
- Python 3.8+
- PyTorch 1.10+
- NumPy
- pandas
- scikit-learn
- joblib
- tqdm
- termcolor
Install the required packages using:

```bash
pip install -r requirements.txt
```

## Directory Structure

The `sub_and_weights` directory contains the model weights, the submission files generated during training and inference, and the hyperparameter settings saved as images.
```
.
├── README.md
├── data
│ ├── sample_output.csv
│ ├── test.csv
│ └── train.csv
├── ensemble.py
├── requirements.txt
├── sub_and_weights
│ ├── subs
│ │ ├── ...
│ ├── v1
│ │ ├── ...
│ ├── v1_pseudo
│ │ ├── ...
│ ├── v2
│ │ ├── ...
│ ├── v2_pseudo
│ │ ├── ...
│ ├── v3
│ │ ├── ...
│ ├── v5
│ │ ├── ...
│ ├── v5_pseudo
│ │ ├── ...
│ ├── v6
│ │ ├── ...
│ ├── v6_pseudo
│ │ ├── ...
│ ├── v6_with_adam
│ │ ├── ...
│ └── v7
│ ├── ...
├── test.py
├── test.sh
├── train.py
└── train.sh
```
## Training

To train the model, use the following command:

```bash
bash train.sh <weights_dir> <data_dir>
```

- `weights_dir`: Directory to save the model weights (without a trailing `/`).
- `data_dir`: Directory containing the competition data (with a trailing `/`).

Example:

```bash
bash train.sh ./sub_and_weights/weights ./data/
```

- The training script supports pseudo-labeling via the `--pseudo` argument.
- Training hyperparameters (e.g., epochs, batch size) can be adjusted in the script.
The training process follows a structured pipeline:

**1. Data Preparation**

- The dataset is grouped by `student_id`, and a sequence of interactions is extracted for each student (see the sketch after this list).
- Features such as `feature_1`, `feature_2`, `question_id`, `bundle_id`, and response correctness are preprocessed.
- Pseudo-labeling is applied to create additional training data if specified.
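For illustration only, here is a minimal sketch of the grouping step. The column names follow the features listed above; `correct` stands in for the response-correctness column, whose exact name in `train.csv` is an assumption.

```python
import pandas as pd

# Minimal sketch of the grouping step. "correct" is a stand-in for the
# response-correctness column; its exact name in train.csv is an assumption.
train = pd.read_csv("./data/train.csv")

# One entry per student: parallel arrays of per-interaction features,
# kept in their original (chronological) order.
groups = train.groupby("student_id").apply(
    lambda df: (
        df["question_id"].values,
        df["bundle_id"].values,
        df["feature_1"].values,
        df["feature_2"].values,
        df["correct"].values,
    )
)
```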
**2. Dataset and Dataloader**

- A custom `SAINTDataset` class handles variable-length sequences and padding (a sketch follows this list).
- The dataset is split into training and validation sets based on a predefined number of students.
- A `DataLoader` is used for efficient batch processing.
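The real `SAINTDataset` is defined in `train.py`; the sketch below only illustrates the variable-length handling. The sequence length (`max_seq=100`), the right-aligned padding, and the use of 0 as the padding id are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SAINTDataset(Dataset):
    """Sketch: pads or truncates each student's sequence to a fixed length.
    max_seq=100 and the padding conventions are assumptions."""

    def __init__(self, groups, max_seq=100):
        self.samples = list(groups)  # one tuple of feature arrays per student
        self.max_seq = max_seq

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        q, b, f1, f2, correct = self.samples[idx]
        n = len(q)

        # Fixed-size buffers; 0 is reserved for padding, so real ids are
        # assumed to start at 1 (shift them by +1 if they do not).
        q_pad = np.zeros(self.max_seq, dtype=np.int64)
        c_pad = np.zeros(self.max_seq, dtype=np.int64)
        if n >= self.max_seq:  # keep only the most recent interactions
            q_pad[:] = q[-self.max_seq:]
            c_pad[:] = correct[-self.max_seq:] + 1  # 1 = wrong, 2 = right
        else:  # right-align shorter sequences
            q_pad[-n:] = q
            c_pad[-n:] = correct + 1
        return torch.from_numpy(q_pad), torch.from_numpy(c_pad)

loader = DataLoader(SAINTDataset(groups), batch_size=64, shuffle=True)
```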
**3. Model Architecture**

- The model uses a Transformer-based sequential encoder (a sketch follows this list).
- Feature embeddings are used for categorical variables such as `question_id`, `bundle_id`, and `feature_3`.
- A feed-forward network (FFN) and LSTM layers are incorporated to enhance sequence learning.
- Dropout and layer normalization are used for better generalization.
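A rough sketch of such an architecture is shown below; it is not the exact model from `train.py`. All layer sizes, the head count, and the ordering of encoder, LSTM, and FFN are assumptions (the tuned hyperparameters are captured in the images under `sub_and_weights`).

```python
import torch
import torch.nn as nn

class SAINTLikeModel(nn.Module):
    """Sketch of the described architecture: categorical embeddings, a
    Transformer encoder, an LSTM, and an FFN head. All sizes are assumptions."""

    def __init__(self, n_questions, d_model=128, n_heads=8, n_layers=2):
        super().__init__()
        self.q_emb = nn.Embedding(n_questions + 1, d_model, padding_idx=0)
        self.c_emb = nn.Embedding(3, d_model, padding_idx=0)  # pad / wrong / right

        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(d_model, 2),  # two logits per position for CrossEntropyLoss
        )

    def forward(self, q_ids, correctness):
        x = self.q_emb(q_ids) + self.c_emb(correctness)
        x = self.encoder(x)       # self-attention over the interaction sequence
        x, _ = self.lstm(x)       # LSTM on top of the encoded sequence
        return self.ffn(self.norm(x))  # (batch, seq, 2)
```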
**4. Training Loop**

- The model is trained with `CrossEntropyLoss` (a sketch follows this list).
- Optimization is handled by SGD or Adam, with Adam selectable via a command-line argument.
- A cosine annealing learning-rate scheduler adjusts the learning rate.
- KFold cross-validation ensures robustness in training.
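A condensed sketch of one fold's loop, tying the pieces together: the loader is assumed to yield `(question ids, correctness, targets)`, padded targets are assumed to be `-1`, and `use_adam` mirrors the Adam option mentioned above (the real flag name is defined in `train.py`).

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def train_fold(model, loader, epochs=10, lr=1e-3, use_adam=True, device="cpu"):
    """Sketch of one fold; argument names and defaults are assumptions."""
    criterion = nn.CrossEntropyLoss(ignore_index=-1)  # skip padded positions
    optimizer = (torch.optim.Adam(model.parameters(), lr=lr) if use_adam
                 else torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    model.to(device)
    for _ in range(epochs):
        model.train()
        for q_ids, correctness, target in loader:
            logits = model(q_ids.to(device), correctness.to(device))
            loss = criterion(logits.view(-1, logits.size(-1)),
                             target.view(-1).to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

# K-fold split over students; each fold trains a fresh model.
n_students = 1000  # placeholder
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(kf.split(np.arange(n_students))):
    ...  # build train/valid loaders from the student indices, then train_fold(...)
```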
## Inference

To run inference and generate predictions, use the following command:

```bash
bash test.sh <weights_dir> <data_dir> <output_dir>
```

- `weights_dir`: Directory containing the saved model weights (without a trailing `/`).
- `data_dir`: Directory containing the competition data (with a trailing `/`).
- `output_dir`: Directory to save the output submission files (without a trailing `/`).

Example:

```bash
bash test.sh ./sub_and_weights/weights ./data/ ./sub_and_weights/subs
```

## Ensembling

To ensemble multiple submission files, use the following command:
```bash
python3 ensemble.py --sub_dir <SUB_DIR> -d <DATA_DIR> -o <OUT_FILE>
```

- `--sub_dir SUB_DIR`: Directory containing the submission files (with a trailing `/`).
- `-d DATA_DIR, --data_dir DATA_DIR`: Directory containing the competition data (with a trailing `/`).
- `-o OUT_FILE, --out_file OUT_FILE`: Output file name (with a `.csv` extension).

Example:

```bash
python3 ensemble.py --sub_dir ./sub_and_weights/subs -d ./data/ -o final_output.csv
```
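`ensemble.py` itself is not reproduced here; the sketch below shows the basic idea of mean-averaging predictions across submission files. The column name `prediction` is an assumption; see `data/sample_output.csv` for the real schema.

```python
import glob
import pandas as pd

def ensemble(sub_dir="./sub_and_weights/subs/", out_file="final_output.csv"):
    """Mean-average the prediction column across all submission CSVs.
    The "prediction" column name is an assumption."""
    subs = [pd.read_csv(p) for p in sorted(glob.glob(sub_dir + "*.csv"))]
    merged = subs[0].copy()
    merged["prediction"] = sum(s["prediction"] for s in subs) / len(subs)
    merged.to_csv(out_file, index=False)

ensemble()
```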
## Weights and Reproducibility

- The provided weights in `./sub_and_weights/v*/*.pth` are genuine and were used to achieve the final score.
- Due to the stochastic nature of training, reproducing the exact results may be challenging; using the provided weights for inference is recommended.
## Acknowledgments

- This solution is based on the SAINT (Separated Self-AttentIve Neural Knowledge Tracing) model.
- Special thanks to the competition organizers and the open-source community for providing valuable resources and tools.
Thank you for using this repository! If you have any questions or suggestions, feel free to open an issue or reach out. Good luck with your projects!