This repository contains the source code for our solution to the Upstage: Making AI Beneficial competition, which achieved a score of 0.966. The solution leverages a Transformer-based model (SAINT) for sequential learning, with pseudo-labeling and ensembling for improved performance.
- Requirements
- Directory Structure
- Training
- Inference
- Ensembling
- Weights and Reproducibility
- Acknowledgments
## Requirements

To run the code, ensure you have the following dependencies installed:
- Python 3.8+
- PyTorch 1.10+
- NumPy
- pandas
- scikit-learn
- joblib
- tqdm
- termcolor
Install the required packages using:

```bash
pip install -r requirements.txt
```

## Directory Structure

The `sub_and_weights` directory contains the model weights, the submission files generated during training and inference, and the hyperparameter settings saved as images.
```
.
├── README.md
├── data
│ ├── sample_output.csv
│ ├── test.csv
│ └── train.csv
├── ensemble.py
├── requirements.txt
├── sub_and_weights
│ ├── subs
│ │ ├── ...
│ ├── v1
│ │ ├── ...
│ ├── v1_pseudo
│ │ ├── ...
│ ├── v2
│ │ ├── ...
│ ├── v2_pseudo
│ │ ├── ...
│ ├── v3
│ │ ├── ...
│ ├── v5
│ │ ├── ...
│ ├── v5_pseudo
│ │ ├── ...
│ ├── v6
│ │ ├── ...
│ ├── v6_pseudo
│ │ ├── ...
│ ├── v6_with_adam
│ │ ├── ...
│ └── v7
│ ├── ...
├── test.py
├── test.sh
├── train.py
└── train.sh
```
## Training

To train the model, use the following command:

```bash
bash train.sh <weights_dir> <data_dir>
```

- `weights_dir`: Directory to save the model weights (without a trailing `/`).
- `data_dir`: Directory containing the competition data (with a trailing `/`).

Example:

```bash
bash train.sh ./sub_and_weights/weights ./data/
```

- The training script supports pseudo-labeling via the `--pseudo` argument.
- Training hyperparameters (e.g., epochs, batch size) can be adjusted in the script.
The training process follows a structured pipeline:

**1. Data Preparation**

- The dataset is grouped by `student_id`, and a sequence of interactions is extracted for each student (see the sketch after this list).
- Features such as `feature_1`, `feature_2`, `question_id`, `bundle_id`, and response correctness are preprocessed.
- Pseudo-labeling is applied to create additional training data if specified.
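For illustration only, here is a minimal sketch of the grouping step. The column names follow the features listed above; `correct` stands in for the response-correctness column, whose exact name in `train.csv` is an assumption.

```python
import pandas as pd

# Minimal sketch of the grouping step. "correct" is a stand-in for the
# response-correctness column; its exact name in train.csv is an assumption.
train = pd.read_csv("./data/train.csv")

# One entry per student: parallel arrays of per-interaction features,
# kept in their original (chronological) order.
groups = train.groupby("student_id").apply(
    lambda df: (
        df["question_id"].values,
        df["bundle_id"].values,
        df["feature_1"].values,
        df["feature_2"].values,
        df["correct"].values,
    )
)
```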
**2. Dataset and Dataloader**

- A custom `SAINTDataset` class handles variable-length sequences and padding (a sketch follows this list).
- The dataset is split into training and validation sets based on a predefined number of students.
- A `DataLoader` is used for efficient batch processing.
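The real `SAINTDataset` is defined in `train.py`; the sketch below only illustrates the variable-length handling. The sequence length (`max_seq=100`), the right-aligned padding, and the use of 0 as the padding id are assumptions.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class SAINTDataset(Dataset):
    """Sketch: pads or truncates each student's sequence to a fixed length.
    max_seq=100 and the padding conventions are assumptions."""

    def __init__(self, groups, max_seq=100):
        self.samples = list(groups)  # one tuple of feature arrays per student
        self.max_seq = max_seq

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        q, b, f1, f2, correct = self.samples[idx]
        n = len(q)

        # Fixed-size buffers; 0 is reserved for padding, so real ids are
        # assumed to start at 1 (shift them by +1 if they do not).
        q_pad = np.zeros(self.max_seq, dtype=np.int64)
        c_pad = np.zeros(self.max_seq, dtype=np.int64)
        if n >= self.max_seq:  # keep only the most recent interactions
            q_pad[:] = q[-self.max_seq:]
            c_pad[:] = correct[-self.max_seq:] + 1  # 1 = wrong, 2 = right
        else:  # right-align shorter sequences
            q_pad[-n:] = q
            c_pad[-n:] = correct + 1
        return torch.from_numpy(q_pad), torch.from_numpy(c_pad)

loader = DataLoader(SAINTDataset(groups), batch_size=64, shuffle=True)
```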
**3. Model Architecture**

- The model uses a Transformer-based sequential encoder (a sketch follows this list).
- Feature embeddings are used for categorical variables such as `question_id`, `bundle_id`, and `feature_3`.
- A feed-forward network (FFN) and LSTM layers are incorporated to enhance sequence learning.
- Dropout and layer normalization are used for better generalization.
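A rough sketch of such an architecture is shown below; it is not the exact model from `train.py`. All layer sizes, the head count, and the ordering of encoder, LSTM, and FFN are assumptions (the tuned hyperparameters are captured in the images under `sub_and_weights`).

```python
import torch
import torch.nn as nn

class SAINTLikeModel(nn.Module):
    """Sketch of the described architecture: categorical embeddings, a
    Transformer encoder, an LSTM, and an FFN head. All sizes are assumptions."""

    def __init__(self, n_questions, d_model=128, n_heads=8, n_layers=2):
        super().__init__()
        self.q_emb = nn.Embedding(n_questions + 1, d_model, padding_idx=0)
        self.c_emb = nn.Embedding(3, d_model, padding_idx=0)  # pad / wrong / right

        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(d_model, 2),  # two logits per position for CrossEntropyLoss
        )

    def forward(self, q_ids, correctness):
        x = self.q_emb(q_ids) + self.c_emb(correctness)
        x = self.encoder(x)       # self-attention over the interaction sequence
        x, _ = self.lstm(x)       # LSTM on top of the encoded sequence
        return self.ffn(self.norm(x))  # (batch, seq, 2)
```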
**4. Training Loop**

- The model is trained with `CrossEntropyLoss` (a sketch follows this list).
- Optimization is handled by SGD or Adam, with Adam selectable via a command-line argument.
- A cosine annealing learning-rate scheduler adjusts the learning rate.
- KFold cross-validation ensures robustness in training.
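A condensed sketch of one fold's loop, tying the pieces together: the loader is assumed to yield `(question ids, correctness, targets)`, padded targets are assumed to be `-1`, and `use_adam` mirrors the Adam option mentioned above (the real flag name is defined in `train.py`).

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

def train_fold(model, loader, epochs=10, lr=1e-3, use_adam=True, device="cpu"):
    """Sketch of one fold; argument names and defaults are assumptions."""
    criterion = nn.CrossEntropyLoss(ignore_index=-1)  # skip padded positions
    optimizer = (torch.optim.Adam(model.parameters(), lr=lr) if use_adam
                 else torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

    model.to(device)
    for _ in range(epochs):
        model.train()
        for q_ids, correctness, target in loader:
            logits = model(q_ids.to(device), correctness.to(device))
            loss = criterion(logits.view(-1, logits.size(-1)),
                             target.view(-1).to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

# K-fold split over students; each fold trains a fresh model.
n_students = 1000  # placeholder
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, valid_idx) in enumerate(kf.split(np.arange(n_students))):
    ...  # build train/valid loaders from the student indices, then train_fold(...)
```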
## Inference

To run inference and generate predictions, use the following command:

```bash
bash test.sh <weights_dir> <data_dir> <output_dir>
```

- `weights_dir`: Directory containing the saved model weights (without a trailing `/`).
- `data_dir`: Directory containing the competition data (with a trailing `/`).
- `output_dir`: Directory to save the output submission files (without a trailing `/`).

Example:

```bash
bash test.sh ./sub_and_weights/weights ./data/ ./sub_and_weights/subs
```

## Ensembling

To ensemble multiple submission files, use the following command:
```bash
python3 ensemble.py --sub_dir <SUB_DIR> -d <DATA_DIR> -o <OUT_FILE>
```

- `--sub_dir SUB_DIR`: Directory containing the submission files (with a trailing `/`).
- `-d DATA_DIR, --data_dir DATA_DIR`: Directory containing the competition data (with a trailing `/`).
- `-o OUT_FILE, --out_file OUT_FILE`: Output file name (with a `.csv` extension).

Example:

```bash
python3 ensemble.py --sub_dir ./sub_and_weights/subs -d ./data/ -o final_output.csv
```
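`ensemble.py` itself is not reproduced here; the sketch below shows the basic idea of mean-averaging predictions across submission files. The column name `prediction` is an assumption; see `data/sample_output.csv` for the real schema.

```python
import glob
import pandas as pd

def ensemble(sub_dir="./sub_and_weights/subs/", out_file="final_output.csv"):
    """Mean-average the prediction column across all submission CSVs.
    The "prediction" column name is an assumption."""
    subs = [pd.read_csv(p) for p in sorted(glob.glob(sub_dir + "*.csv"))]
    merged = subs[0].copy()
    merged["prediction"] = sum(s["prediction"] for s in subs) / len(subs)
    merged.to_csv(out_file, index=False)

ensemble()
```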
## Weights and Reproducibility

- The provided weights in `./sub_and_weights/v*/*.pth` are genuine and were used to achieve the final score.
- Due to the stochastic nature of training, reproducing the exact results may be challenging; using the provided weights for inference is recommended.
## Acknowledgments

- This solution is based on the SAINT (Separated Self-AttentIve Neural Knowledge Tracing) model.
- Special thanks to the competition organizers and the open-source community for providing valuable resources and tools.
Thank you for using this repository! If you have any questions or suggestions, feel free to open an issue or reach out. Good luck with your projects!