Project files for pattern recognition group assignment
Currently contains the following files:
- data/WikiEssentials_L4.7z: output file of the WikiVitalArticles program. Each document is included in its entirety (but split by paragraph).
- preprocess_utils.py: preprocessing functions for Wiki data.
- model_utils.py: various utility functions used for modeling (e.g. loading embeddings).
- 1_preprocess_raw_data.py: preprocessing of raw input data. Currently shortens each article to first 8 sentences.
- 2_baseline_model.py: tokenization, vectorization of input data and baseline model (1-layer NN with softmax classifier).
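To illustrate the shortening step described for 1_preprocess_raw_data.py, here is a minimal sketch of truncating an article to its first 8 sentences. The helper name `first_n_sentences` and the regex-based sentence splitter are assumptions made for this example; the actual script may split sentences differently.

```python
import re

# Hypothetical helper for illustration; not taken from 1_preprocess_raw_data.py.
def first_n_sentences(text, n=8):
    """Return the first n sentences of text, using a naive regex splitter."""
    # Split on whitespace that follows '.', '!' or '?'. This also splits after
    # abbreviations such as "e.g.", so a real sentence tokenizer is more robust.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

# Build a toy article of 12 short sentences and keep only the first 8.
article = " ".join(f"Sentence number {i}." for i in range(1, 13))
shortened = first_n_sentences(article)
print(shortened)
```

Articles with fewer than 8 sentences pass through unchanged, since slicing past the end of the list is harmless.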
- Download and install Anaconda (Python 3)
- Download and install the latest version of RStudio. It is needed to run the Python scripts from within RStudio.
- In a terminal, go to this repository's folder and set up the Conda environment:

      conda env create -f environment.yml

- Install PyTorch with CUDA 9.2 support:

      conda activate VitalWikiClassifier
      conda install pytorch torchvision cudatoolkit=9.2 -c pytorch -c defaults -c numba/label/dev

- In R, install the reticulate library:

      install.packages("reticulate")

- Check the .Rprofile file to ensure that R knows where to find your Anaconda distribution.
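
When checking the .Rprofile file, it can help to know which Python interpreter the Conda environment actually provides. The sketch below prints that information; run it with the VitalWikiClassifier environment activated and compare the output with the path configured in .Rprofile. The function name `describe_python_environment` is made up for this example, and it assumes that Conda sets the `CONDA_DEFAULT_ENV` variable on activation.

```python
import os
import sys

# Hypothetical helper for illustration; not part of this repository.
def describe_python_environment():
    """Collect details that identify the running Python interpreter."""
    return {
        # Full path of the interpreter executing this script.
        "executable": sys.executable,
        # Root folder of the active Python installation or environment.
        "prefix": sys.prefix,
        # Name of the active Conda environment, if Conda has set it.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(not set)"),
    }

for key, value in describe_python_environment().items():
    print(f"{key}: {value}")
```

If `conda_env` does not show VitalWikiClassifier, the environment is probably not activated in the current terminal.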