tldr: We propose InterpreXis, a novel approach to finding human-interpretable concepts inside contextual word embeddings. InterpreXis trains linear classifiers to identify interpretable axis groups, which can then be used for downstream tasks such as text classification and visualization.
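To illustrate the core idea, here is a minimal, self-contained sketch of training a linear classifier to recover a concept direction from embeddings. It uses synthetic vectors as stand-ins for DistilBERT embeddings (the dimensionality, data, and probe setup are illustrative assumptions, not the repo's actual pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 768  # DistilBERT hidden size

# Synthetic "concept" axis: positive examples are shifted along one
# unit direction, negatives are pure noise (illustrative stand-in data).
concept = rng.normal(size=dim)
concept /= np.linalg.norm(concept)
pos = rng.normal(size=(200, dim)) + 2.0 * concept
neg = rng.normal(size=(200, dim))

X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

# A linear probe: the learned weight vector approximates the concept axis.
clf = LogisticRegression(max_iter=1000).fit(X, y)
w = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

print("train accuracy:", clf.score(X, y))
print("alignment with planted axis:", abs(float(w @ concept)))
```

In the real pipeline the inputs would be contextual embeddings rather than synthetic vectors, but the probe-training step is analogous.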
- Duplicate `secrets_example.json`, rename the copy to `secrets.json`, and paste your OpenAI API key into `secrets.json`.
- Open `final_pipeline.ipynb` and run the first few cells until you reach the cell shown in the first screenshot. Change the classification to your desired category (animals, art, cities, clinical).
- Do not run the "create dataset" or "token and create DistilBERT embeddings" sections; scroll until you reach the cell shown in the second screenshot.
- Run the rest of the notebook and everything should proceed smoothly!
- The final dataset is located at `data/final_data.csv` (if you download the whole repo, it should be detected automatically when running `final_pipeline.ipynb`).
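As a quick sanity check of the secrets step above, the snippet below writes and reads back a `secrets.json` in the format one might expect. The field name `OPENAI_API_KEY` is an assumption for illustration; check `secrets_example.json` for the actual key name the notebook expects:

```python
import json

# Create secrets.json with a placeholder key (the "OPENAI_API_KEY" field
# name is assumed here; the repo's secrets_example.json is authoritative).
with open("secrets.json", "w") as f:
    json.dump({"OPENAI_API_KEY": "sk-..."}, f)

# Read the key back, as the notebook presumably does.
with open("secrets.json") as f:
    api_key = json.load(f)["OPENAI_API_KEY"]

print(api_key)
```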
The main files you will need to run our final code are described above. A brief summary of the repo structure is included below:
- `data/`: the files we used to construct our final dataset, as well as other datasets we experimented with while developing our methodology
- `img/`: figures generated by our code showing the results of different experiments
- `outputs/`: textual output from running our code (e.g., LLM outputs, statistics)
- `new-method-exp/`: earlier experiments/iterations of our final methodology
- `old-method-exp/`: the experiments/code needed to run our initial methodology (see Sec. 3 of our paper)