This repository contains the source code I wrote for the RAIS project at the CVES group under Prof. Lu. It was used in my report here.
This project aims to investigate methodologies for creating Reproducible AI Software (RAIS). It will identify a list of essential factors that may contribute to (or hurt) reproducibility. Based on this list, RAIS will evaluate reproducibility by examining software repositories and their history to detect when a project's reproducibility starts declining, and will issue alerts. RAIS will use large language models (LLMs) to analyze documentation, comments, source code, and reports in order to understand their contents, validate consistency, and suggest improvements.
- First, clone the RAIS repository to your local machine using the following command:

  ```bash
  git clone https://github.com/AkshathRaghav/RAIS.git
  ```

- Navigate to the cloned repository:

  ```bash
  cd RAIS
  ```

- Create a Python virtual environment (venv):

  ```bash
  python3.9 -m venv venv
  ```

- Activate the virtual environment:
  - On Windows:

    ```bash
    venv\Scripts\activate
    ```

  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```
- Install the dependencies:

  ```bash
  make install
  ```

- Locate the `depot/` folder. Download it as a zipped file and extract it using:

  ```bash
  unzip depot.zip 'depot/*' -d .
  ```

  OR download the `.tar.gz` and extract it using:

  ```bash
  tar -xzvf depot.tar.gz
  ```
The extracted depot follows this structure. If you do not want to use the data above, you can initialize it using the Depot class. More information below.
```
- depot/
  - papers/
    - authors/
  - repository/
    - owner/
    - organization/
    - member/
```
(The goal was to enable easy reuse of previously scraped data!)
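If you just want to bootstrap an empty depot by hand, a few lines of standard-library Python reproduce the skeleton above. This is a minimal sketch that only mirrors the folder layout; the `Depot` class in this repository remains the supported way to initialize it:

```python
from pathlib import Path

# Recreate the empty depot skeleton shown above. This only builds the
# directory layout; it does not populate any scraped data.
root = Path("depot")
for sub in ("papers/authors", "repository/owner",
            "repository/organization", "repository/member"):
    (root / sub).mkdir(parents=True, exist_ok=True)
```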
`depot/mapping.json` contains the file paths of everything related to an ML project:
```json
{
  "https://arxiv.org/abs/2305.19256": {
    "paper": "../depot/papers/2305.19256/Ambient Diffusion: Learning Clean Distributions from Corrupted Data.pdf",
    "paper_metadata": "../depot/papers/2305.19256/metadata.json",
    "authors": [
      "../depot/papers/authors/Giannis Daras.json",
      "../depot/papers/authors/Kulin Shah.json",
      "../depot/papers/authors/Yuval Dagan.json",
      "../depot/papers/authors/Aravind Gollakota.json",
      "../depot/papers/authors/Alexandros G. Dimakis.json",
      "../depot/papers/authors/Adam Klivans.json"
    ]
  },
  "giannisdaras/ambient-diffusion": {
    "branch": "main",
    "tree": "../depot/repository/giannisdaras_ambient-diffusion/tree.json",
    "repo_metadata": "../depot/repository/giannisdaras_ambient-diffusion/metadata.json",
    "commit_history": "../depot/repository/giannisdaras_ambient-diffusion/commit_history.json",
    "star_history": "../depot/repository/giannisdaras_ambient-diffusion/star_history.json",
    "tree_formatted": "../depot/repository/giannisdaras_ambient-diffusion/tree_formatted.txt",
    "owner_metadata": "../depot/repository/owner/giannisdaras.json",
    "organization_metadata": null,
    "members_metadata": []
  },
  ...
}
```

`depot/repository` will contain folders of the form `{owner_name}_{repo_name}/`, as well as:

- `organization/` -> individual organization data
- `members/` -> all members' data across projects
- `owner/` -> data of owners
`depot/papers` will contain folders of the form `{doi}/`, as well as:

- `authors/` -> individual author metadata
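Resolving a project's artifacts from the mapping takes a couple of lines of standard-library Python. Note that the stored paths are relative (`../depot/...`), so this sketch resolves them against the depot folder itself; it assumes you run from the repository root, and the key is taken from the example above:

```python
import json
from pathlib import Path

depot_dir = Path("depot")
with open(depot_dir / "mapping.json") as f:
    mapping = json.load(f)

# Look up one project and resolve its relative paths against depot/.
entry = mapping["giannisdaras/ambient-diffusion"]
tree_path = (depot_dir / entry["tree"]).resolve()
print(entry["branch"], tree_path)
```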
- Under `src/`, I've listed some testing files under the `experiment.*` alias. You can use those to get started.
- Under `src/backend/`, I've modularized the component files:
  - `evaluator/` contains the files for extracting all data
  - `tools/` contains other files for functions like logging and storing
- `src/backend/evaluator/github/github.py` contains extra code and functions which can be used on a GitHub object to:
  - Find any files
  - Extract the code of the files from the web
  - Extract all the lexical components of Python files
  - Extract Markdown file headers
  - etc.

  I'm using them for evaluating the pipeline; see the sketch after this list for what "lexical components" refers to.
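For flavor, the standard library alone can pull apart a Python file into its lexical and structural pieces. This is a generic sketch of the idea, not the repository's implementation:

```python
import ast
import tokenize
from io import BytesIO

source = b"def train(epochs=10):\n    return epochs * 2\n"

# Lexical components: the raw token stream (names, operators, literals).
for tok in tokenize.tokenize(BytesIO(source).readline):
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER):
        print(tokenize.tok_name[tok.type], tok.string)

# Structural components: function definitions recovered from the AST.
tree = ast.parse(source)
print([n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)])
```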
- `src/backend/evaluator/huggingface` is not complete. It can only extract the file tree for now.
- `src/backend/evaluator/paper` gets all the metadata I found important. A lot of the extracted data might be overhead unrelated to the specific project being scraped, but I believe there could be a use for it later.
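For comparison, the file-tree extraction that the `huggingface` module covers can also be done directly with the official `huggingface_hub` client. This is the generic library call, not necessarily how the module implements it, and the repo id below is just an example:

```python
from huggingface_hub import HfApi

# List every file tracked in a repository on the Hugging Face Hub.
api = HfApi()
files = api.list_repo_files("bert-base-uncased")  # example repo id
print(files[:5])
```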