This repository contains the source code I wrote for the RAIS project at the CVES group under Prof. Lu. It was used in my report here.
This project aims to investigate methodologies for creating Reproducible AI Software (RAIS). It will identify a list of essential factors that may contribute to (or hurt) reproducibility. Based on this list, RAIS will evaluate reproducibility by examining software repositories and their history to detect when a project's reproducibility starts declining, and will issue alerts. RAIS will use large language models (LLMs) to analyze documentation, comments, source code, and reports in order to understand their contents, validate consistency, and suggest improvements.
- First, clone the RAIS repository to your local machine using the following command:

  ```bash
  git clone https://github.com/AkshathRaghav/RAIS.git
  ```

- Navigate to the cloned repository:

  ```bash
  cd RAIS
  ```

- Create a Python virtual environment (venv):

  ```bash
  python3.9 -m venv venv
  ```

- Activate the virtual environment:
  - On Windows:

    ```bash
    venv\Scripts\activate
    ```

  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```
- Install the dependencies:

  ```bash
  make install
  ```

- Locate the `depot/` folder. Download it as a zipped file and extract it using:

  ```bash
  unzip depot.zip 'depot/*' -d .
  ```

  OR download the `.tar.gz` and extract it using:

  ```bash
  tar -xzvf depot.tar.gz
  ```
The extracted depot follows this structure. If you do not want to use the data above, you can initialize it using the Depot class. More information below.
```
- depot/
  - papers/
    - authors/
  - repository/
    - owner/
    - organization/
    - member/
```
(The goal was to enable easy reuse of previously scraped data!)
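If you just want to bootstrap an empty depot by hand, a few lines of standard-library Python reproduce the skeleton above. This is a minimal sketch that only mirrors the folder layout; the `Depot` class in this repository remains the supported way to initialize it:

```python
from pathlib import Path

# Recreate the empty depot skeleton shown above. This only builds the
# directory layout; it does not populate any scraped data.
root = Path("depot")
for sub in ("papers/authors", "repository/owner",
            "repository/organization", "repository/member"):
    (root / sub).mkdir(parents=True, exist_ok=True)
```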
`depot/mapping.json` contains the file paths of everything related to an ML project:
```json
{
  "https://arxiv.org/abs/2305.19256": {
    "paper": "../depot/papers/2305.19256/Ambient Diffusion: Learning Clean Distributions from Corrupted Data.pdf",
    "paper_metadata": "../depot/papers/2305.19256/metadata.json",
    "authors": [
      "../depot/papers/authors/Giannis Daras.json",
      "../depot/papers/authors/Kulin Shah.json",
      "../depot/papers/authors/Yuval Dagan.json",
      "../depot/papers/authors/Aravind Gollakota.json",
      "../depot/papers/authors/Alexandros G. Dimakis.json",
      "../depot/papers/authors/Adam Klivans.json"
    ]
  },
  "giannisdaras/ambient-diffusion": {
    "branch": "main",
    "tree": "../depot/repository/giannisdaras_ambient-diffusion/tree.json",
    "repo_metadata": "../depot/repository/giannisdaras_ambient-diffusion/metadata.json",
    "commit_history": "../depot/repository/giannisdaras_ambient-diffusion/commit_history.json",
    "star_history": "../depot/repository/giannisdaras_ambient-diffusion/star_history.json",
    "tree_formatted": "../depot/repository/giannisdaras_ambient-diffusion/tree_formatted.txt",
    "owner_metadata": "../depot/repository/owner/giannisdaras.json",
    "organization_metadata": null,
    "members_metadata": []
  },
  ...
}
```

`depot/repository` will contain folders of the form `{owner_name}_{repo_name}/`, as well as:

- `organization/` -> individual organization data
- `members/` -> all members' data across projects
- `owner/` -> data of owners
`depot/papers` will contain folders of the form `{doi}/`, as well as:

- `authors/` -> individual author metadata
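Resolving a project's artifacts from the mapping takes a couple of lines of standard-library Python. Note that the stored paths are relative (`../depot/...`), so this sketch resolves them against the depot folder itself; it assumes you run from the repository root, and the key is taken from the example above:

```python
import json
from pathlib import Path

depot_dir = Path("depot")
with open(depot_dir / "mapping.json") as f:
    mapping = json.load(f)

# Look up one project and resolve its relative paths against depot/.
entry = mapping["giannisdaras/ambient-diffusion"]
tree_path = (depot_dir / entry["tree"]).resolve()
print(entry["branch"], tree_path)
```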
- Under `src/`, I've listed some testing files under the `experiment.*` alias. You can use those to get started.
- Under `src/backend/`, I've modularized the component files:
  - `evaluator/` contains the files for extracting all data
  - `tools/` contains other files for functions like logging and storing
- `src/backend/evaluator/github/github.py` contains extra code and functions which can be used on a GitHub object to:
  - Find any files
  - Extract the code of the files from the web
  - Extract all the lexical components of Python files
  - Extract Markdown file headers
  - etc.

  I'm using them for evaluating the pipeline; see the sketch after this list for what "lexical components" refers to.
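For flavor, the standard library alone can pull apart a Python file into its lexical and structural pieces. This is a generic sketch of the idea, not the repository's implementation:

```python
import ast
import tokenize
from io import BytesIO

source = b"def train(epochs=10):\n    return epochs * 2\n"

# Lexical components: the raw token stream (names, operators, literals).
for tok in tokenize.tokenize(BytesIO(source).readline):
    if tok.type in (tokenize.NAME, tokenize.OP, tokenize.NUMBER):
        print(tokenize.tok_name[tok.type], tok.string)

# Structural components: function definitions recovered from the AST.
tree = ast.parse(source)
print([n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)])
```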
- `src/backend/evaluator/huggingface` is not complete. It can only extract the file tree for now.
- `src/backend/evaluator/paper` gets all the metadata I found important. A lot of the extracted data might be overhead unrelated to the specific project being scraped, but I believe there could be a use for it later.
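For comparison, the file-tree extraction that the `huggingface` module covers can also be done directly with the official `huggingface_hub` client. This is the generic library call, not necessarily how the module implements it, and the repo id below is just an example:

```python
from huggingface_hub import HfApi

# List every file tracked in a repository on the Hugging Face Hub.
api = HfApi()
files = api.list_repo_files("bert-base-uncased")  # example repo id
print(files[:5])
```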