Reproducible Artificial Intelligence Software


This repository contains the source code I wrote for the RAIS project at the CVES group under Prof. Lu. It was used in my report here.

Overview

This project aimed to investigate methodologies for creating Reproducible AI Software (RAIS). It identifies a list of essential factors that may contribute to (or hurt) reproducibility. Based on this list, RAIS evaluates reproducibility by examining software repositories and their history to detect when a project's reproducibility starts declining, and issues alerts. RAIS uses large language models (LLMs) to analyze documentation, comments, source code, and reports in order to understand their contents, validate consistency, and suggest improvements.

Getting Started

Cloning the Repository

  1. First, clone the RAIS repository to your local machine using the following command:
    git clone https://github.com/AkshathRaghav/RAIS.git

Setting Up a Python Virtual Environment

  1. Navigate to the cloned repository:
    cd RAIS
  2. Create a Python virtual environment (venv):
    python3.9 -m venv venv
  3. Activate the virtual environment:
    • On Windows:
      venv\Scripts\activate
    • On macOS and Linux:
      source venv/bin/activate

Installing Dependencies

make install

Getting Data from Google Cloud

Downloading the Data

  • Locate the depot/ folder on Google Cloud and download it as a zipped file. Extract it using:
    unzip depot.zip 'depot/*' -d .        

OR

  • Download the .tar.gz. Extract it using:
    tar -xzvf depot.tar.gz 

Data Overview

The extracted depot follows the structure below. If you do not want to use the data above, you can initialize a fresh depot using the Depot class (more information below).

  # - depot/
  #   - papers/
  #     - authors/
  #   - repository/
  #     - owner/
  #     - organization/ 
  #     - member/

(The goal was to make previously scraped data easy to reuse and build on!)
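A minimal sketch of creating a fresh depot with the Depot class mentioned above; the import path, constructor signature, and population methods here are assumptions on my part, so check src/backend/ for the actual interface.

  # Hypothetical usage -- import path, constructor, and method names are assumptions.
  from backend.tools.depot import Depot      # assumed module location

  depot = Depot("./depot")                   # create (or open) a depot rooted at ./depot
  # Population methods (e.g. adding a scraped paper or repository) would then fill
  # papers/ and repository/ following the structure shown above.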

  1. depot/mapping.json contains the file paths of everything related to an ML project, keyed by the paper URL and the repository name (a loading sketch follows this list):
  {
    "https://arxiv.org/abs/2305.19256": {
      "paper": "../depot/papers/2305.19256/Ambient Diffusion: Learning Clean Distributions from Corrupted Data.pdf",
      "paper_metadata": "../depot/papers/2305.19256/metadata.json",
      "authors": [
        "../depot/papers/authors/Giannis Daras.json",
        "../depot/papers/authors/Kulin Shah.json",
        "../depot/papers/authors/Yuval Dagan.json",
        "../depot/papers/authors/Aravind Gollakota.json",
        "../depot/papers/authors/Alexandros G. Dimakis.json",
        "../depot/papers/authors/Adam Klivans.json"
      ]
    },
    "giannisdaras/ambient-diffusion": {
      "branch": "main",
      "tree": "../depot/repository/giannisdaras_ambient-diffusion/tree.json",
      "repo_metadata": "../depot/repository/giannisdaras_ambient-diffusion/metadata.json",
      "commit_history": "../depot/repository/giannisdaras_ambient-diffusion/commit_history.json",
      "star_history": "../depot/repository/giannisdaras_ambient-diffusion/star_history.json",
      "tree_formatted": "../depot/repository/giannisdaras_ambient-diffusion/tree_formatted.txt",
      "owner_metadata": "../depot/repository/owner/giannisdaras.json",
      "organization_metadata": null,
      "members_metadata": []
    }
  },
  ....
  2. depot/repository will contain folders of the form:
    • {owner_name}_{repo_name}/
    • organization/ -> Individual organization data
    • members/ -> Member data across all projects
    • owner/ -> Owner data
  3. depot/papers will contain folders of the form:
    • {doi}/
    • authors/ -> Individual author metadata
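A short, self-contained sketch of loading depot/mapping.json and listing the artifacts recorded for each entry; it assumes the extracted depot/ folder sits in the current working directory, and note that the stored paths are relative to src/ (hence the leading ../depot/).

  # Load mapping.json and list the artifacts recorded for each entry.
  import json
  from pathlib import Path

  depot_root = Path("depot")                      # assumed depot location
  with (depot_root / "mapping.json").open() as f:
      mapping = json.load(f)

  for key, entry in mapping.items():              # key is a paper URL or "{owner}/{repo}"
      print(key)
      for field, value in entry.items():
          # values are relative paths ("../depot/..."), lists of paths, or plain strings
          print(f"  {field}: {value}")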

Codebase Overview

  1. Under src/, I've included some test files under the experiment.* alias. You can use those to get started.
  2. Under src/backend/, I've modularized the component files.
    • evaluator/ contains the files for extracting all data
    • tools/ contains other files for functions like logging and storing
  3. src/backend/evaluator/github/github.py contains additional functions that can be used on a Github object to (a usage sketch follows this list):
    • Find any files
    • Extract the code of those files from the web
    • Extract all the lexical components of Python files
    • Extract Markdown file headers
    • ...and more. I use them for evaluating the pipeline.
  4. src/backend/evaluator/huggingface is not complete; it can only extract the file tree for now.
  5. src/backend/evaluator/paper extracts all the metadata I found important. Much of the extracted data may be overhead unrelated to the specific project being scraped, but I believe it could be useful later.
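A rough sketch of how the Github evaluator might be driven from one of the experiment.* scripts; the import path, constructor argument, and method names below are assumptions based on the description above rather than the actual API, so treat this as pseudocode and check src/backend/evaluator/github/github.py (and the experiment.* files) for the real signatures.

  # Hypothetical usage -- import path, constructor, and method names are assumptions.
  from backend.evaluator.github.github import Github   # assumed module path

  repo = Github("giannisdaras/ambient-diffusion")       # repo identifier as it appears in mapping.json
  matches = repo.find_files("README.md")                # locate files in the repository tree (assumed name)
  contents = repo.extract_code(matches[0])              # fetch file contents from the web (assumed name)
  headers = repo.extract_markdown_headers(contents)     # pull Markdown headers (assumed name)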
