Research Software Graph (rs-graph) is a software repository which contains the following:

- A Python library for collecting, processing, and standardizing scientific publications and associated research software/code from multiple sources.
- Select publications, analysis scripts, and presentations which use the created dataset(s). Notably, this includes our preprint Code Contribution and Credit in Science, which investigates how code contributions are recognized and rewarded in scientific publications.
If you are interested in:
- the created dataset(s), see the Data section
- the code for specific publications, see the Publications section
- the Python library (either for use or for contributing), see the Data Processing section
In short, the current version of the library produces a dataset containing the following entities and their associated details:
- Documents (e.g., peer-reviewed articles, preprints), with:
  - DOI
  - title
  - publication date
  - cited-by count
  - topics/fields/domains
- Researchers (e.g., scientists), with:
  - name
  - works count
  - h-index
- Document Contributors (e.g., authors), with:
  - position (e.g., first author, last author)
  - corresponding author status
- Repositories (e.g., GitHub repositories), with:
  - owner
  - name
  - description
  - stargazers count
  - created datetime
  - README content
- Developer Accounts (e.g., GitHub accounts), with:
  - username
  - name
- Repository Contributors (e.g., code contributors), with:
  - link between a repository and a developer account
- Document-Repository Links, with:
  - link between a document and a repository (i.e., a paper and its associated code repository)
- Researcher-Developer Account Links, with:
  - link between a researcher and a developer account (i.e., a scientist and their GitHub account)
  - entity matching provenance and confidence details
While there is more information available in the dataset, these are the primary entities and their associated details.
The initial release of data from our processing pipeline is stored in Harvard Dataverse: https://doi.org/10.7910/DVN/KPYVI1
This dataset was used to create the preprint manuscript Code Contribution and Credit in Science.
To access the dataset, please create an account on Harvard Dataverse and download the
rs-graph-v1-redacted.db and/or rs-graph-v1-prod.db SQLite database file(s).
If you simply want access to the article-repository pairs and their associated metadata, the rs-graph-v1-redacted.db
SQLite database file is sufficient. If you want the full dataset, including the full list of repository contributors and
their predicted author associations, you will need to request access to the rs-graph-v1-prod.db file.
Once downloaded, the database can be used like any other SQLite database.
The database schema/models are available at: ./rs_graph/db/models.py
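If you just want to confirm which tables are present before writing queries, Python's built-in sqlite3 module is enough. This is a minimal sketch; it assumes only that the downloaded file is a standard SQLite database:

```python
import sqlite3

# Open the downloaded database file in read-only mode
conn = sqlite3.connect("file:rs-graph-v1-redacted.db?mode=ro", uri=True)

# List every table defined in the database
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
for (table_name,) in tables:
    print(table_name)

conn.close()
```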
With pandas:

```python
import pandas as pd
from sqlalchemy import text, create_engine

# Create a connection to the database
engine = create_engine("sqlite:///rs-graph-v1-redacted.db")

# Helper function to make this easy
def read_table(table: str) -> pd.DataFrame:
    return pd.read_sql(text(f"SELECT * FROM {table}"), engine)

# Load all article info
articles = read_table("document")
print(articles.head(10))
print("-" * 80)

# Load all article-repository links and repository info
article_repo_links = read_table("document_repository_link")
repositories = read_table("repository")

# Merge
merged = article_repo_links.merge(
    articles.rename(
        columns={
            "id": "document_id",
            "title": "article_title",
        }
    ),
    on="document_id",
).merge(
    repositories.rename(
        columns={
            "id": "repository_id",
            "name": "repository_name",
            "description": "repository_description",
        }
    ),
    on="repository_id",
)

# Only display a few columns
merged[[
    "article_title",
    "repository_name",
    "repository_description",
    "cited_by_count",
]].head(10)
```

With SQLModel after installing the rs-graph package:
```python
from rs_graph.db import models as db_models
from sqlmodel import Session, create_engine, select

engine = create_engine("sqlite:///rs-graph-v1-redacted.db")

with Session(engine) as session:
    # Get the first 10 articles in the database
    statement = select(db_models.Document).limit(10)
    articles = session.exec(statement).all()
    for article in articles:
        print(article)
        print("---")

    print("-" * 80)

    # Get the first 10 articles, article-repository-links, and repositories
    statement = select(
        db_models.Document,
        db_models.DocumentRepositoryLink,
        db_models.Repository,
    ).join(
        db_models.DocumentRepositoryLink,
        db_models.Document.id == db_models.DocumentRepositoryLink.document_id,
    ).join(
        db_models.Repository,
        db_models.DocumentRepositoryLink.repository_id == db_models.Repository.id,
    ).limit(10)
    results = session.exec(statement).all()
    for article, link, repo in results:
        print(article, link, repo)
        print("---")
```

Note: You will need to request access to the rs-graph-v1-prod.db SQLite database file from
Harvard Dataverse https://doi.org/10.7910/DVN/KPYVI1 to regenerate the manuscript.
The rs-graph-v1-redacted.db file is not sufficient to regenerate the manuscript.
The manuscript is located at: ./publications/qss-code-authors/qss-code-authors.qmd
This is a Quarto Markdown file which contains not only the text of the manuscript but also the code to load the data from the rs-graph-v1-prod.db SQLite database, conduct all analyses, and create all visualizations.
From start to finish, to serve the web version of the manuscript and enable hot-reloading, run:
```bash
micromamba create --name rs-graph python=3.12 -y
micromamba activate rs-graph
micromamba install -c conda-forge just -y
micromamba install -c conda-forge quarto -y
just install
rs-graph-data download --dataverse-token {{ YOUR_DATAVERSE_API_TOKEN }}
just quarto-serve
```

Notes:
- We use `micromamba` for environment management; feel free to use whichever environment manager you prefer, but you will still need to install `just` and `quarto` via Homebrew / Conda / from source.
- The `rs-graph-data download` command pulls down all required files for the research, including the rs-graph-v1 database and supporting data for the author-developer account matching model.
- The `just quarto-serve` command will start a local server and open the manuscript in your default browser. You can make changes to the manuscript and see them reflected in real-time in your browser.
- If the hot-reload fails to render after a while, kill the process and try again.
- Important: by default, the manuscript will be rendered using a very small sample (2%) of the data to speed up the rendering process and iterative writing. If you want to render with the full dataset, change the `USE_SAMPLE` flag in the Quarto Markdown file to `False` (see the sketch after this list).
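For reference, the toggle is a plain Python variable in the manuscript's setup code. The snippet below is only an illustrative sketch of that flag, not the exact chunk from the file:

```python
# In ./publications/qss-code-authors/qss-code-authors.qmd (setup chunk);
# illustrative only, the surrounding code in the real file differs.
USE_SAMPLE = False  # True (default) renders a ~2% sample; False renders the full dataset
```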
First, clone and set up your environment:
```bash
git clone https://github.com/evamaxfield/rs-graph.git
cd rs-graph
micromamba create --name rs-graph python=3.12 -y
micromamba activate rs-graph
micromamba install -c conda-forge just -y
just install
rs-graph-data download --dataverse-token {{ YOUR_DATAVERSE_API_TOKEN }}
```

Notes:
- We use `micromamba` for environment management; feel free to use whichever environment manager you prefer, but you will still need to install `just` via Homebrew / Conda / from source.
- The `rs-graph-data download` command pulls down all required files for the research, including the rs-graph-v1 database and supporting data for the author-developer account matching model.
After setting up your environment, you can process a pre-linked dataset (e.g., JOSS, PLOS, Papers with Code) with:

```bash
rs-graph-pipelines prelinked-dataset-ingestion {dataset_name}
```

Currently supported datasets can be found in the lookup table in the source code.
There are some optional arguments as well (an example invocation follows the list):

- `--use-coiled` will use coiled.io to run the non-database-writing portions of the pipeline in parallel on multiple Dask clusters. Database writes still occur locally on your machine so you can use the data while processing. Defaults to False (process everything locally).
- `--use-prod` will use the "production" database for storage rather than the "development" database. Defaults to False (use the development database).
- `--github-tokens-file` allows you to provide the path to a YAML file which contains the GitHub tokens to use for the GitHub API. Defaults to `.github-tokens.yml`. See gh-tokens-loader for details about the GitHub tokens file specification.
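For example, a run against the JOSS source with all three options set might look like the following. Note that `joss` as the dataset identifier is an assumption here; check the lookup table in the source code for the exact names:

```bash
rs-graph-pipelines prelinked-dataset-ingestion joss \
    --use-coiled \
    --use-prod \
    --github-tokens-file .github-tokens.yml
```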
The current version of the library processes pre-linked datasets. That is, datasets which have known links between scientific articles (preprints, peer-reviewed publications, software articles) and their associated code repositories (e.g., GitHub repositories which store the analysis scripts or the tool described by the publication). We are interested in adding other dataset "sources" in the future, but for now, we are focused on these pre-linked datasets.
In short, the processing pipeline looks something like the following (a condensed code sketch follows the steps):
First, gather article-repository pairs from the specified dataset source (e.g., JOSS, PLOS, Papers with Code). Then, for each article-repository pair:
- Parse the software repository "code host" (e.g., GitHub, GitLab), filter out any non-GitHub hosted repositories (for now)
- Filter out any pairs which already exist in the database (done via checks to article DOI and repository owner and name)
- Process the article: first checking for an updated DOI via the Semantic Scholar API, then retrieving article metadata (title, publication date, author information, funding) via the OpenAlex API
- Process the repository: retrieving repository metadata (README content, description, stargazers count, contributor information) via the GitHub API
- Match the retrieved article authors to the retrieved repository developer accounts
- Store all retrieved and processed information in the database, deduplicating where possible
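The sketch below is a minimal, illustrative restatement of that loop in Python. All helper names and the overall module layout are assumptions made for readability; they do not mirror the actual `rs_graph` internals:

```python
# Illustrative pseudocode of the ingestion pipeline described above.
# Every helper used here (load_pairs, parse_code_host, etc.) is hypothetical.

def ingest_prelinked_dataset(dataset_name: str) -> None:
    # Gather article-repository pairs from the dataset source (JOSS, PLOS, etc.)
    pairs = load_pairs(dataset_name)

    for article, repo_url in pairs:
        # 1. Parse the code host and keep only GitHub-hosted repositories (for now)
        host, owner, name = parse_code_host(repo_url)
        if host != "github":
            continue

        # 2. Skip pairs already in the database (checked via DOI and repo owner/name)
        if already_stored(article.doi, owner, name):
            continue

        # 3. Process the article: check for an updated DOI (Semantic Scholar),
        #    then fetch title, publication date, authors, funding (OpenAlex)
        doi = get_updated_doi(article.doi)
        document = fetch_openalex_metadata(doi)

        # 4. Process the repository: README, description, stars, contributors (GitHub API)
        repository = fetch_github_metadata(owner, name)

        # 5. Match article authors to repository developer accounts
        matches = match_authors_to_developers(document.authors, repository.contributors)

        # 6. Store everything, deduplicating entities where possible
        store(document, repository, matches)
```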
If you are interested in contributing to the data processing library, please reach out via email at [email protected] or open an issue on the GitHub repository. We would love to collaborate with you and help you get started. We may already be working on something you are interested in and would love to have you join us!