A comprehensive Robotic Process Automation (RPA) pipeline for curating systematic literature review datasets to enable Large Language Model (LLM) automation of article selection tasks.
This project addresses a critical challenge in systematic literature reviews: the time-intensive manual process of article selection and metadata curation. Traditional systematic reviews can take 1-3 years and require reviewing thousands of articles. Our solution creates high-quality annotated datasets from published systematic reviews using automated metadata extraction techniques.
- 16 systematic review datasets processed and curated
- 32,614 total articles with extracted metadata
- 99% article recovery rate from academic databases
- 97% automation success rate for metadata extraction
- 8 academic database sources integrated
The curated datasets enable researchers to train and evaluate LLMs for automating systematic review processes, potentially reducing review time from years to weeks while maintaining scientific rigor.
| Component | Technology | Purpose |
|---|---|---|
| Web Automation | Selenium WebDriver | Browser automation and navigation |
| HTML Parsing | BeautifulSoup4 | Content extraction and parsing |
| Bibliography Processing | Pybtex | BibTeX format handling |
| Data Processing | Pandas | Data manipulation and analysis |
| Text Processing | NLTK | Natural language processing |
| Cross-platform Support | Python 3.8+ | Windows and Linux compatibility |
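As a concrete illustration of how this stack fits together, here is a minimal sketch of a Selenium + BeautifulSoup fetch-and-parse step. It assumes Firefox and geckodriver are installed; the URL and selector are hypothetical placeholders, not the pipeline's actual targets.

```python
# Minimal sketch: fetch a page with Selenium, parse it with BeautifulSoup.
# The URL and selector below are placeholders, not the pipeline's own.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Firefox without opening a window
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://example.org/article/123")  # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")
    title = soup.find("h1")  # hypothetical selector
    print(title.get_text(strip=True) if title else "no title found")
finally:
    driver.quit()
```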
```
Scripts/
├── core/ # Core infrastructure
│ ├── SRProject.py # Base systematic review class
│ ├── os_path.py # Cross-platform path management
│ └── __init__.py
├── datasets/ # Individual dataset processors (16 datasets)
│ ├── ArchiML.py # Architecture & Machine Learning (2,766 articles)
│ ├── CodeClone.py # Code Clone Detection (10,454 articles)
│ ├── GameSE.py # Game Software Engineering (1,520 articles)
│ ├── ModelingAssist.py # Modeling Assistance (3,002 articles)
│ └── ... (12 more datasets)
├── extraction/ # Metadata extraction pipeline
│ ├── findMissingMetadata.py # Core extraction logic
│ ├── webScraping.py # Selenium-based scraping
│ ├── htmlParser.py # HTML content parsing
│ └── searchInSource.py # Source-specific search
├── specialized/ # Specialized processors
│ ├── GameSE_abstract.py # Abstract-level analysis
│ ├── GameSE_title.py # Title-level analysis
│ └── Demo.py, IFT3710.py # Course-specific demos
├── utilities/ # Helper scripts
│ ├── convert_encoding.py # Character encoding conversion
│ ├── get_non_matching_titles.py # Quality control
│ └── rename_html.py # File management
├── data/ # Data files and caches
├── testing/ # Test scripts
├── logs/ # Log files and documentation
└── main.py                    # Main pipeline entry point
```
- Python 3.8+ with pip
- Firefox browser (for Selenium WebDriver)
- Academic database access (institutional subscriptions recommended)
- Windows 10/11 or Ubuntu Linux
- Clone the repository

  ```bash
  git clone [repository-url]
  cd "Projet Curation des métadonnées"
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Configure paths
  - Edit `Scripts/core/os_path.py` for your system paths
  - Ensure Firefox and geckodriver are properly installed
```bash
# Process a single dataset
python Scripts/main.py ArchiML

# Process multiple datasets
python Scripts/main.py CodeClone ModelingAssist GameSE

# Process all default datasets
python Scripts/main.py
```

| Dataset | Domain | Articles | Status | Description |
|---|---|---|---|---|
| ArchiML | ML Architecture | 2,766 | Complete | Architecture and Machine Learning integration |
| CodeClone | Code Analysis | 10,454 | Complete | Code clone detection and management |
| CodeCompr | Code Understanding | 1,508 | Complete | Source code comprehension techniques |
| GameSE | Gaming | 1,520 | Complete | Game software engineering practices |
| ModelingAssist | Modeling Tools | 3,002 | Complete | Model-driven development assistance |
| Behave | Behavioral SE | 1,043 | Complete | Behavioral software engineering |
| DTCPS | Cyber-Physical | 4,007 | Complete | Digital twin cyber-physical systems |
| ESM_2 | Empirical Methods | 1,134 | Complete | Experience sampling methodology |
| ESPLE | Empirical SE | 991 | Complete | Empirical software engineering |
| ModelGuidance | Model-Driven | 2,105 | Complete | Model-driven development guidance |
| OODP | Design Patterns | 1,826 | Complete | Object-oriented design patterns |
| SecSelfAdapt | Security | 1,962 | Complete | Security in self-adaptive systems |
| SmellReprod | Code Quality | 2,067 | Complete | Code smell reproduction studies |
| TestNN | Neural Testing | 2,533 | Complete | Neural network testing approaches |
| TrustSE | Trust & Security | 2,564 | Complete | Trust in software engineering |
Our RPA pipeline extracts metadata from 8 major academic databases:
- IEEE Xplore - Technical publications and conferences
- ACM Digital Library - Computing and information technology
- ScienceDirect - Elsevier's multidisciplinary publications
- SpringerLink - Academic books and journals
- Scopus - Citation and abstract database
- Web of Science - Multidisciplinary citation database
- arXiv - Preprint repository for STEM fields
- PubMed Central - Biomedical literature archive
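Each database needs its own parsing rules, so scraped results are routed by source. The sketch below is a hedged illustration of that routing idea; the hostname map and function name are assumptions, not the actual API of `Scripts/extraction/searchInSource.py`.

```python
# Hypothetical source routing: map an article URL's host to its database
# so the matching parser can be selected. Illustrative only.
from urllib.parse import urlparse

SOURCE_BY_HOST = {
    "ieeexplore.ieee.org": "IEEE Xplore",
    "dl.acm.org": "ACM Digital Library",
    "www.sciencedirect.com": "ScienceDirect",
    "link.springer.com": "SpringerLink",
    "www.scopus.com": "Scopus",
    "www.webofscience.com": "Web of Science",
    "arxiv.org": "arXiv",
    "www.ncbi.nlm.nih.gov": "PubMed Central",
}

def identify_source(url: str) -> str:
    """Return the database a URL belongs to, or 'unknown'."""
    return SOURCE_BY_HOST.get(urlparse(url).netloc, "unknown")

print(identify_source("https://dl.acm.org/doi/10.1145/3377811"))  # ACM Digital Library
```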
```python
# Enable/disable metadata extraction
do_extraction = True  # Set to False for testing without web scraping

# Process specific datasets
args = ['ArchiML', 'CodeClone', 'ModelingAssist']

# Run identifier for batch processing
run = 999
```

Edit `Scripts/core/os_path.py` for your environment:

```python
# Main project path
MAIN_PATH = "C:\\Users\\...\\Projet Curation des métadonnées"

# Extracted content cache
EXTRACTED_PATH = "C:\\Users\\...\\Database"
```

All datasets follow this standardized schema:
Core bibliographic fields:

| Field | Type | Description |
|---|---|---|
| `key` | String | Unique article identifier |
| `project` | String | Dataset name |
| `title` | String | Article title |
| `abstract` | String | Article abstract |
| `keywords` | String | Article keywords (semicolon-separated) |
| `authors` | String | Author list (semicolon-separated) |
| `venue` | String | Publication venue |
| `doi` | String | Digital Object Identifier |
Screening decision fields:

| Field | Type | Description |
|---|---|---|
| `screened_decision` | String | Initial screening decision |
| `final_decision` | String | Final inclusion decision |
| `mode` | String | Review mode (new_screen, snowballing) |
| `inclusion_criteria` | String | Inclusion criteria description |
| `exclusion_criteria` | String | Exclusion criteria description |
| `reviewer_count` | Integer | Number of reviewers |
Extraction metadata fields:

| Field | Type | Description |
|---|---|---|
| `source` | String | Academic database source |
| `year` | String | Publication year |
| `meta_title` | String | Source dataset title |
| `link` | String | Source URL |
| `publisher` | String | Publisher information |
| `metadata_missing` | String | Missing metadata indicators |
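A quick way to check that an exported dataset conforms to this schema is to load it and compare columns. A minimal sketch, assuming a TSV export; the file name is a placeholder.

```python
# Verify that an exported TSV contains every schema field listed above.
import pandas as pd

REQUIRED_FIELDS = [
    "key", "project", "title", "abstract", "keywords", "authors", "venue",
    "doi", "screened_decision", "final_decision", "mode",
    "inclusion_criteria", "exclusion_criteria", "reviewer_count",
    "source", "year", "meta_title", "link", "publisher", "metadata_missing",
]

df = pd.read_csv("ArchiML.tsv", sep="\t")  # placeholder path
missing = [c for c in REQUIRED_FIELDS if c not in df.columns]
if missing:
    raise ValueError(f"dataset is missing schema fields: {missing}")
print(f"{len(df)} articles, all {len(REQUIRED_FIELDS)} schema fields present")
```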
```python
# Load systematic review dataset
sr_project = ArchiML()  # Example dataset
```

Initial processing covers:
- Duplicate title detection and resolution (see the sketch below)
- Data schema normalization
- Character encoding standardization
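A hedged sketch of the duplicate-title idea: normalize case and whitespace before comparing, so trivial variants collapse to one entry. The keep-first policy here is an assumption, not necessarily the pipeline's rule.

```python
# Flag duplicate titles after normalizing case and whitespace.
import pandas as pd

df = pd.DataFrame({
    "key": ["a1", "a2", "a3"],
    "title": ["Deep Learning for SE", "deep  learning for SE", "Another Study"],
})
normalized = df["title"].str.lower().str.split().str.join(" ")
print(df[normalized.duplicated(keep="first")])   # rows flagged as duplicates
df = df[~normalized.duplicated(keep="first")]    # assumption: keep first occurrence
```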
```python
# Enable web scraping
do_extraction = True

completed_df = findMissingMetadata.main(sr_project.df, do_extraction, run, dataset_name)
```

Data cleaning then applies:
- Unicode character normalization (see the sketch below)
- Illegal character removal
- Format standardization
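A minimal sketch of the cleaning step, assuming NFKC normalization and control-character removal; the pipeline's exact rules may differ.

```python
# Normalize Unicode, drop control characters, collapse whitespace.
import re
import unicodedata

def clean_text(value: str) -> str:
    value = unicodedata.normalize("NFKC", value)    # e.g. ligature ﬀ -> ff
    value = re.sub(r"[\x00-\x1f\x7f]", " ", value)  # strip control chars
    return re.sub(r"\s+", " ", value).strip()       # collapse whitespace

print(clean_text("E\ufb00icient  Testing\x00"))  # -> "Efficient Testing"
```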
Quality validation then covers:
- Title matching validation
- Missing metadata reporting
- Statistical analysis
```python
# Export to TSV format
ExportToCSV(sr_project)
```

To add a new dataset:
- Create a dataset class in `Scripts/datasets/`:
  ```python
  class NewDataset(SRProject):
      def __init__(self):
          super().__init__()
          self.project_name = "NewDataset"
          # Define inclusion/exclusion criteria
          # Set source file paths
  ```

- Add it to `main.py`:
  ```python
  from Scripts.datasets.NewDataset import NewDataset

  # Add to main() function
  elif arg == "NewDataset":
      sr_project = NewDataset()
  ```

Quality metrics:
- Article Recovery: 99% of target articles successfully located
- Metadata Extraction: 97% automation success rate
- Title Matching Accuracy: >95% using fuzzy matching algorithms
Validation techniques:
- Edit distance algorithms for title similarity (see the sketch below)
- Cross-reference verification when multiple sources available
- Format standardization across all datasets
- Comprehensive error logging for manual review
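For illustration, a minimal sketch of edit-distance title matching using `difflib`; the similarity measure and the 0.95 threshold are assumptions, not the pipeline's exact algorithm.

```python
# Fuzzy title comparison: treat near-identical titles as the same article.
from difflib import SequenceMatcher

def titles_match(expected: str, scraped: str, threshold: float = 0.95) -> bool:
    a, b = expected.lower().strip(), scraped.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(titles_match("A Survey of Code Clone Detection",
                   "A survey of code-clone detection"))  # True
```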
This project supports research in:
- Systematic Literature Review Automation
- Large Language Model Training for Academic Tasks
- Robotic Process Automation in Research
Note: This pipeline requires academic database access and appropriate institutional subscriptions for optimal functionality. The system is designed for research and educational purposes.