This Python script scrapes all the license files and automates the detection of broken links, timeout errors, and other link issues.
- Pre-requisite
- Installation
- Usage
- Integrating with CI
- Unit Testing
- Troubleshooting
- Code of Conduct
- Contributing
- License
- Python 3
- UTF-8 supported console
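A quick way to check the second prerequisite is to ask Python which encoding it uses for standard output. The following is a minimal sketch, not part of the project; the helper name `is_utf8` is hypothetical.

```python
import codecs
import sys


def is_utf8(encoding_name):
    """Return True if the given encoding name resolves to UTF-8."""
    if not encoding_name:
        return False
    try:
        # codecs.lookup normalizes aliases such as "UTF8" or "utf_8".
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        return False


if __name__ == "__main__":
    encoding = sys.stdout.encoding
    print(f"stdout encoding: {encoding}; UTF-8: {is_utf8(encoding)}")
```

If this reports a non-UTF-8 encoding, the console may need to be reconfigured before running the script.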
There are two suggested ways to install. Use User if you only want to run the script. Use Development if you want to work on the script itself.
- Clone the repo
git clone https://github.com/creativecommons/cc-link-checker.git 
- Install dependencies
Using Pipfile (requires pipenv): pipenv install
We recommend using pipenv to create a virtual environment and install dependencies.
- Clone the repo
git clone https://github.com/creativecommons/cc-link-checker.git 
- Create virtual environment and install all dependencies
- Normal
pipenv install --dev 
- Use sync to install the last successful environment. For example:
pipenv sync --dev
- Run the script:
pipenv run link_checker 
pipenv run link_checker -h
usage: link_checker [-h] {deeds,legalcode,rdf,index,combined,canonical} ...
Check for broken links in Creative Commons license deeds, legalcode, and rdf
optional arguments:
  -h, --help            show this help message and exit
subcommands (a single subcomamnd is required):
  {deeds,legalcode,rdf,index,combined,canonical}
    deeds               check the links for each license's deed
    legalcode           check the links for each license's legalcode
    rdf                 check the links for each license's RDF
    index               check the links within index.rdf
    combined            Combined check (deeds, legalcode, rdf, and index)
    canonical           print canonical license URLs
Also see the help output of each subcommand:
pipenv run link_checker deeds -h
usage: link_checker deeds [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                          [--local] [--output-errors [output_file]]
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
pipenv run link_checker legalcode -h
usage: link_checker legalcode [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                              [--local] [--output-errors [output_file]]
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
pipenv run link_checker rdf -h
usage: link_checker rdf [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                        [--local] [--local-index] [--output-errors [output_file]]
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
pipenv run link_checker index -h
usage: link_checker index [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                          [--local-index] [--output-errors [output_file]]
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
pipenv run link_checker combined -h
usage: link_checker combined [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                             [--local] [--local-index]
                             [--output-errors [output_file]]
optional arguments:
  -h, --help            show this help message and exit
  -q, --quiet           decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL   set root URL (default: 'https://creativecommons.org')
  --limit LIMIT         Limit check lists to specified integer (default: 10)
  -v, --verbose         increase verbosity (can be specified multiple times)
  --local               process local filesystem legalcode files to determine
                        valid license paths (uses LICENSE_LOCAL_PATH environment
                        variable and falls back to default:
                        '../creativecommons.org/docroot/legalcode')
  --local-index         process local filesystem index.rdf (uses
                        INDEX_RDF_LOCAL_PATH environment variable and falls back
                        to default: './index.rdf')
  --output-errors [output_file]
                        output all link errors to file (default: errorlog.txt) and
                        create junit-xml type summary (test-summary/junit-xml-
                        report.xml)
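The `--local` and `--local-index` options above describe an environment variable with a fallback default. A minimal sketch of that lookup pattern follows; the function name `resolve_local_paths` is hypothetical and this is not the project's actual code, only the variable names and defaults are taken from the help text.

```python
import os

# Defaults copied from the help output above.
DEFAULT_LICENSE_PATH = "../creativecommons.org/docroot/legalcode"
DEFAULT_INDEX_RDF_PATH = "./index.rdf"


def resolve_local_paths(environ=None):
    """Return (license_path, index_rdf_path), using env vars with fallbacks."""
    env = os.environ if environ is None else environ
    license_path = env.get("LICENSE_LOCAL_PATH", DEFAULT_LICENSE_PATH)
    index_path = env.get("INDEX_RDF_LOCAL_PATH", DEFAULT_INDEX_RDF_PATH)
    return license_path, index_path


print(resolve_local_paths({"LICENSE_LOCAL_PATH": "/tmp/legalcode"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test without touching the real environment.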
pipenv run link_checker canonical -h
usage: link_checker canonical [-h] [-q] [--root-url ROOT_URL] [--limit LIMIT] [-v]
                              [--local] [--include-gnu]
optional arguments:
  -h, --help           show this help message and exit
  -q, --quiet          decrease verbosity (can be specified multiple times)
  --root-url ROOT_URL  set root URL (default: 'https://creativecommons.org')
  --limit LIMIT        Limit check lists to specified integer
  -v, --verbose        increase verbosity (can be specified multiple times)
  --local              process local filesystem legalcode files to determine valid
                       license paths (uses LICENSE_LOCAL_PATH environment variable
                       and falls back to default:
                       '../creativecommons.org/docroot/legalcode')
  --include-gnu        include GNU licenses in addition to Creative Commons
                       licenses
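The subcommand layout shown in the help output above could be modeled with `argparse` roughly as follows. This is an illustrative sketch, not the project's source: it covers only the options common to every subcommand, and it gives `--limit` a default of 10 everywhere even though the `canonical` help lists no default.

```python
import argparse

SUBCOMMANDS = ("deeds", "legalcode", "rdf", "index", "combined", "canonical")


def build_parser():
    """Build a parser that mirrors part of the documented CLI layout."""
    parser = argparse.ArgumentParser(prog="link_checker")
    subparsers = parser.add_subparsers(dest="subcommand", required=True)

    # Options shared by the subcommands, per the help output above.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("-q", "--quiet", action="count", default=0)
    shared.add_argument("-v", "--verbose", action="count", default=0)
    shared.add_argument("--root-url", default="https://creativecommons.org")
    shared.add_argument("--limit", type=int, default=10)

    for name in SUBCOMMANDS:
        subparsers.add_parser(name, parents=[shared])
    return parser


args = build_parser().parse_args(["deeds", "--limit", "5", "-vv"])
print(args.subcommand, args.limit, args.verbose, args.root_url)
```

Using a parent parser for the shared options avoids repeating the same `add_argument` calls for each subcommand.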
Because the script can scrape licenses from local storage, it can be run in CI in two steps:
- Clone this repo in the CI container:
git clone https://github.com/creativecommons/cc-link-checker.git ~/cc-link-checker
- Run link_checker.py in local (--local) and output-errors (--output-errors) mode:
python link_checker.py --local --output-errors
The configuration for GitHub Actions, for example, is present here.
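A CI job may also want to inspect the junit-xml summary that `--output-errors` produces. The sketch below assumes the report follows the common JUnit XML convention of `failures` and `errors` attributes on `<testsuite>` elements; the exact structure of this project's report is not documented here, so treat that as an assumption. The function name `count_problems` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_problems(junit_xml_text):
    """Sum the failures and errors attributes over all <testsuite> elements."""
    root = ET.fromstring(junit_xml_text)
    total = 0
    # iter() also yields the root itself, so a bare <testsuite> root works.
    # Nested suites that repeat their children's totals would be counted twice.
    for suite in root.iter("testsuite"):
        total += int(suite.get("failures", 0)) + int(suite.get("errors", 0))
    return total


# Hypothetical sample report, for illustration only.
SAMPLE = """<testsuites>
  <testsuite name="example" tests="2" failures="1" errors="0">
    <testcase name="ok-link"/>
    <testcase name="broken-link"><failure message="404"/></testcase>
  </testsuite>
</testsuites>"""

print(count_problems(SAMPLE))
```

In a real job, the text would be read from test-summary/junit-xml-report.xml and a non-zero count could be turned into a failing exit code.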
Unit tests are written using the pytest framework. To run them:
- Install dev dependencies
- macOS with Homebrew
pipenv install --dev --python /usr/local/opt/[email protected]/libexec/bin/python 
- General
pipenv install --dev
- Run unit tests
pipenv run pytest -v 
- Python Guidelines — Creative Commons Open Source
- Black: the uncompromising Python code formatter
- flake8: a python tool that glues together pep8, pyflakes, mccabe, and third-party plugins to check the style and quality of some python code.
- isort: A Python utility / library to sort imports.
- UnicodeEncodeError: This error is thrown when the console does not support UTF-8.
- Failing lint build: Ensure the style and syntax are correct:
pipenv run black .
pipenv run isort .
pipenv run flake8 .
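For the UnicodeEncodeError case, one possible workaround is to switch Python's output stream to UTF-8, either by setting the `PYTHONIOENCODING=utf-8` environment variable before running the script or by reconfiguring the stream in code (Python 3.7+). The sketch below demonstrates the second approach on an in-memory stream; the helper name `force_utf8` is hypothetical and the project itself may not do this.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is not None:
        reconfigure(encoding="utf-8")
    return stream


# Demonstration on an in-memory stream that starts out ASCII-only.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="ascii")
force_utf8(stream)
stream.write("\u2713 link ok")  # would raise UnicodeEncodeError under ASCII
stream.flush()
print(raw.getvalue())
```

Calling `force_utf8(sys.stdout)` early in a program applies the same idea to the real console stream, though whether the terminal then renders the characters correctly depends on the terminal itself.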
The Creative Commons team is committed to fostering a welcoming community. This project and all other Creative Commons open source projects are governed by our Code of Conduct. Please report unacceptable behavior to [email protected] per our reporting guidelines.
We welcome contributions for bug fixes, enhancements, and documentation. Please see CONTRIBUTING.md before contributing.