
2019 03 06 Arch Infra Decisions MARC Data Flow SOLR

Chad Nelson edited this page Mar 6, 2019 · 5 revisions

Airflow approach

Jobs are scheduled on a periodic basis, not in real time: reach out to data sources, extract records from their current place, transform them, and load them somewhere else.
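The shape of each job is plain extract/transform/load. A minimal sketch of that shape in Python (the function names, sample source, and in-memory target are hypothetical illustrations, not the actual tasks):

```python
# Minimal ETL shape for a scheduled, non-real-time job.
# All names and data here are illustrative assumptions.

def extract(source):
    """Reach out to a data source and pull raw records."""
    return [{"id": rid, "raw": raw} for rid, raw in source]

def transform(records):
    """Normalize raw records into the shape the target expects."""
    return [{"id": r["id"], "title": r["raw"].strip().title()} for r in records]

def load(records, target):
    """Write transformed records somewhere else (here: a dict)."""
    for r in records:
        target[r["id"]] = r
    return target

if __name__ == "__main__":
    source = [("rec1", "  the great gatsby "), ("rec2", "moby dick")]
    index = load(transform(extract(source)), {})
    print(index["rec1"]["title"])  # prints "The Great Gatsby"
```

In Airflow each of these stages would become its own task, so a failure in one stage can be retried without re-running the others.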

OAI Harvest

A series of Python scripts, which sometimes call out to other scripts.

Steps

  • Date of last harvest is maintained in Airflow
  • Sickle (python lib) does the OAI harvesting
  • Sickle hits the Ex Libris Alma OAI endpoint.
  • In memory, iterates over all records from this harvest (transparently handling resumption_tokens), splits deletes from updates/creates, and when all OAI records are consumed, writes two XML files: one for deletes and one for creates/updates.
  • Prepares solr for updates
    • don't serve up data until we're ready - pause replication
  • Use traject to send new records to solr
    • airflow configuration tracks which solr to send to
    • traject needs to be installed locally - grabbed the tul_cob version
    • rvm, bundler, etc
    • run traject on create/updates?
    • run airflow task to delete records?
    • send commit message to solr
    • once those have completed successfully, archive the two XML files.
    • update last harvest date
    • turn solr replication back on
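The delete/upsert split in the steps above can be sketched with stdlib ElementTree standing in for Sickle's record objects (Sickle itself yields parsed OAI records and follows resumption tokens for you). The sample XML, identifiers, and output filenames are assumptions:

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def split_records(oai_record_xml_strings):
    """Partition OAI-PMH <record> elements into deletes vs. creates/updates.

    A record is a delete when its <header> carries status="deleted",
    per the OAI-PMH spec; everything else is a create/update.
    """
    deletes, upserts = [], []
    for raw in oai_record_xml_strings:
        record = ET.fromstring(raw)
        header = record.find(f"{OAI_NS}header")
        if header is not None and header.get("status") == "deleted":
            deletes.append(raw)
        else:
            upserts.append(raw)
    return deletes, upserts

# Example input: one deleted record, one created/updated record.
SAMPLES = [
    '<record xmlns="http://www.openarchives.org/OAI/2.0/">'
    '<header status="deleted"><identifier>oai:alma:1</identifier></header>'
    "</record>",
    '<record xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<header><identifier>oai:alma:2</identifier></header>"
    "<metadata/></record>",
]

deletes, upserts = split_records(SAMPLES)
# Once the whole harvest is consumed, each list would be wrapped in a
# root element and written out, e.g. deletes.xml and updates.xml.
```

The updates file is what traject indexes; the deletes file feeds the Airflow delete task.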

Broader Questions

  • What does the broader infrastructure look like?
    • 3 tier system environment
      • qa (CD on merge to master)
      • stage (CD on merge to release-candidate)
      • prod (CD on release)
  • How do we deploy Airflow?
    • Terraform building the box
    • ansible adds airflow and known dags
    • circle_ci/travis handles CD to stage based on tags
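Tag-driven CD in CircleCI is typically expressed as a workflow filter; a hypothetical fragment (the job name and tag pattern are assumptions, not our actual config):

```yaml
workflows:
  deploy:
    jobs:
      - deploy_stage:
          filters:
            tags:
              only: /^v.*/        # run only on version tags
            branches:
              ignore: /.*/        # never on plain branch pushes
```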

Outstanding Broader Questions

  • What happens when a single record fails?
  • What happens when Alma OAI fails?
  • How do we handle solr changes?
  • How do we use our patched traject?
  • How do we decide which version of the tul_cob / traject configs to use?
  • RVM / bundler setup seems like it should be handled by Ansible?
  • What does notification look like? Slack, email?
  • Is a separate repo for each dag what we actually need? Can we have one big dag?
