2019 03 06 Arch Infra Decisions MARC Data Flow SOLR
Chad Nelson edited this page Mar 6, 2019 · 5 revisions
Jobs are scheduled on a periodic basis, not in real time. They reach out to data sources, extract records from their current place, transform them, and load them somewhere else.
A series of Python scripts, which sometimes call out to other scripts.
- Date of last harvest is maintained in Airflow
- Sickle (Python lib) does the OAI harvesting
- Sickle hits the Ex Libris Alma OAI endpoint.
- In memory, iterates over all records from this harvest (transparently handling resumption tokens) and splits deletes from updates/creates; when all OAI records are consumed, writes two XML files, one for deletes and one for creates/updates.
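The split step above can be sketched without Sickle by reading the `status` attribute on each OAI-PMH record header (Sickle exposes the same information on its record objects). This is a minimal stdlib-only illustration; the sample XML and identifiers are invented for demonstration:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def split_records(oai_xml):
    """Partition OAI-PMH records into deletes and creates/updates.

    A record whose <header> carries status="deleted" is a delete;
    everything else is a create/update.
    """
    deletes, upserts = [], []
    root = ET.fromstring(oai_xml)
    for record in root.iter(OAI + "record"):
        header = record.find(OAI + "header")
        if header is not None and header.get("status") == "deleted":
            deletes.append(record)
        else:
            upserts.append(record)
    return deletes, upserts

# Invented two-record sample: one delete, one update.
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header status="deleted"><identifier>alma:1</identifier></header></record>
    <record><header><identifier>alma:2</identifier></header><metadata/></record>
  </ListRecords>
</OAI-PMH>"""

deletes, upserts = split_records(sample)
print(len(deletes), len(upserts))  # → 1 1
```

In the real scripts each partition would then be serialized back out as the two XML files handed to the delete task and to traject.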
- Prepares solr for updates
- don't serve up data until we're ready - pause replication
- Use traject to send new records to solr
- airflow configuration tracks which solr to send to
- traject needs to be installed locally - grabbed the tul_cob version
- rvm, bundler, etc
- run traject on create/updates?
- run airflow task to delete records?
- send commit message to solr
- once those have completed successfully, archive the two XML files.
- update last harvest date
- turn solr replication back on
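The pause/commit/resume bookends above map onto Solr's replication handler and update handler. A small sketch of the URLs involved, assuming a hypothetical `catalog` core and localhost base URL (the real host and core come from the Airflow configuration):

```python
from urllib.parse import urlencode

def replication_url(solr_base, core, enable):
    """URL for Solr's replication handler. Issuing disablereplication
    on the leader pauses replication so followers keep serving the old
    index while we load; enablereplication turns it back on."""
    command = "enablereplication" if enable else "disablereplication"
    return f"{solr_base}/{core}/replication?{urlencode({'command': command})}"

def commit_url(solr_base, core):
    """A hard commit makes the newly loaded documents visible."""
    return f"{solr_base}/{core}/update?{urlencode({'commit': 'true'})}"

base = "http://localhost:8983/solr"  # placeholder host

print(replication_url(base, "catalog", enable=False))
print(commit_url(base, "catalog"))
print(replication_url(base, "catalog", enable=True))
```

An Airflow task would issue simple HTTP GETs against these URLs before and after the traject/delete steps.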
- What does the broader infrastructure look like?
- 3 tier system environment
- qa (CD on merge to master)
- stage (CD on merge to release-candidate)
- prod (CD on release)
- How do we deploy Airflow?
- Terraform building the box
- ansible adds airflow and known dags
- circle_ci/travis handles CD to stage based on tags
- What happens when a single record fails?
- What happens when Alma OAI fails?
- How do we handle solr changes?
- How do we use our patched traject?
- How to decide which version of tul_cob/traject configs to use?
- RVM / bundler stuff seems like it should be handled by Ansible?
- What does notification look like? Slack, email?
- Is a separate repo for each dag what we actually need? Can we have one big dag?
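On the one-big-DAG question: the steps in these notes already form a single dependency graph. A pure-Python sketch of that graph (task names taken from the notes; the structure itself is a hypothesis, not a decision):

```python
# Hypothetical task graph for a single harvest-to-Solr DAG.
DEPS = {
    "harvest_oai": [],
    "pause_replication": [],
    "traject_upserts": ["harvest_oai", "pause_replication"],
    "delete_records": ["harvest_oai", "pause_replication"],
    "solr_commit": ["traject_upserts", "delete_records"],
    "archive_xml": ["solr_commit"],
    "update_harvest_date": ["solr_commit"],
    "resume_replication": ["solr_commit"],
}

def topo_order(deps):
    """Kahn's algorithm: one valid execution order for the tasks."""
    remaining = {task: set(d) for task, d in deps.items()}
    order = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if not d)
        if not ready:
            raise ValueError("cycle in task graph")
        for task in ready:
            order.append(task)
            del remaining[task]
        for d in remaining.values():
            d.difference_update(ready)
    return order

print(topo_order(DEPS))
```

If one DAG suffices, each node becomes an Airflow task and the repo-per-DAG question reduces to where this one graph definition lives.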