Babel creates cliques of equivalent identifiers across many biomedical vocabularies.

Babel

Introduction

The Biomedical Data Translator integrates data across many data sources. One source of difficulty is that different data sources use different vocabularies: one may represent water as MESH:D014867, while another may use the identifier DRUGBANK:DB09145. When integrating, we need to recognize that both of these identifiers identify the same concept.

Babel integrates the specific naming systems used in the Translator, creating equivalent sets across multiple semantic types following the conventions established by the Biolink Model. Each semantic type (such as biolink:SmallMolecule) requires specialized processing, but in each case, a JSON-formatted compendium is written to disk. This compendium can be used directly, but it can also be served by the Node Normalization service or another frontend.

In certain contexts, differentiating between some related concepts doesn't make sense: for example, you might not want to differentiate between a gene and the protein that is the product of that gene. Babel provides different conflations that group cliques on the basis of various criteria: for example, the GeneProtein conflation combines a gene with the protein that that gene encodes.

While generating these cliques, Babel also collects all the synonyms for every clique, which can then be used by tools like Name Resolver (NameRes) to provide name-based lookup of concepts.

Using Babel outputs

What do Babel outputs look like?

Three Babel data formats are available:

  • Compendium files contain concepts (sets or "cliques" of equivalent identifiers), which include a preferred identifier, a Biolink type, and a list of equivalent identifiers, as well as other information about the concept (such as descriptions, information content values, and so on).
  • Synonym files, which don't include the equivalent identifiers for each concept, but do include every known synonym for each concept. These files can be directly loaded into an Apache Solr database for querying. The Name Resolver contains scripts for loading these files and provides a frontend that can be used to search for concepts by label or synonym, or to provide an autocomplete service for Babel concepts.
  • Conflation files contain the lists of concepts that should be conflated when that conflation is turned on.
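As an illustrative sketch of working with a compendium file, the snippet below parses one JSON-lines record. The field names ("type", "ic", "identifiers", "i", "l") are assumptions for illustration; check an actual Babel release for the authoritative schema.

```python
import json

# Illustrative compendium record -- the field names used here are
# assumptions, not a guaranteed schema.
record_line = """
{"type": "biolink:SmallMolecule", "ic": 85.6,
 "identifiers": [{"i": "CHEBI:15377", "l": "water"},
                 {"i": "MESH:D014867", "l": "Water"},
                 {"i": "DRUGBANK:DB09145", "l": "Water"}]}
"""

record = json.loads(record_line)
# By convention, the first identifier in the list is the preferred one.
preferred = record["identifiers"][0]["i"]
print(record["type"], preferred)
```

Each line of a compendium file is one such record, so the files can be streamed line by line rather than loaded whole.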

How can I download Babel outputs?

Information on downloading Babel outputs is available, and you can find a list of Babel releases in the Releases list.

How can I deploy Babel outputs?

Information on deploying Babel outputs is available.

How can I access Babel cliques?

There are several ways of accessing Babel cliques:

  • You can run the Babel pipeline to generate the cliques yourself. Note that Babel currently has very high memory requirements -- it needs around 500 GB of memory to generate the Protein compendium. Information on running Babel is available.
  • The NCATS Translator project provides the Node Normalization frontend to "normalize" identifiers -- any member of a particular clique will be normalized to the same preferred identifier, and the API will return all the secondary identifiers, Biolink type, description and other useful information. You can find out more about this frontend on its GitHub repository.
  • The NCATS Translator project also provides the Name Lookup (Name Resolution) frontends for searching for concepts by labels or synonyms. You can find out more about this frontend at its GitHub repository.
  • Explore the Babel Downloads directly, which are currently available in JSONL, Apache Parquet and KGX formats.

What is the Node Normalization service (NodeNorm)?

The Node Normalization service, Node Normalizer or NodeNorm is an NCATS Translator web service to normalize identifiers by returning a single preferred identifier for any identifier provided.

In addition to returning the preferred identifier and all the secondary identifiers for a clique, NodeNorm will also return its Biolink type and "information content" score, and optionally any descriptions we have for these identifiers.

It also includes some endpoints for normalizing an entire TRAPI message and other APIs intended primarily for Translator users.

You can find out more about NodeNorm at its Swagger interface or in this Jupyter Notebook.
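As a sketch, a NodeNorm query can be assembled like this. The base URL here is an assumption (a commonly deployed public instance); use the NodeNorm instance appropriate for your environment, and treat its Swagger interface as the authoritative API reference.

```python
from urllib.parse import urlencode

# Assumed base URL of a public NodeNorm deployment -- substitute your own.
NODENORM = "https://nodenormalization-sri.renci.org/get_normalized_nodes"

def build_normalize_url(curies, include_descriptions=False):
    """Build a GET URL asking NodeNorm to normalize one or more CURIEs."""
    params = [("curie", c) for c in curies]
    if include_descriptions:
        params.append(("description", "true"))
    return NODENORM + "?" + urlencode(params)

url = build_normalize_url(["MESH:D014867", "DRUGBANK:DB09145"],
                          include_descriptions=True)
print(url)
```

Both CURIEs above belong to the same clique, so the service would map them to the same preferred identifier.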

What is the Name Resolver (NameRes)?

The Name Resolver, Name Lookup or NameRes is an NCATS Translator web service for looking up preferred identifiers by search text. Although it is primarily designed to be used to power NCATS Translator's autocomplete text fields, it has also been used for named-entity linkage.

You can find out more about NameRes at its Swagger interface or in this Jupyter Notebook.
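A NameRes lookup can be sketched the same way. The base URL and parameter names below are assumptions; the service's Swagger interface is the authoritative reference.

```python
from urllib.parse import urlencode

# Assumed base URL of a public NameRes deployment -- substitute your own.
NAMERES = "https://name-resolution-sri.renci.org/lookup"

def build_lookup_url(text, limit=10):
    """Build a GET URL for an autocomplete-style NameRes lookup."""
    return NAMERES + "?" + urlencode({"string": text, "limit": limit})

url = build_lookup_url("aspir")  # partial text, as typed in an autocomplete box
print(url)
```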

Understanding Babel outputs

How does Babel choose a preferred identifier for a clique?

After determining the equivalent identifiers that belong in a single clique, Babel sorts them in the order of CURIE prefixes for that Biolink type as determined by the Biolink Model. For example, for a biolink:SmallMolecule, any CHEBI identifiers will appear first, followed by any UNII identifiers, and so on. The first identifier in this list is the preferred identifier for the clique.
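The selection above can be sketched as a sort over CURIE prefixes. The prefix order below is a truncated, illustrative list, not the full order from the Biolink Model.

```python
# Illustrative, truncated prefix order for biolink:SmallMolecule; the real
# order comes from the Biolink Model.
PREFIX_ORDER = ["CHEBI", "UNII", "PUBCHEM.COMPOUND"]

def preferred_identifier(curies):
    """Sort equivalent CURIEs by prefix rank; the first one is preferred."""
    def rank(curie):
        prefix = curie.split(":", 1)[0]
        return (PREFIX_ORDER.index(prefix)
                if prefix in PREFIX_ORDER else len(PREFIX_ORDER))
    return sorted(curies, key=rank)[0]

print(preferred_identifier(["PUBCHEM.COMPOUND:962", "UNII:059QF0KO0R",
                            "CHEBI:15377"]))
```

Identifiers with prefixes not in the list sort after all listed prefixes.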

Conflations are lists of identifiers that are merged in that order when that conflation is applied. The preferred identifier for the conflated clique is therefore the preferred identifier of the first clique being conflated.

  • For GeneProtein conflation, the preferred identifier is a gene.
  • For DrugChemical conflation, Babel uses the following algorithm:
    1. We first choose an overall Biolink type for the conflated clique. To do this, we use a "preferred Biolink type" order that can be configured in config.yaml and choose the most preferred Biolink type that is present in the conflated clique.
    2. We then group the cliques to be conflated by the prefix of their preferred identifier, and sort them based on the preferred prefix order for the chosen Biolink type.
    3. If there are multiple cliques with the same prefix in their preferred identifier, we use the following criteria to sort them:
      1. A clique with a lower information content value will be sorted before those with a higher information content or no information content at all.
      2. A clique with more identifiers is sorted before those with fewer identifiers.
      3. A clique whose preferred identifier has a smaller numerical suffix will be sorted before those with a larger numerical suffix.
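The tie-break ordering above can be expressed as a Python sort key, as in the sketch below. The clique representation (a dict with "ic" and "identifiers" fields) is illustrative, not Babel's actual internal data structure.

```python
import re

def numeric_suffix(curie):
    """Return the trailing number of a CURIE, or infinity if there is none."""
    match = re.search(r"(\d+)$", curie)
    return int(match.group(1)) if match else float("inf")

def tiebreak_key(clique):
    """Sort key for conflated cliques sharing a preferred-identifier prefix."""
    ic = clique.get("ic")
    return (
        ic is None,                                # cliques with an IC sort first
        ic if ic is not None else 0.0,             # then lower IC before higher
        -len(clique["identifiers"]),               # then more identifiers first
        numeric_suffix(clique["identifiers"][0]),  # then smaller numeric suffix
    )

cliques = [
    {"ic": 90.0, "identifiers": ["CHEBI:200"]},
    {"ic": 80.0, "identifiers": ["CHEBI:100"]},
    {"ic": None, "identifiers": ["CHEBI:50", "MESH:D1"]},
]
ordered = sorted(cliques, key=tiebreak_key)
print([c["identifiers"][0] for c in ordered])
```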

How does Babel choose a preferred label for a clique?

For most Biolink types, the preferred label for a clique is the label of the preferred identifier. There is a demote_labels_longer_than configuration parameter that -- if set -- will cause labels that are longer than the specified number of characters to be ignored unless no labels shorter than that length are present. This is to avoid overly long labels when a more concise label is available.
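A minimal sketch of that demotion behaviour, assuming the labels arrive already sorted by identifier preference (the function name and signature are illustrative, not Babel's actual API):

```python
def preferred_label(labels, demote_labels_longer_than=None):
    """Pick the first label, skipping over-long labels when shorter ones exist.

    Assumes `labels` is already in identifier-preference order.
    """
    if demote_labels_longer_than is not None:
        short = [l for l in labels if len(l) <= demote_labels_longer_than]
        if short:
            return short[0]
    return labels[0] if labels else None

labels = ["(2S)-2-amino-3-(4-hydroxyphenyl)propanoic acid", "tyrosine"]
print(preferred_label(labels, demote_labels_longer_than=20))
```

With the parameter unset, the long systematic name would win because it belongs to the more preferred identifier.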

Biolink types that are chemicals (i.e. biolink:ChemicalEntity and its subclasses) have a special list of preferred name boost prefixes that are used to prioritize labels. This list is currently:

  1. DRUGBANK
  2. DrugCentral
  3. CHEBI
  4. MESH
  5. CHEMBL.COMPOUND
  6. GTOPDB
  7. HMDB
  8. RXCUI
  9. PUBCHEM.COMPOUND
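One way to picture how the boost list works is the sketch below: labels attached to identifiers with earlier prefixes in the list win. This is an illustration, not Babel's actual implementation.

```python
# Boost-prefix order as listed above.
BOOST_PREFIXES = ["DRUGBANK", "DrugCentral", "CHEBI", "MESH",
                  "CHEMBL.COMPOUND", "GTOPDB", "HMDB", "RXCUI",
                  "PUBCHEM.COMPOUND"]

def boosted_label(labelled_ids):
    """Pick the label from the identifier with the most-boosted prefix.

    `labelled_ids` is a list of (curie, label) pairs -- an illustrative
    representation, not Babel's internal one.
    """
    def rank(pair):
        prefix = pair[0].split(":", 1)[0]
        return (BOOST_PREFIXES.index(prefix)
                if prefix in BOOST_PREFIXES else len(BOOST_PREFIXES))
    return min(labelled_ids, key=rank)[1]

print(boosted_label([("PUBCHEM.COMPOUND:962", "oxidane"),
                     ("CHEBI:15377", "water")]))
```

Here the CHEBI label wins over the PUBCHEM.COMPOUND label because CHEBI appears earlier in the boost list.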

Conflations are lists of identifiers that are merged in that order when that conflation is applied. The preferred label for the conflated clique is therefore the preferred label of the first clique being conflated.

Where do the clique descriptions come from?

Currently, all descriptions for NodeNorm concepts come from Ubergraph. Descriptions are collected for every identifier within a clique, and the description associated with the most preferred identifier is reported for the clique. Descriptions are not included in NameRes, but the description flag can be used to include any descriptions when returning cliques from NodeNorm.

What are "information content" values?

Babel obtains information content values for over 3.8 million concepts from Ubergraph based on the number of terms related to the specified term as either a subclass or any existential relation. They are decimal values that range from 0.0 (high-level broad term with many subclasses) to 100.0 (very specific term with no subclasses).

Reporting incorrect Babel cliques

I've found two or more identifiers in separate cliques that should be considered identical

Please report this "split" clique as an issue to the Babel GitHub repository. At a minimum, please include the CURIEs of the identifiers that should be combined. Links to a NodeNorm instance showing the two cliques are very helpful. Evidence supporting the lumping, such as a link to an external database that makes it clear that these identifiers refer to the same concept, is also very helpful: while we have some ability to combine cliques manually if needed urgently for some application, we prefer to find a source of mappings that would combine the two identifiers, allowing us to improve cliquing across Babel.

I've found two or more identifiers combined in a single clique that actually identify different concepts

Please report this "lumped" clique as an issue to the Babel GitHub repository. At a minimum, please include the CURIEs of the identifiers that should be split. A link to a NodeNorm instance showing the lumped clique is very helpful. Evidence supporting the split, such as a link to an external database that makes it clear that these identifiers refer to different concepts, is also very helpful: while we have some ability to split cliques manually if needed urgently for some application, we prefer to find the source of the incorrect mapping that lumped the identifiers together, allowing us to improve cliquing across Babel.

Running Babel

How can I run Babel?

Babel is difficult to run, primarily because of its inefficient memory handling -- we currently need around 500 GB of memory to build the largest compendia (Protein and the DrugChemical-conflated information), although the smaller compendia should be buildable with far less memory. We are working on reducing these requirements as much as possible. You can read more about Babel's build process, and please do contact us if you run into any problems or would like some assistance.

We have detailed instructions for running Babel, but the short version is:

  • We use uv to manage Python dependencies. You can use the Docker image if you run into any difficulty setting up the prerequisites.
  • We use Snakemake to orchestrate the build workflow and its dependencies.

Therefore, you should be able to run Babel by cloning this repository and running:

$ uv run snakemake --cores [NUMBER OF CORES TO USE]

The ./slurm/run-babel-on-slurm.sh Bash script can be used to start running Babel as a Slurm job. You can set the BABEL_VERSION environment variable to document which version of Babel you are running.

Contributing to Babel

If you want to contribute to Babel, start with the Contributing to Babel documentation. This will provide guidance on how the source code is organized, what contributions are most useful, and how to run the tests.

Contact information

You can find out more about Babel by opening an issue on this repository, contacting one of the Translator DOGSLED PIs or contacting the NCATS Translator team.
