'correct' but misleading name-matching results #280

@mjwestgate

Some taxa are of conservation importance but are not taxonomically recognised. For example, we can look up the Victorian conservation list:

library(galah)
library(dplyr)

show_all(lists) |>
  filter(isAuthoritative == TRUE,
         region == "Victoria")
# A tibble: 1 × 22
  species_list_uid listName         description listType dateCreated lastUpdated lastUploaded
  <chr>            <chr>            <chr>       <chr>    <chr>       <chr>       <chr>       
1 dr655            Victoria : Cons… ""          CONSERV… 2015-04-04… 2025-07-08… 2025-07-08T…
# ℹ 15 more variables: lastMatched <chr>, username <chr>, itemCount <int>, region <chr>,
#   isAuthoritative <lgl>, isInvasive <lgl>, isThreatened <lgl>, isBIE <lgl>, isSDS <lgl>,
#   wkt <chr>, category <chr>, generalisation <chr>, authority <chr>, sdsType <chr>,
#   looseSearch <lgl>

Then look up which species are on that list, and filter to those whose matched scientificName is a single word:

species_list <- request_metadata() |>
    filter(list == "dr655") |>
    unnest() |>
    collect()

species_list |>
    filter(grepl("^[[:alpha:]]+$", scientificName))

# A tibble: 3 × 6
       id name                      commonName       scientificName lsid      dataResourceUid
    <int> <chr>                     <chr>            <chr>          <chr>     <chr>          
1 6793854 Chiastocaulon biseriale   NA               Chiastocaulon  NZOR-6-7… dr655          
2 6794205 Eucalyptus X oxypoma      Studley Park Gum Eucalyptus     https://… dr655          
3 6795458 Eucalyptus X studleyensis Studley Park Gum Eucalyptus     https://… dr655  

Each of these entries is supplied as a species-level name, but the match returned is a genus. We can confirm this by running the same name through search_taxa(), e.g.

search_taxa("Eucalyptus X studleyensis")
# A tibble: 1 × 14
  search_term        scientific_name scientific_name_auth…¹ taxon_concept_id rank  match_type
  <chr>              <chr>           <chr>                  <chr>            <chr> <chr>     
1 Eucalyptus X stud… Eucalyptus      L'Hér.                 https://id.biod… genus exactMatch
# ℹ abbreviated name: ¹​scientific_name_authorship
# ℹ 8 more variables: kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
#   genus <chr>, vernacular_name <chr>, issues <chr>

Again, the name is linked to the taxon concept for "Eucalyptus", and match_type is reported as exactMatch, so we wouldn't normally flag this as an error. The problem is that passing this name to identify() in a piped query returns all Eucalyptus observations, which is almost certainly not what the user wants:

galah_call() |>
    identify("Eucalyptus X studleyensis") |>
    group_by(scientificName) |>
    count() |>
    collect()
# A tibble: 1,193 × 2
   scientificName           count
   <chr>                    <int>
 1 Eucalyptus               44131
 2 Eucalyptus obliqua       43001
 3 Eucalyptus camaldulensis 41418
 4 Eucalyptus sieberi       25192
 5 Eucalyptus melliodora    24164
 6 Eucalyptus crebra        22957
 7 Eucalyptus globoidea     22700
 8 Eucalyptus macrorhyncha  21812
 9 Eucalyptus tereticornis  21447
10 Eucalyptus muelleriana   18830
# ℹ 1,183 more rows
# ℹ Use `print(n = ...)` to see more rows

In summary, the ALA sometimes returns a match that is technically correct but too coarse to be useful, and none of the fields we would usually check (such as match_type) flag the problem.
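
To find these cases across a whole list, rather than only the single-word matches filtered earlier, one rough option is to compare the number of words in the supplied name with the number in the matched scientificName. The sketch below uses the three rows printed earlier as a stand-in for species_list; count_words() and flag_coarse_matches() are hypothetical helpers, not part of galah, and the word-count comparison is a crude heuristic (names carrying authorship strings would inflate the counts).

```r
# Stand-in for `species_list`, built from the three rows printed earlier
species_list <- data.frame(
  name = c("Chiastocaulon biseriale",
           "Eucalyptus X oxypoma",
           "Eucalyptus X studleyensis"),
  scientificName = c("Chiastocaulon", "Eucalyptus", "Eucalyptus"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of galah): count whitespace-separated words
count_words <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

# Hypothetical helper: keep rows where the matched name has fewer words than
# the supplied name, i.e. the match was probably made at a higher rank
flag_coarse_matches <- function(df) {
  df[count_words(df$scientificName) < count_words(df$name), , drop = FALSE]
}

flag_coarse_matches(species_list)
```

Applied to the full list, this would also pick up subspecies or varieties that were matched at species level, which may or may not be wanted.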

One solution might be to show the user what taxon rank is being returned by search_taxa(), for example by grouping by the rank column:

search_taxa("Eucalyptus X studleyensis") |>
    group_by(rank) |>
    summarize(count = n())
# A tibble: 1 × 2
  rank  count
  <chr> <int>
1 genus     1

This wouldn't help much in piped queries, but for taxon queries it might highlight unexpected behaviour. It would probably need to be controlled by the verbose argument of galah_config().
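
As a rough illustration of that idea, the rank check could be a small helper that takes the output of search_taxa() and warns when a multi-word search term resolves to genus rank or higher. This is only a sketch: warn_coarse_rank() is hypothetical and not part of galah, the set of ranks treated as "coarse" is an assumption, and verbose is a plain argument here rather than being read from galah_config(). Only the search_term and rank columns shown in the output above are used.

```r
# Hypothetical helper (not part of galah). `taxa` is assumed to have the
# `search_term` and `rank` columns shown in the search_taxa() output above.
warn_coarse_rank <- function(taxa, verbose = TRUE,
                             coarse_ranks = c("kingdom", "phylum", "class",
                                              "order", "family", "genus")) {
  n_words <- lengths(strsplit(trimws(taxa$search_term), "[[:space:]]+"))
  # A multi-word term implies species rank or below; flag coarser matches
  is_coarse <- n_words > 1 & taxa$rank %in% coarse_ranks
  if (verbose && any(is_coarse)) {
    warning(
      "Search term(s) matched at a coarser rank than expected: ",
      paste0("'", taxa$search_term[is_coarse], "' -> ",
             taxa$rank[is_coarse], collapse = "; "),
      call. = FALSE
    )
  }
  # Return the logical flags invisibly so callers can filter on them
  invisible(is_coarse)
}

# Stand-in for search_taxa("Eucalyptus X studleyensis"), using values shown above
taxa <- data.frame(search_term = "Eucalyptus X studleyensis",
                   rank = "genus",
                   stringsAsFactors = FALSE)
warn_coarse_rank(taxa)
```

Returning the flags invisibly means the same helper could be used to stop before a query is sent, not only to print a message.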

Labels: discussion (Unclear what is the best way forward)
