Updating offsets when a text resource is altered #26

@ngawangtrinley

One of the main challenges our project faces is that we have multiple copies of the same text resource with varying degrees of cleanliness and annotation. For instance, we may have 50 instances of the Heart Sutra, where the cleanest one has no TOC annotations and a very dirty version has great NER tags. In some cases we might also have only a bad-quality text resource that is being proofread and annotated over the course of a year.

Our goal is to be able to combine the best aspects of all resources and annotations at any given time.

In other words, we see STAM as the pivot format that will link Buddhist data in archives like BDRC, SuttaCentral or CBETA and websites like 84000 and pecha.org, which means that we will have to update, split and merge text resources and annotations on a regular basis.

We are also putting together training datasets for the project monlam.ai, which also requires annotation transfer. For instance, our MT model currently suffers from a lot of typos in our 2 million aligned sentence dataset, and we need to transfer the segment annotations to the cleaner versions of the texts we are currently producing.

A couple of years ago, our team came up with an "annotation transfer" or "base text update" mechanism combining our CCTV algorithm with Google's Diff Match Patch package.
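
To make the idea of a "base text update" concrete, here is a minimal sketch of moving one annotation's character offsets from an old version of a text to a new one. It is only an illustration under stated assumptions: it uses Python's standard `difflib` in place of Diff Match Patch, it is not the CCTV algorithm mentioned above, and the function name `transfer_span` is hypothetical.

```python
from difflib import SequenceMatcher


def transfer_span(old_text, new_text, begin, end):
    """Map the half-open span [begin, end) of old_text onto new_text.

    Hypothetical helper for illustration: offsets inside unchanged
    regions are shifted, offsets inside changed regions snap outward
    to the edges of the replacement text.
    """
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    opcodes = matcher.get_opcodes()

    def map_begin(pos):
        # Find the old-text region that contains the character at pos.
        for tag, i1, i2, j1, j2 in opcodes:
            if i1 <= pos < i2:
                return j1 + (pos - i1) if tag == "equal" else j1
        return len(new_text)  # pos is at the very end of old_text

    def map_end(pos):
        # An exclusive end belongs to the region of the character before it.
        for tag, i1, i2, j1, j2 in opcodes:
            if i1 < pos <= i2:
                return j1 + (pos - i1) if tag == "equal" else j2
        return 0  # pos is at the very start of old_text

    return map_begin(begin), map_end(end)


old = "heart sutre text"
new = "the heart sutra text"
begin, end = transfer_span(old, new, 6, 11)  # span over the typo "sutre"
print(new[begin:end])  # prints "sutra"
```

Snapping to the edges of a changed region is only one possible policy. A real transfer step would also need to decide what to do when the annotated text is deleted entirely or when a resource is split or merged.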

What would be your approach to tackle this challenge with STAM?

Labels: question
