Description
Idea / Feature
Annotate all sentences in CDP transcripts with the speaker (or specifically leave them unlabeled, or use a specific demarcation, for public comment and similar situations where the speaker isn't a defined representative).
Use Case / User Story
As a user I want to be able to:
- quickly parse transcripts for portions of text stated by my representative.
- read the transcript without listening or watching and know who said something.
Solution
Every sentence is linked to a "known person" of the council. If the individual isn't a "known person", i.e. an employee, public commenter, presenter, etc., label them as "Unknown Speaker" or, if possible, group them as "Unknown Speaker 1", "Unknown Speaker 2", "Unknown Speaker 3", etc.
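For the grouping step, one option would be to cluster speaker embeddings of the unlabeled segments. A minimal sketch, assuming we already have a fixed-length embedding per segment from some speaker embedding model; the function name and threshold below are hypothetical, not a proposed API:

```python
# Sketch only: group "Unknown Speaker" segments by clustering their embeddings.
# Assumes `embeddings` is an (n_segments, embedding_dim) array produced elsewhere.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def group_unknown_speakers(embeddings: np.ndarray, distance_threshold: float = 1.0) -> list[str]:
    """Cluster unknown-speaker segment embeddings and return a label per segment."""
    clustering = AgglomerativeClustering(
        n_clusters=None,                      # let the distance threshold decide how many speakers
        distance_threshold=distance_threshold,
        linkage="average",
    )
    cluster_ids = clustering.fit_predict(embeddings)
    return [f"Unknown Speaker {cluster_id + 1}" for cluster_id in cluster_ids]
```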
That said, to get this to scale to any CDP deployment, an initial idea for recording the "metadata for a training dataset" would be something like the following: a file that lives in the deployment's repository called speaker-classification.json, mapping each person ID to a list of dictionaries with details for clips of that individual speaking:
{
  "abcdefg": [
    {
      "audio_uri": "gs://...",
      "start_time": 12.34,
      "end_time": 56.78
    },
    {
      "audio_uri": "gs://...",
      "start_time": 45.25,
      "end_time": 156.91
    }
  ],
  "quhaksd": [
    {
      "audio_uri": "gs://...",
      "start_time": 2548.12,
      "end_time": 2596.83
    }
  ]
}

(This could also be a CSV or YAML file.)
This allows the instance maintainer(s) to add more clips of each individual easily.
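As a rough illustration of how a training pipeline might consume this file, here is a minimal loading and sanity-checking sketch; the SpeakerClip dataclass and the validation rule are illustrative, not a proposed API:

```python
# Sketch: load and sanity-check speaker-classification.json as described above.
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SpeakerClip:
    person_id: str
    audio_uri: str
    start_time: float
    end_time: float


def load_speaker_clips(path: str = "speaker-classification.json") -> list[SpeakerClip]:
    data = json.loads(Path(path).read_text())
    clips = []
    for person_id, clip_dicts in data.items():
        for clip in clip_dicts:
            if clip["end_time"] <= clip["start_time"]:
                raise ValueError(f"Clip for {person_id} has non-positive duration: {clip}")
            clips.append(
                SpeakerClip(
                    person_id=person_id,
                    audio_uri=clip["audio_uri"],
                    start_time=clip["start_time"],
                    end_time=clip["end_time"],
                )
            )
    return clips
```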
Generally, we should use transfer learning: take a pre-existing audio classification model and extend its training to classify each speaker in the dataset.
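For example, a minimal fine-tuning setup could look like the following, assuming Hugging Face transformers and a Wav2Vec2 base checkpoint purely as stand-ins; the actual base model, freezing strategy, and training loop are open questions:

```python
# Sketch: freeze a pretrained audio encoder and train only a classification head
# over the known speakers. The checkpoint and hyperparameters are placeholders.
import torch
from transformers import Wav2Vec2ForSequenceClassification

NUM_KNOWN_SPEAKERS = 9  # e.g. council members in one deployment (illustrative)

model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=NUM_KNOWN_SPEAKERS,
)

# Freeze the pretrained encoder; only the classification head stays trainable.
for param in model.wav2vec2.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...standard training loop over (waveform, speaker_label) batches goes here...
```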
We will need to build tools that make downloading audio clips easy, as we can't assume that maintainers will have as thorough an understanding of Python and our file storage system as we do.
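A helper along these lines is roughly what I have in mind; it assumes fsspec/gcsfs for gs:// access and an ffmpeg binary on PATH, both of which are assumptions rather than decisions:

```python
# Sketch of a small helper a maintainer could run to pull down their labeled clips.
import subprocess
from pathlib import Path

import fsspec


def download_clip(audio_uri: str, start_time: float, end_time: float, out_path: Path) -> None:
    """Download a full audio file from storage, then trim it to the labeled window."""
    full_audio = out_path.with_suffix(".full.wav")
    with fsspec.open(audio_uri, "rb") as remote, open(full_audio, "wb") as local:
        local.write(remote.read())

    # Trim to the labeled window without re-encoding.
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start_time),
            "-i", str(full_audio),
            "-t", str(end_time - start_time),
            "-c", "copy",
            str(out_path),
        ],
        check=True,
    )
    full_audio.unlink()
```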
Additionally, we should develop a GitHub Action to:
- Kick off the model training on a Google Cloud instance.
- Report the accuracy, precision, and recall of the model, e.g. if accuracy is too low, maintainers may need to add more training data.
- Deploy the model / store the trained model in Google Cloud Storage for use in another pipeline for tagging each sentence's speaker.
The GitHub Action we develop to handle the above should be maintained by us as a separate repo from the cookiecutter. This will allow us to enforce thresholds for results: if we have a repo specifically for this speaker classification bot, we can maintain, in code, CDP-wide thresholds for accuracy, etc., rather than storing those values in a config on the cookiecutter, where the instance maintainer(s) could freely edit them and lower the thresholds, allowing sentences to be mis-classified.
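Concretely, that separate repo could keep the thresholds in code and fail the Action's job when a trained model falls short; the numbers below are placeholders pending the stakeholder discussion:

```python
# Sketch: CDP-wide thresholds live in the speaker classification bot's own repo,
# so instance maintainers can't quietly lower them. Values are placeholders.
import sys

MIN_ACCURACY = 0.95
MIN_PRECISION = 0.95
MIN_RECALL = 0.95


def check_model_metrics(accuracy: float, precision: float, recall: float) -> None:
    """Fail the CI job if any metric is below its CDP-wide minimum."""
    failures = []
    if accuracy < MIN_ACCURACY:
        failures.append(f"accuracy {accuracy:.3f} < {MIN_ACCURACY}")
    if precision < MIN_PRECISION:
        failures.append(f"precision {precision:.3f} < {MIN_PRECISION}")
    if recall < MIN_RECALL:
        failures.append(f"recall {recall:.3f} < {MIN_RECALL}")
    if failures:
        print("Model rejected; consider adding more training clips:", *failures, sep="\n  ")
        sys.exit(1)
    print("Model meets CDP-wide thresholds; proceeding to deployment.")
```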
Alternatives
Stakeholders
Every person working on the project. This generally requires ML, and as such we need to collectively decide on our minimum allowed accuracy, precision, and recall for any deployed model.
We should also ask users how they would feel about such a model.
Major Components
- Agree on standards and deployment strategy
- Prototype transfer learning audio classification model
- Scale up prototype with more robust testing
- Create GitHub Action for initiating training and validation of model
- Configure and plan deployment and storage strategy
- Write pipeline in cdp-backend for applying the model to each sentence of a transcript (roughly sketched below)
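A rough sketch of that tagging step, using an assumed transcript structure and classifier interface rather than cdp-backend's actual data model:

```python
# Sketch of the per-sentence tagging loop only; field names are assumptions.
from typing import Callable, Optional


def annotate_transcript_speakers(
    sentences: list[dict],
    classify_speaker: Callable[[str, float, float], Optional[str]],
    audio_uri: str,
) -> list[dict]:
    """Attach a speaker label to each sentence, falling back to 'Unknown Speaker'."""
    for sentence in sentences:
        speaker = classify_speaker(audio_uri, sentence["start_time"], sentence["end_time"])
        sentence["speaker"] = speaker if speaker is not None else "Unknown Speaker"
    return sentences
```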