Sequential Topic Segmentation / Session Chapters #9

@evamaxfield

Use Case

Please provide a use case to help us understand your request in context

YouTube has a "Video Chapters" feature that splits the timeline bar into chapters based off of timestamps found in the video description. Example:

[Image: chapter-example]

Similarly, it would be incredibly useful to jump around a meeting video / transcript based off of the minutes items of the meeting.

Solution

Please describe your ideal solution

This is going to take a lot of work on the backend and a bit of work on the frontend.

We could be fancy and train a topic model or use some sort of seeded clustering, and we likely will at some point, but as a first-pass implementation it may be interesting to see how far the following gets us:

Look for common phrases ("Moving on to...", "Call the roll", "Attendance", etc.) and apply breakpoints there.
Additionally, parse all the minutes item attachments (docs, presentations, etc.) for every minutes item of an event and store the list of words UNIQUE to each minutes item. Then search the transcript for those words. Find the breakpoints by taking a moving window sum, over the transcript, of the counts of each minutes item's unique words.

I.e.

"minutes_item_1": ["municipal", "broadband", "light"],
"minutes_item_2": ["it", "department"],

lets talk about the municipal broadband bill that would enable seattle city light to serve customers with broadband...
...
moving on to funding the seattle IT department...
...

The moving window word count would be able to see that at some point we switch from using words specific to minutes item 1 to using words specific to minutes item 2. If we can combine that with looking for the "section splitter sequences" ("moving on", "call the roll", etc.), I think it may be a good first-pass chapter identifier that is fast and cheap.
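To make the moving window idea concrete, here is a rough sketch run against the example above. The function names (`window_scores`, `breakpoints`), the window size of 5 tokens, and the rule "a chapter starts when the top-scoring minutes item changes" are all assumptions for illustration, not a settled design.

```python
from collections import Counter


def window_scores(tokens, unique_words, window=5):
    """For each window start, sum the hits of each minutes item's unique words."""
    scores = []
    for start in range(max(1, len(tokens) - window + 1)):
        chunk = Counter(tokens[start:start + window])
        scores.append({
            item: sum(chunk[word] for word in words)
            for item, words in unique_words.items()
        })
    return scores


def breakpoints(scores):
    """Return (window_index, item) each time the top-scoring item changes."""
    points = []
    current = None
    for index, row in enumerate(scores):
        best = max(row, key=row.get)
        # Ignore windows with no hits at all so silence does not create chapters.
        if row[best] > 0 and best != current:
            points.append((index, best))
            current = best
    return points


unique_words = {
    "minutes_item_1": ["municipal", "broadband", "light"],
    "minutes_item_2": ["it", "department"],
}
transcript = (
    "lets talk about the municipal broadband bill that would enable "
    "seattle city light to serve customers with broadband "
    "moving on to funding the seattle it department"
)
tokens = transcript.lower().split()
chapters = breakpoints(window_scores(tokens, unique_words, window=5))
print(chapters)  # [(0, 'minutes_item_1'), (20, 'minutes_item_2')]
```

One caveat this surfaces: a unique word like "it" will also match the pronoun in real transcripts, which is a good argument for using longer n-grams (e.g. "it department") or a stop word filter.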

Then store chapter identifiers as annotations in the transcript for the frontend to parse.
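As a sketch of what storing the annotations could look like: the transcript layout below (a `sentences` list with `start_time` and `text`) and the `chapter_start` annotation key are hypothetical placeholders, not the actual transcript schema.

```python
import json


def annotate_chapters(transcript, chapter_starts):
    """Return a copy of the transcript with chapter annotations attached.

    chapter_starts maps a sentence index to a chapter identifier.
    The "annotations" / "chapter_start" keys are hypothetical names.
    """
    annotated = {**transcript, "sentences": []}
    for index, sentence in enumerate(transcript["sentences"]):
        sentence = dict(sentence)  # copy so the input is not mutated
        if index in chapter_starts:
            annotations = dict(sentence.get("annotations") or {})
            annotations["chapter_start"] = chapter_starts[index]
            sentence["annotations"] = annotations
        annotated["sentences"].append(sentence)
    return annotated


transcript = {
    "sentences": [
        {"start_time": 0.0, "text": "Lets talk about the municipal broadband bill."},
        {"start_time": 95.5, "text": "Moving on to funding the Seattle IT department."},
    ]
}
annotated = annotate_chapters(
    transcript, {0: "minutes_item_1", 1: "minutes_item_2"}
)
print(json.dumps(annotated, indent=2))
```

Because each annotated sentence keeps its `start_time`, the frontend would only need to scan for sentences carrying a chapter annotation to build the timeline.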

Alternatives

Please describe any alternatives you've considered, even if you've dismissed them

Topic modeling? Clustering?

Additionally, whatever pipeline we create should be able to skip this step if chapter starts are provided by the user, as Seattle Channel event descriptions include them in most cases nowadays.
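A minimal sketch of that skip logic, assuming the provided chapter starts appear as YouTube-style "timestamp title" lines in the event description. I have not checked the exact Seattle Channel format, so the description text, `parse_chapter_starts`, and `get_chapters` are all illustrative.

```python
import re

# Matches "M:SS Title", "MM:SS Title", or "H:MM:SS Title" at the start of a line.
TIMESTAMP_LINE = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+?)\s*$")


def parse_chapter_starts(description):
    """Return a list of (seconds, title) for each timestamped line."""
    chapters = []
    for line in description.splitlines():
        match = TIMESTAMP_LINE.match(line)
        if match:
            hours, minutes, seconds, title = match.groups()
            total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            chapters.append((total, title))
    return chapters


def get_chapters(description, detect):
    """Use provided chapter starts when present, else fall back to detection."""
    provided = parse_chapter_starts(description)
    return provided if provided else detect()


# Made-up description text, for illustration only.
description = """Agenda:
0:00 Call to order
12:34 Municipal broadband
1:02:30 IT department funding"""

print(get_chapters(description, detect=lambda: []))
# [(0, 'Call to order'), (754, 'Municipal broadband'), (3750, 'IT department funding')]
```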

Stakeholders

Please add any individual people or teams that should be brought in for discussion on the project

Frontend to actually make the video chapters viewer.
Backend for both pipeline and transcript mutation.

Major Components

Please add any major components that need to be done for this project

  • Function to get unique n-grams from minutes item attachments for all minutes items in event
  • Function to apply moving window sum using unique n-grams
  • Function to find "common" section splitter sequences
  • Function to weight and merge moving window sum and section splitter sequences
  • Function to store into transcript as annotation
  • Pipeline to wrap the whole thing (possibly included as part of primary event processing, with an option to skip it?)
  • Frontend video timeline to parse and use chapter annotations
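To give a feel for the size of the backend pieces, here is a rough sketch of three of the functions listed above: unique n-grams, splitter detection, and a merge step. The names, the phrase list, and the merge rule (snap a window breakpoint to a splitter sentence within `tolerance` sentences) are assumptions; a real weighting scheme would need tuning against actual transcripts.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def unique_ngrams(attachment_text_by_item, n=1):
    """Keep only the n-grams that appear in exactly one minutes item."""
    grams = {item: ngrams(text, n) for item, text in attachment_text_by_item.items()}
    counts = Counter(g for item_grams in grams.values() for g in item_grams)
    return {
        item: {g for g in item_grams if counts[g] == 1}
        for item, item_grams in grams.items()
    }


# Phrase list taken from the examples in this issue; it would need to grow.
SPLITTERS = re.compile(r"\b(moving on to|call the roll|attendance)\b", re.IGNORECASE)


def splitter_sentences(sentences):
    """Return the indices of sentences containing a section splitter phrase."""
    return [i for i, sentence in enumerate(sentences) if SPLITTERS.search(sentence)]


def merge_breakpoints(window_breaks, splitter_breaks, tolerance=2):
    """Snap each window breakpoint to the nearest splitter within tolerance.

    Both inputs are sentence indices. Window breakpoints with no nearby
    splitter are kept as they are.
    """
    merged = []
    for index in window_breaks:
        nearby = [s for s in splitter_breaks if abs(s - index) <= tolerance]
        merged.append(min(nearby, key=lambda s: abs(s - index)) if nearby else index)
    return sorted(set(merged))


attachments = {
    "minutes_item_1": "Municipal broadband via Seattle City Light",
    "minutes_item_2": "Funding the Seattle IT department",
}
unique = unique_ngrams(attachments)
# "seattle" appears in both items, so it is dropped from both sets.
print(sorted(unique["minutes_item_2"]))  # ['department', 'funding', 'it', 'the']
```

With bigrams (`n=2`) the second item would instead yield phrases like "it department", which should be far less noisy than single words such as "it" or "the".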

Dependencies

Please add any other major or minor project dependencies here

Other Notes

Please add any extra notes here

My one concern is how to handle many-session events. We only store minutes items at the event level and not at the session level, so we will need to find a way to handle this gracefully.

Metadata

Labels: data, feature, proposal