fix(dataset) Fixing video indexing bug when using merged dataset | Fixes #[2328] | (🐛 Bug) | #2438
Conversation
Hey @andras-makany! I tested your PR; however, unfortunately it does not work. I am still receiving the same 'RuntimeError: Invalid frame index=52130 for streamIndex=0; must be less than 19675' when trying to train a model on the output dataset. What I did: I cloned your branch, installed lerobot from it, and ran the merge command. The output dataset is here: https://huggingface.co/datasets/Grigorij/xle_left_arm_merged_filtered_repaired/tree/main
Hey @Grigorij-Dudnik! Thank you for testing my attempt at this fix. Your attached dataset revealed a major problem. My dataset only had a single task, but yours has at least two. It seems that when the task description changes, the merging starts a new file. In this case, the offset was not reset, resulting in your error. I will look into a possible solution on Monday.
…ing occurs or having multiple episodes in one video file
@Grigorij-Dudnik So I found the problems that resulted in your error. The chunk and file indexing forced a single value onto every episode in a dataset. This caused your problem: some of your dataset's episodes were concatenated onto the previous file, but in the metadata they got the new file's index. The second error was due to having multiple episodes in one file. Your datasets had one episode per video, which meant that the episode count was equal to the loop count. Mine had multiple episodes per video (as does the resulting dataset). This caused missing indexes in my dataset when the aggregation logic was corrected for your case. As a solution, instead of a single value for the chunk and file index, I am using a list to keep track of the resulting indexes. The resulting dataset is correct both for your case and for mine. Tests were fine.
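A rough sketch of the bookkeeping change described above, using hypothetical names rather than lerobot's actual aggregation code: keeping one file index per episode in a list means episodes that were concatenated onto an earlier file keep that file's index in the metadata, instead of all episodes being stamped with the last file's index.

```python
# Hypothetical sketch only; names do not mirror lerobot's real aggregation code.
def file_index_per_episode(episode_durations, max_file_duration):
    """Record one file index per episode (a list), instead of overwriting a
    single shared value that would stamp every episode with the last file."""
    per_episode = []        # one entry per aggregated episode
    file_index = 0
    used_duration = 0.0
    for duration in episode_durations:
        if used_duration + duration > max_file_duration and used_duration > 0.0:
            file_index += 1       # this episode starts a new video file
            used_duration = 0.0
        per_episode.append(file_index)
        used_duration += duration
    return per_episode


# Episodes 0 and 1 stay in file 0; episode 2 rolls over into file 1.
print(file_index_per_episode([60.0, 30.0, 50.0], max_file_duration=100.0))  # [0, 0, 1]
```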
@andras-makany no worries, thank you for trying to solve the issue. About the first error, "The chunk and file indexing forced a single value onto every episode in a dataset. This caused your problem: some of your dataset's episodes were concatenated onto the previous file, but in the metadata they got the new file's index." - I didn't understand what you meant here, to be honest. But as I understand it, you managed to fix it? About the second error: saving episodes in different video files was done on purpose; otherwise we had ffmpeg problems during dataset collection. I can test it on Sunday or Monday and confirm whether it works for me.
@Grigorij-Dudnik thank you for your answer! The first problem was that, when aggregating datasets, if the video recordings of some episodes from a single source dataset get merged into one video file of the merged dataset and the videos of its other episodes go into a second video file, the metadata was set incorrectly, as if every episode were in the last used file. Yes, that problem has been solved. As for the second, thank you for the explanation.
What it does
First mentioned in issue #2328. When using a merged dataset (version 3.0), `torch.utils.data.DataLoader` encounters an exception that the given frame index is invalid. The error was due to not resetting the `latest_duration` offset in `aggregate_videos` when creating a new file, resulting in an offset equal to the last file's frame count when writing the second episode of the new file. The solution was to reset `latest_duration` after creating a new file.
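A minimal sketch of that fix, with illustrative names that follow the description above rather than the real `aggregate_videos` implementation: the running duration offset must drop back to zero whenever episodes start going into a fresh video file.

```python
# Illustrative sketch only; variable names follow the PR description, not the
# real aggregate_videos() code.
def episode_offsets(episode_durations, max_file_duration):
    """Timestamp offset of each episode inside the video file it lands in."""
    offsets = []
    latest_duration = 0.0
    for duration in episode_durations:
        if latest_duration + duration > max_file_duration and latest_duration > 0.0:
            latest_duration = 0.0  # the fix: reset when a new file is created
        offsets.append(latest_duration)
        latest_duration += duration
    return offsets


# With a 100 s file cap, the third episode opens a new file, so its offset must
# drop back to 0.0; without the reset it would start at 90.0 and point at
# frames past the end of that file.
print(episode_offsets([60.0, 30.0, 50.0], max_file_duration=100.0))  # [0.0, 60.0, 0.0]
```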
How it was tested
Using the fixed aggregation, I merged a new copy of my dataset containing 75 episodes, with the video file size set to 500 MB, creating 13 video files in the merged dataset.
Viewing the newly merged dataset, the offsets were correct.
Then I executed a model training run with a batch size of 10 and 100 training steps. After multiple tries, no exception was raised, so I assume the problem is solved.
The tests included in the repository passed.
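For reference, a rough sketch of that training-side check, simply iterating a DataLoader over the merged dataset; the `LeRobotDataset` import path and constructor below are assumptions and may differ between lerobot versions, and the repo id is the dataset linked earlier in this thread.

```python
# Hedged sketch: the import path and constructor are assumptions and may differ
# between lerobot versions; the repo id is the dataset from this conversation.
import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("Grigorij/xle_left_arm_merged_filtered_repaired")
loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)

# Decoding video frames while iterating is where the original
# "RuntimeError: Invalid frame index ... for streamIndex=0" surfaced.
for step, batch in enumerate(loader):
    if step >= 100:  # roughly mirrors the 100 training steps mentioned above
        break
print("No frame-index errors raised while reading 100 batches")
```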
How to checkout & try? (for the reviewer)
Try merging smaller datasets so that the new dataset contains at least 2 video files. Viewing `meta/episodes/.../file-000.parquet`, the `from_timestamp` and `to_timestamp` values should be sequential with no skips and should reset to 0.0 when a new file starts. Training a model should give no errors as well.
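A small sketch of that parquet check, assuming flat `from_timestamp`/`to_timestamp` columns as named above (the real v3.0 schema may nest them per video key); the `...` in the path is kept from the description and should be replaced with the actual chunk directory of your merged dataset.

```python
# Sketch of the reviewer check; replace "..." with the actual chunk directory.
# Assumes flat from_timestamp/to_timestamp columns, which may not match the
# exact metadata schema.
import pandas as pd

df = pd.read_parquet("meta/episodes/.../file-000.parquet")
ts = df[["from_timestamp", "to_timestamp"]]
print(ts.head(20))

# Expect contiguous spans inside each video file, with from_timestamp dropping
# back to 0.0 whenever a new video file starts.
file_starts = ts.index[ts["from_timestamp"] == 0.0].tolist()
print("rows where a new video file starts:", file_starts)
```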