Egocentric vision focuses on action recognition from a first-person perspective. This project investigates frame-sampling strategies and multimodal data integration for improving egocentric action recognition. Using the Epic-Kitchens and ActionSense datasets, it explores the trade-off between spatial and temporal information in RGB frames and incorporates electromyography (EMG) data for multimodal fusion. The results highlight the potential of combining modalities through mid-level fusion to improve classification performance.
Egocentric action recognition poses unique challenges due to the reliance on first-person video recordings. While RGB information is often sufficient, additional modalities like audio and sensor data (e.g., EMG) can provide complementary information. This project investigates:
- Sampling strategies for RGB frames.
- The integration of EMG data with RGB frames.
- Late and mid-level fusion techniques for multimodal action recognition.
- Epic-Kitchens
  - First-person videos of unscripted actions in diverse kitchen environments.
  - Used to study RGB frame sampling (dense and uniform).
- ActionSense
  - Multimodal data, including RGB video, EMG signals, and EMG-derived spectrograms.
  - Data recorded using wearable sensors in controlled environments.
- Dense sampling: selects adjacent frames within a short window, emphasizing spatial detail.
- Uniform sampling: selects evenly spaced frames across the whole clip, prioritizing temporal coverage.
- Finding: dense sampling with 16 frames per clip outperformed uniform sampling, offering the best spatial-temporal trade-off for egocentric action recognition (a minimal sketch of both strategies follows this list).
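The sketch below illustrates the two strategies as frame-index selectors, assuming a fixed clip length of 16 frames; the stride and windowing details are illustrative choices, not the project's exact implementation.

```python
import numpy as np

def dense_sampling(num_frames, clip_len=16, stride=2, rng=None):
    """Pick `clip_len` adjacent (strided) frames from one random short window."""
    rng = rng or np.random.default_rng()
    span = clip_len * stride
    start = rng.integers(0, max(num_frames - span, 0) + 1)
    idx = start + stride * np.arange(clip_len)
    return np.clip(idx, 0, num_frames - 1)

def uniform_sampling(num_frames, clip_len=16, rng=None):
    """Split the video into `clip_len` equal segments and pick one frame per segment."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, clip_len + 1)
    idx = [rng.integers(int(lo), max(int(hi), int(lo) + 1))
           for lo, hi in zip(edges[:-1], edges[1:])]
    return np.clip(np.array(idx), 0, num_frames - 1)

# Example: frame indices for a 300-frame video.
print(dense_sampling(300))    # 16 adjacent (strided) indices from one window
print(uniform_sampling(300))  # 16 indices spread evenly across the video
```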
- Extracted clip-level features with a pre-trained Inflated 3D ConvNet (I3D).
- Visualized the extracted features with dimensionality-reduction techniques (PCA and t-SNE); see the sketch below.
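A minimal sketch of the visualization step, assuming the I3D clip features and their numeric action labels have already been saved to NumPy files (`i3d_features.npy` and `clip_labels.npy` are hypothetical names); PCA first reduces the feature dimension cheaply, then t-SNE maps each clip to 2-D for plotting.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical inputs: one I3D feature vector per clip plus its action label.
features = np.load("i3d_features.npy")  # shape: (num_clips, feature_dim), e.g. (N, 1024)
labels = np.load("clip_labels.npy")     # shape: (num_clips,), integer class ids

# PCA to 50 dimensions, then t-SNE down to 2-D.
pca_features = PCA(n_components=50).fit_transform(features)
embedded = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(pca_features)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=8)
plt.colorbar(scatter, label="action class")
plt.title("t-SNE of I3D clip features")
plt.tight_layout()
plt.savefig("i3d_tsne.png")
```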
- EMG signals: preprocessed with filtering, scaling, and zero-padding, then modeled with single- and double-layer LSTM networks (a preprocessing and single-layer LSTM sketch follows this list).
- Spectrograms: generated from the EMG data and classified with a LeNet5 CNN.
- RGB frames: processed with dense sampling and classified with a single-layer LSTM network.
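Below is a minimal sketch of the EMG branch under assumed parameters (16 channels, a 20-class label space, a 4th-order Butterworth low-pass filter, and a fixed padded length of 100 steps); the project's exact filter design and dimensions may differ.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import butter, filtfilt

def preprocess_emg(signal, fs=160.0, cutoff=5.0, target_len=100):
    """Rectify, low-pass filter, min-max scale, and zero-pad one EMG recording.

    `signal` has shape (time, channels); fs, cutoff, and target_len are illustrative.
    """
    rectified = np.abs(signal)
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    smoothed = filtfilt(b, a, rectified, axis=0)
    scaled = (smoothed - smoothed.min(0)) / (smoothed.max(0) - smoothed.min(0) + 1e-8)
    padded = np.zeros((target_len, signal.shape[1]), dtype=np.float32)
    padded[: min(target_len, len(scaled))] = scaled[:target_len]
    return padded

class EmgLSTM(nn.Module):
    """Single-layer LSTM classifier over the padded EMG sequence."""
    def __init__(self, in_channels=16, hidden=128, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (batch, time, channels)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # class logits from the last hidden state

# Example forward pass on a dummy batch of four recordings.
batch = torch.stack([torch.from_numpy(preprocess_emg(np.random.randn(400, 16)))
                     for _ in range(4)])
print(EmgLSTM()(batch).shape)  # torch.Size([4, 20])
```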
- Late fusion: combines the outputs of independently pre-trained per-modality models at inference time.
- Mid-level fusion: jointly trains the per-modality networks by combining their mid-level features, which improves performance (see the sketch below).
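A minimal sketch of the two fusion schemes in PyTorch, where `rgb_encoder` and `emg_encoder` are placeholders for the per-modality backbones and all feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MidLevelFusion(nn.Module):
    """Jointly trained fusion head over concatenated RGB and EMG mid-level features."""
    def __init__(self, rgb_encoder, emg_encoder, rgb_dim=512, emg_dim=128, num_classes=20):
        super().__init__()
        self.rgb_encoder, self.emg_encoder = rgb_encoder, emg_encoder
        self.classifier = nn.Sequential(
            nn.Linear(rgb_dim + emg_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, rgb, emg):
        fused = torch.cat([self.rgb_encoder(rgb), self.emg_encoder(emg)], dim=-1)
        return self.classifier(fused)

def late_fusion(rgb_logits, emg_logits, w_rgb=0.5):
    """Late fusion: weighted average of the softmax scores of separately trained models."""
    return w_rgb * rgb_logits.softmax(-1) + (1 - w_rgb) * emg_logits.softmax(-1)

# Example with identity "encoders" standing in for the real backbones.
model = MidLevelFusion(nn.Identity(), nn.Identity())
print(model(torch.randn(4, 512), torch.randn(4, 128)).shape)  # torch.Size([4, 20])
```

The key difference is that mid-level fusion trains both branches end-to-end through a single classification loss, whereas late fusion only mixes the scores of models that were trained separately.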
- Epic-Kitchens (RGB sampling): dense sampling with 16 frames per clip achieved the highest Top-1 accuracy, 60.23%, with an LSTM classifier.
- ActionSense, individual modalities:
  - RGB: 78.57% Top-1 accuracy (LSTM).
  - EMG: 56.97% Top-1 accuracy (single-layer LSTM).
  - Spectrograms: 54.50% Top-1 accuracy (LeNet5).
- ActionSense, fusion approaches:
  - Mid-level fusion (RGB + EMG): 80.92% Top-1 accuracy.
  - Late fusion did not surpass the best individual modality.
The project demonstrates the potential of multimodal approaches for egocentric action recognition:
- Dense sampling of RGB frames offers the best trade-off between spatial and temporal information.
- Mid-level fusion of RGB and EMG improves classification performance over either modality alone.