@just-meng just-meng commented Dec 9, 2025

No description provided.

@yarikoptic yarikoptic changed the title add a new blog post on a data analysis workflow with git-worktrees an… add a new blog post on a data analysis workflow with git-worktrees and datalad Dec 12, 2025
Member

@yarikoptic yarikoptic left a comment

This is wonderful! Thank you for composing

I left some initial comments... the plane is taking off, will try to review more later.

Cheers


At the recent [Distribits](https://www.distribits.live/events/2025-distribits/) meeting, I shared my struggle with [Yarik](https://github.com/yarikoptic), and confessed that I never fully understood the [YODA principle](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html) - I was on the way, but I couldn't see clearly where I was heading. Together, we've found a sweet spot that gives me both **fast iteration during development** and **clean reproducibility for batch processing** - all without duplicating data or rebuilding containers. The secret? [Git worktrees](https://git-scm.com/docs/git-worktree) combined with [DataLad](https://handbook.datalad.org/en/latest/index.html)'s nested datasets.

### When YODA's wisdom grows on trees
Member

I felt that the jump to details is a bit too sudden for those who aren't familiar with YODA. It might be worth adding a sentence or two here on worktrees and on composing datasets via submodules, as an instance of the modularity that the YODA principles call for.

Member

Maybe even point to other examples, like the datasets in OpenNeuroDerivatives, to reaffirm that this is somewhat "common" and not that scary.
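
To make the worktree idea concrete for readers new to it, here is a minimal sketch of what `git worktree add` does, run in a throwaway sandbox. The repository name `project`, the sibling directory `project-runs`, and the file `analysis.txt` are placeholders and not taken from the post. The branch name `runs` is borrowed from the post, but how the author actually sets up the worktree is not shown in this thread, so treat this as an illustration only.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; every name below is a placeholder for illustration.
sandbox=$(mktemp -d)
cd "$sandbox"

# A small repository with one commit.
git init -q project
cd project
git config user.email "demo@example.com"
git config user.name "Demo"
echo "analysis v1" > analysis.txt
git add analysis.txt
git commit -q -m "initial analysis"

# Create a second checkout of the same repository on a new branch 'runs'.
# Both checkouts share one object store, so no history is duplicated.
git worktree add -b runs ../project-runs >/dev/null 2>&1

# Show both working trees and the branch each one has checked out.
git worktree list
```

The second directory behaves like an ordinary checkout, so development can continue in `project` while `project-runs` stays on a clean branch.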

```
Member

Could you also add a sentence or two in the opening on what your study is about and what you are aiming for here (preprocessing vs., e.g., paper figures)?

│ └── 04_dataframes/
└── ...
```
The active work in my project happens in the 'derived/L5b' subdataset that consumes raw data as inputs and produces multiple intermediate outputs, e.g. '01_suite2p' ... '04_dataframes'. The subdataset 'code' is a pure git repo and I use [Jujutsu](https://docs.jj-vcs.dev/latest/) for active development (because it's such a beautiful tool). After a rather nerve-wracking experience of mixing datalad with jj - it was like Schrödinger's cat, everything was simultaneously staged and unstaged 😱 - I restrict jj usage to 'code', and continue using datalad for managing all subdatasets with annexed content, orchestrating across nested datasets as a whole, and capturing provenance of all derived data and figures with the datalad (containers-)run command.
Member

This is such a wonderful example of mixing different tech (git-annex and datalad, with jj) that is built on top of the same (git) foundation, using that foundation's basic constructs (repos, commits, submodules) ... I wonder if we could also add some kind of message here encouraging people to use such core tech instead of creating new and ad hoc stuff.

HEAD detached from refs/heads/runs
nothing to commit, working tree clean
```
Let's update that branch and sew the 'HEAD' back to 'runs'. There are multiple ways to do that, I use jj:
Member

Suggested change
- Let's update that branch and sew the 'HEAD' back to 'runs'. There are multiple ways to do that, I use jj:
+ Let's update that branch and sew the 'HEAD' back to 'runs'. There are multiple ways to do that, including plain `git` commands; I use jj, which I am more familiar with:
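
For readers without jj, one plain `git` way to do this might look like the sketch below. It rebuilds the situation in a temporary repository (the file name and commit messages are made up), then moves `runs` to the current commit and reattaches `HEAD`. It assumes `runs` is not checked out in another worktree, since `git branch -f` refuses to move a branch that is checked out elsewhere.

```shell
#!/bin/sh
set -eu

# Temporary repository standing in for the real dataset (hypothetical content).
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"
echo "run 1" > log.txt
git add log.txt
git commit -q -m "first run"
git branch runs

# Reproduce the problem: a new commit lands on a detached HEAD.
git checkout -q --detach runs
echo "run 2" >> log.txt
git commit -q -am "second run"

# Fix it: point 'runs' at the current commit, then reattach HEAD to it.
git branch -f runs HEAD
git checkout -q runs

# The status header should now name the branch instead of a detached HEAD.
git status --short --branch
```

`git checkout -B runs` does both steps in a single command, under the same assumption about other worktrees.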

save (notneeded: 5)
unlock (ok: 28)
```
What happened? - Apparently, I've lost the '.venv' directory during the hard reset, which causes "ModuleNotFoundError: No module named 'process2p'". määäh! This actually speaks for the use of a container. The problem with the container is that I have to rebuild it every time I update my code ... annoying! I guess that's the trade-off between efficiency and reproducibility. To illustrate this 'highly complex' dilemma with Deepseek's smart-ass comment in a graph:
Member

Technically that's not true: you can use a container as your "environment" and run an outside script in it. You can even bind-mount that script over whatever version of it might already be inside the container. So you can kinda have the best of both worlds... might want to adjust the "problem statement" here ;)
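
A rough sketch of that suggestion, assuming Apptainer as the container runtime (the thread does not say which runtime the post uses). The image name `env.sif`, the host directory `code`, the in-container path `/opt/code`, and the script `run_analysis.py` are all hypothetical. The script only prints the command unless Apptainer and the image are actually present.

```shell
#!/bin/sh
set -eu

# Hypothetical names: adjust to the real image and code layout.
image="env.sif"
host_code="$PWD/code"
container_code="/opt/code"

# Bind-mount the current host code over the copy baked into the image,
# then run the script inside the container's environment.
set -- apptainer exec \
    --bind "${host_code}:${container_code}" \
    "$image" \
    python "${container_code}/run_analysis.py"

if command -v apptainer >/dev/null 2>&1 && [ -f "$image" ]; then
    "$@"
else
    # No runtime or image available: show what would be executed.
    echo "dry run: $*"
fi
```

With this layout the image only needs rebuilding when dependencies change, not on every code edit.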
