add a new blog post on a data analysis workflow with git-worktrees and datalad #7
Conversation
**yarikoptic** left a comment:
This is wonderful! Thank you for composing it.
I left some initial comments... the plane is taking off, will try to review more later.
Cheers!
> At the recent [Distribits](https://www.distribits.live/events/2025-distribits/) meeting, I shared my struggle with [Yarik](https://github.com/yarikoptic), and confessed that I never fully understood the [YODA principle](https://handbook.datalad.org/en/latest/basics/101-127-yoda.html) - I was on the way, but I couldn't see clearly where I was heading. Together, we've found a sweet spot that gives me both **fast iteration during development** and **clean reproducibility for batch processing** - all without duplicating data or rebuilding containers. The secret? [Git worktrees](https://git-scm.com/docs/git-worktree) combined with [DataLad](https://handbook.datalad.org/en/latest/index.html)'s nested datasets.
> ### When YODA's wisdom grows on trees
I felt that the jump to details is a bit too sudden for those who aren't familiar with YODA. Might be worth adding a sentence or two here on worktrees and the composition of datasets via submodules, as an expression of modularity among the YODA principles.
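For readers new to worktrees, the mechanics might be sketched like this - a self-contained illustration with made-up repo and path names, not the post's actual layout:

```shell
# Illustrative sketch (hypothetical names): one git history,
# two working trees sharing a single object store.
set -e
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Add a second, detached working tree for a clean batch run,
# while this checkout stays free for ongoing development.
git worktree add --detach ../demo-run HEAD

git worktree list   # lists both trees, backed by the same .git
```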
Maybe even point to other examples, like the datasets in OpenNeuroDerivatives, to attest that this is somewhat "common" and not that scary.
> ### When YODA's wisdom grows on trees
Could you also add a sentence or two at the opening on what your study is about and what you are aiming at here (preprocessing vs., e.g., paper figures)?
> ```
> │ └── 04_dataframes/
> └── ...
> ```
> The active work in my project happens in the `derived/L5b` subdataset that consumes raw data as inputs and produces multiple intermediate outputs, e.g. `01_suite2p` ... `04_dataframes`. The subdataset `code` is a pure git repo and I use [Jujutsu](https://docs.jj-vcs.dev/latest/) for active development (because it's such a beautiful tool). After a rather nerve-wracking experience of mixing DataLad with jj - it was like Schrödinger's cat, everything was simultaneously staged and unstaged 😱 - I restrict jj usage to `code`, and continue using DataLad for managing all subdatasets with annexed content, orchestrating across nested datasets as a whole, and for capturing provenance of all derived data and figures with the datalad (containers-) run command.
This is such a wonderful example of mixing different tech (git-annex and DataLad with jj) that is built on top of the same (git) foundation, using that foundation's basic constructs (repos, commits, submodules)... I wonder if we could also add some kind of message here encouraging people to use such core tech instead of creating new and ad hoc stuff.
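As a side note for readers who haven't seen it, the provenance capture mentioned in the quoted paragraph looks roughly like this - a sketch only, with hypothetical input/output paths, script name, and container name:

```shell
# Hypothetical sketch of DataLad provenance capture.
# --input/--output let DataLad fetch inputs and unlock outputs;
# the exact command line is recorded in the resulting commit,
# so the step can later be repeated with `datalad rerun`.
datalad run \
    -m "extract dataframes from suite2p output" \
    --input "01_suite2p" \
    --output "04_dataframes" \
    "python code/extract_dataframes.py"

# With the datalad-container extension, the same call executes
# inside a registered container, pinning the environment as well:
datalad containers-run -n analysis \
    -m "extract dataframes (containerized)" \
    --input "01_suite2p" \
    --output "04_dataframes" \
    "python code/extract_dataframes.py"
```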
> ```
> HEAD detached from refs/heads/runs
> nothing to commit, working tree clean
> ```
> Let's update that branch and sew the 'HEAD' back to 'runs'. There are multiple ways to do that, I use jj:
Suggested change:
> Let's update that branch and sew the 'HEAD' back to 'runs'. There are multiple ways to do that, including plain `git` commands, I use jj which I am more familiar with:
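The plain-`git` route the suggestion alludes to could look like this - a self-contained sketch that first recreates a detached-HEAD state and then sews `HEAD` back onto `runs` (branch name taken from the post, everything else hypothetical):

```shell
# Sketch: reattach a detached HEAD to the 'runs' branch with plain git.
set -e
git init -q reattach-demo && cd reattach-demo
git -c user.name=d -c user.email=d@example.com \
    commit -q --allow-empty -m "base"
git branch runs

# Simulate the situation from the post: HEAD detached from 'runs',
# with a new commit made while detached.
git checkout -q --detach HEAD
git -c user.name=d -c user.email=d@example.com \
    commit -q --allow-empty -m "made while detached"

# Move 'runs' to the current commit and check it out again.
git branch -f runs HEAD
git checkout -q runs
git status -sb   # back on branch 'runs'
```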
> ```
> save (notneeded: 5)
> unlock (ok: 28)
> ```
> What happened? - Apparently, I've lost the `.venv` directory during the hard reset, which causes "ModuleNotFoundError: No module named 'process2p'". Määäh! This actually speaks for the use of a container. The problem with the container is that I have to rebuild it every time I update my code ... annoying! I guess that's the trade-off between efficiency and reproducibility. To illustrate this 'highly complex' dilemma with Deepseek's smart-ass comment in a graph:
Technically that's not true: you can use the container as your "environment" and run an outside script in it. You can even bind-mount that script over some version of it which you might already have inside the container. So you can kinda have the best of both worlds... might want to adjust the "problem statement" here ;)
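What the comment describes might look like this - a hedged sketch with a hypothetical image name, script name, and paths, not the post's actual setup:

```shell
# Sketch: treat the container purely as a frozen environment and
# run the *current* working copy of the code inside it.
docker run --rm \
    -v "$PWD/code:/opt/code:ro" \
    -v "$PWD/data:/data" \
    myproject-env:latest \
    python /opt/code/run_pipeline.py /data

# Same idea with Apptainer/Singularity; --bind can even shadow a
# script that was baked into the image at build time:
apptainer exec \
    --bind "$PWD/code/run_pipeline.py:/opt/code/run_pipeline.py" \
    myproject-env.sif \
    python /opt/code/run_pipeline.py
```

This way the environment stays pinned while the code iterates freely, so the image only needs rebuilding when dependencies change.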
Co-authored-by: Yaroslav Halchenko <[email protected]>