
Conversation

@naxatra2 (Contributor) commented Jul 3, 2025

I have created a new notebook which is not linked to deepforest in any way; I just wrote my code in the same directory for convenience. This notebook checks whether my model is learning the way I want.

I have used a custom dataset of daisy flowers with 131 annotated images in COCO format. To reproduce this code, you need a training image dataset with annotations in either .json or .csv format, plus a test dataset.

I have used a very light model for training to reduce time. If I replace it with a better model, the accuracy should improve.

Objective

Simulate how an object detector's accuracy (mAP) improves as I iteratively label more images. This example currently uses random sampling from the unlabeled pool, which I will take as the baseline; the next step is to use an active-learning-specific sampling technique.

How to reproduce this (without shipping the giant flower dataset)

  1. Prepare your own dataset in COCO format (or convert from VOC/Pascal/CSV into COCO).

    • You need a JSON with images, annotations, and categories keys. Each annotation must have image_id, bbox in [x,y,w,h], and category_id.
    • Put your image files in two folders (one for training + pool, one for the fixed test/val split).
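For reference, a minimal annotation file with those keys might look like this (file names, sizes, and ids are illustrative, not from the actual dataset):

{
  "images": [
    {"id": 1, "file_name": "daisy_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [120, 80, 60, 45], "area": 2700, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "daisy"}
  ]
}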

The workflow has three main steps

  1. COCO ↔ CSV conversion

    • The parse_coco(json_file, img_dir) function reads a COCO-style annotation JSON and writes out a flat labels_raw.csv, with one row per bounding box (xmin,ymin,xmax,ymax,label,image_path).
    • The build_coco_gt(df, out_json) utility converts that CSV back into a minimal COCO JSON (images, annotations, categories) so that we can use it later as the "ground truth" for evaluation.
  2. Custom Dataset + DataLoader

    • The FlowerDataset class (subclassing torch.utils.data.Dataset): its __getitem__ loads an image, retrieves the boxes and labels from my CSV/COCO data, applies resizing, converts everything to tensors, and returns (image, target_dict) for TorchVision detection models.
  3. Active-Learning Loop

    • For each of ROUNDS cycles:
      1. Build DataLoaders for the current train_idx and test_idx.
      2. Train a fresh Faster R-CNN on the labeled subset.
      3. Evaluate on the validation set, record the mAP, and print it.
      4. Randomly sample POOL_BATCH new images from the pool to add to train_idx.
  • After all rounds finish, I plot the number of labeled images against mAP@0.5 to see how performance scales as I label more data; a minimal sketch of this loop follows below.
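To make the loop concrete, here is a minimal sketch of the random-sampling cycle, assuming the FlowerDataset from step 2 and hypothetical settings (ROUNDS, POOL_BATCH, split sizes, file paths); it illustrates the structure rather than reproducing the notebook's exact code.

import random
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchmetrics.detection import MeanAveragePrecision

ROUNDS, POOL_BATCH, NUM_CLASSES = 5, 10, 2   # hypothetical settings
device = "cuda" if torch.cuda.is_available() else "cpu"

def collate(batch):
    # Detection batches are lists of (image, target) pairs of varying size
    return tuple(zip(*batch))

dataset = FlowerDataset("labels_raw.csv", "images/")  # Dataset class from step 2; paths illustrative
all_idx = list(range(len(dataset)))
random.shuffle(all_idx)
test_idx, pool_idx = all_idx[:30], all_idx[30:]       # fixed test split
train_idx = [pool_idx.pop() for _ in range(POOL_BATCH)]

history = []
for round_num in range(ROUNDS):
    train_loader = DataLoader(Subset(dataset, train_idx), batch_size=2,
                              shuffle=True, collate_fn=collate)
    test_loader = DataLoader(Subset(dataset, test_idx), batch_size=2,
                             collate_fn=collate)

    # Train a fresh detector on the current labeled subset
    model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES).to(device)
    optim = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for images, targets in train_loader:
        images = [im.to(device) for im in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss = sum(model(images, targets).values())
        optim.zero_grad(); loss.backward(); optim.step()

    # Evaluate on the fixed test split (boxes assumed xyxy, the metric default)
    metric = MeanAveragePrecision()
    model.eval()
    with torch.no_grad():
        for images, targets in test_loader:
            preds = model([im.to(device) for im in images])
            metric.update([{k: v.cpu() for k, v in p.items()} for p in preds],
                          list(targets))
    map50 = metric.compute()["map_50"].item()
    history.append((len(train_idx), map50))
    print(f"round {round_num}: {len(train_idx)} images, mAP@0.5 = {map50:.3f}")

    # Random-sampling baseline: move POOL_BATCH pool images into the train set
    train_idx += [pool_idx.pop() for _ in range(min(POOL_BATCH, len(pool_idx)))]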

@naxatra2 (Contributor, Author) commented Jul 3, 2025

@jveitchmichaelis @bw4sz Can you please look into this repo?

This is made in reference to this comment from #1069:

    Thanks @naxatra2, I think a good next step would be to demo a training + random sampling example, without label studio for now. I suggest you use images that we already have labels for (to avoid the human step). Let us know if you need any help putting that together.

    It would be interesting to start totally from scratch e.g. a RetinaNet model that has been trained on MS-COCO. Use that to select your first X images, train and repeat. A nice outcome from the project would be a graph of number_of_images vs model_performance with different sampling strategies. Once we have your loop ready, those experiments should be easy to run on the UF cluster.

@jveitchmichaelis added the "Google Summer of Code" label on Jul 3, 2025
@jveitchmichaelis (Collaborator) commented:

Thanks @naxatra2, some comments:

  • Good to see a self contained example.
  • My suggestion for these experiments is to always run a baseline training loop first, where you attempt to achieve a good score on the test set using the entire training dataset. Normally what you see in papers for active learning is that if you can use the entire dataset, you'll get the best/same results if you train long enough.
  • What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).
  • We usually always start with something trained on MS-COCO or a similar backbone, at least unless the training dataset is 1000s of images. Oddly enough it looks like there isn't a flower class in COCO, which surprised me (there is "potted plant").
  • The mAP in your plots is still basically zero, so it looks like we're a long way from convergence.
  • You're training for a single epoch? I'm not sure what the literature recommends here, but this is very few iterations if your dataset is only 1-2 images at a time.
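For reference, a sketch of the current torchvision weights API (the model choice here is just illustrative):

from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

# Old, deprecated style: fasterrcnn_resnet50_fpn(pretrained=True, pretrained_backbone=True)
# Current style: pass an explicit weights enum, or weights=None for random init
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)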

@bw4sz do you have a suggestion for which tree/box dataset we should start exploring to test this? Neon?

@naxatra2 (Contributor, Author) commented Jul 3, 2025

I was confused about the dataset part, so instead of choosing a heavier and better model (possibly pretrained), I opted for a very light model that I could run without a GPU. I mostly did this to check whether my implemented logic was working. I think this causes most of the issues in my notebook. For example, this part:

What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).

This is mostly because I was experimenting with multiple models, and I forgot to clean the code. I think this and the small dataset are the reasons behind my almost negligible mAP values.

Also, I initially thought of using the NEON dataset, but it was too big, and from GitHub I was only able to find the annotations, not the training images. So I just used a very basic custom dataset to structure my notebook.

@jveitchmichaelis (Collaborator) commented:

If you're running in a Notebook, I would recommend using Google Colab for free GPU access (Nvidia T4). Disk space should be plenty on there too.

@naxatra2 (Contributor, Author) commented Jul 5, 2025

Hi @bw4sz, you mentioned a demo code or structure related to my model training in the last meeting. Can you please provide it?

@bw4sz (Collaborator) commented Jul 7, 2025

So my general thoughts are to integrate within the deepforest models to give a reasonable starting point.

  1. Load a deepforest model
  2. Make a prediction on a set of images
  3. Select images to annotate based on sampling criteria
  4. Gather images and push to Label Studio
  5. Extract annotations from Label Studio
  6. Prepare for training
  7. Go back to step 2.

You could grab a small number of images from https://milliontrees.idtrees.org/en/latest/ and use the deepforest tree model.
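Steps 1-2 might look roughly like this with DeepForest (the image path is a placeholder; older releases use use_release() while newer ones expose load_model(), so check the docs for your version):

from deepforest import main

# Step 1: load the prebuilt DeepForest tree-crown detector
model = main.deepforest()
model.use_release()

# Step 2: predict on one image; returns a DataFrame of boxes with scores
boxes = model.predict_image(path="example_tile.png")
print(boxes[["xmin", "ymin", "xmax", "ymax", "score"]].head())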

@naxatra2 (Contributor, Author) commented Jul 8, 2025

I will update my progress here by tonight, based on these inputs.

@naxatra2 (Contributor, Author) commented Jul 8, 2025

Hi @jveitchmichaelis @bw4sz, can you please look into this Colab file that I created?

https://colab.research.google.com/drive/1C3fINy4rCsPWsWflx-FDtme0q49oRjKr?usp=sharing

My mAP values are still basically non-existent; the max mAP value I am getting is 0.06 after 30-40 epochs, which is not good. Can you please check this? I am not able to figure out what I am doing wrong here. I have also attached the dataset and the resources I referenced in the notebook itself. I am feeling stuck on properly training the baseline model.

@jveitchmichaelis (Collaborator) commented Jul 8, 2025 via email

@naxatra2 (Contributor, Author) commented:

Hi @jveitchmichaelis I’ve spent some more time trying to debug the training, but I’m still getting very low mAP values, and I haven’t been able to make any real progress. I’ve tried going through my setup and references again, but I can’t figure out where I’m going wrong.
Would really appreciate it if you could take a look when you get a chance. I'm feeling quite stuck at the moment.

@jveitchmichaelis (Collaborator) commented Jul 10, 2025

https://colab.research.google.com/drive/1V7NlByb1yBt5-XDAombVQYrzR1I_MZ0a?usp=sharing @naxatra2

The main (only?) change is that I fixed the number of training classes to be len(CLASSES) and made sure that CLASSES corresponds to your custom dataset. The model has two components: a backbone, which extracts features, and a "head", which takes those features and detects objects. The backbone should not change much, as it's already been trained on many images and its output shape stays the same. The head, however, must be changed to detect the new classes (i.e. from 90 in COCO to 4 in your custom one).
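The class fix described above amounts to swapping the detector's box-predictor head for one sized to the custom classes; a sketch (the class list is illustrative):

from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

CLASSES = ["__background__", "daisy", "rose", "tulip"]  # illustrative 4-class list

# Keep the COCO-pretrained backbone; replace only the classification head
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=len(CLASSES))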

Val mAP is still terrible, so my guess is there's a bug in the validation code somewhere - check your box formats? Probably this line:

# Box format must be xywh, default is xyxy
metric = MeanAveragePrecision(box_format='xywh')

Potentially also lower the learning rate.

I also added a download cell at the top so the dataset is pulled if you're making a new environment (not saved to Drive).
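As a quick sanity check, the metric only behaves if predictions and ground truth use the same box convention; a minimal self-contained example (the values are made up):

import torch
from torchmetrics.detection import MeanAveragePrecision

# Both preds and targets below are [x, y, w, h], matching box_format="xywh"
metric = MeanAveragePrecision(box_format="xywh")
preds = [{"boxes": torch.tensor([[10.0, 20.0, 40.0, 30.0]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[10.0, 20.0, 40.0, 30.0]]),
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute()["map"])  # perfect overlap, so this reports 1.0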

@naxatra2 (Contributor, Author) commented:

I have made some progress, and I think in 1-2 more days I can produce a good plot that shows the improvements in mAP values from using active learning. I was able to fix the issues I was getting earlier, and after running my model for 60 epochs I was getting an almost constant ~0.72 mAP. I trained the model on the whole training dataset (1052 images). This is the baseline I will use later for benchmarks.

After this, I created another notebook for active learning. I thought I had made some progress because I was getting around 0.4-0.5 mAP after 5 cycles, but there were logical errors in my code that I didn't notice, so I have to redo that part. Still, I am not feeling stuck; I can see some progress.

My first step is to randomly pick 10% of the images from the training data and call them labelled; the rest of the dataset is treated as unlabeled. Then I train my model on the labelled images and slowly feed it more images, to see how high the mAP can get with a minimal number of images. My first 10 images are randomly picked; for the next cycles I use least-confidence sampling. The errors I didn't notice at first looked like this:


=== Cycle 1 — training on 210 images ===
  Epoch 1/5 — loss 0.6135
  Epoch 2/5 — loss 0.5633
  Epoch 3/5 — loss 0.5405
  Epoch 4/5 — loss 0.4992
  Epoch 5/5 — loss 0.4600
→ Validation mAP: 0.3623
→ Adding to seed: []

=== Cycle 2 — training on 260 images ===
  Epoch 1/5 — loss 0.4367
  Epoch 2/5 — loss 0.4456
  Epoch 3/5 — loss 0.4073
  Epoch 4/5 — loss 0.3893
  Epoch 5/5 — loss 0.4173
→ Validation mAP: 0.3752
→ Adding to seed: []

=== Cycle 3 — training on 260 images ===
  Epoch 1/5 — loss 0.3749
  Epoch 2/5 — loss 0.4055
  Epoch 3/5 — loss 0.3657
  Epoch 4/5 — loss 0.3213
  Epoch 5/5 — loss 0.3241
→ Validation mAP: 0.4241
→ Adding to seed: []

=== Cycle 4 — training on 260 images ===
  Epoch 1/5 — loss 0.3488
  Epoch 2/5 — loss 0.3353
  Epoch 3/5 — loss 0.3054
  Epoch 4/5 — loss 0.3495
  Epoch 5/5 — loss 0.2992
→ Validation mAP: 0.4452
→ Adding to seed: []


=== Cycle 5 — training on 260 images ===
  Epoch 1/5 — loss 0.3231

At first I only looked at the first two cycles and saw the dataset growing in size, but that stops being the case later: the number of images stays constant after cycle 2 (note the empty "Adding to seed: []" lines), which makes these results kind of useless. I am trying to fix this. Running these models takes 2-3 hours, so progress is being hindered a little by this issue.
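For reference, the selection step for least-confidence sampling might look like this sketch (a pool loader yielding (images, indices) pairs is an assumption here, not the notebook's actual code):

import torch

@torch.no_grad()
def least_confidence_indices(model, pool_loader, device, k):
    """Score each unlabeled image by 1 - max detection confidence and
    return the k least confident image indices."""
    model.eval()
    scored = []
    for images, indices in pool_loader:
        outputs = model([im.to(device) for im in images])
        for out, idx in zip(outputs, indices):
            top = out["scores"].max().item() if len(out["scores"]) else 0.0
            scored.append((1.0 - top, int(idx)))
    scored.sort(reverse=True)             # most uncertain first
    return [idx for _, idx in scored[:k]]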

I switched to Kaggle from Colab because more CPU and GPU resources are available there. The dataset is the same.

This is the complete notebook that I used for fine-tuning and downloading the weights for my custom dataset: https://www.kaggle.com/code/jiya1404/african-wildlife

This is the new notebook that I am currently working on for active learning pipeline: https://www.kaggle.com/code/jiya1404/al-african-wildlife

@jveitchmichaelis (Collaborator) commented Jul 16, 2025

Great, if you've got it working (or after you figure out what the current bug is) then I think it might be worth moving to a larger dataset with tree imagery. That might give you some more headroom to try different sampling strategies, and in the end we'd like to apply this to various aerial datasets.

You could use the current public release of Milliontrees to start? That would give you 3-4k images with a decent amount of diversity (different locations). Let me know if you need any help there or if you need us to run some longer experiments.

We should also try to get this working with the deepforest training pipeline.

@naxatra2 (Contributor, Author) commented:

As of now, my first step is to create a plot that shows mAP values vs. the number of images needed to obtain them. Once that part is done, my pipeline will be mostly ready, and I can add 1-2 extra sampling techniques (I already have the functions for different sampling techniques in my other PR, so this part should be easier). I am doing this for benchmarking and annotation-efficiency checks.

I want a figure like this (This is not mine, but the graph should represent similar information):
[figure: example plot of model performance vs. number of labeled images]

Once this gets done, I will try to use a larger dataset.

@jveitchmichaelis (Collaborator) commented Jul 17, 2025

Ok, that sounds sensible. I would add that the curves are almost certainly different for different datasets so I'd be curious to see whether aerial (our dataset) vs terrestrial (your example data) photos are more amenable to this.

@naxatra2 (Contributor, Author) commented:

I have used three sampling techniques to check how my model learns.

Graph explanation

  • Random Sampling (Baseline): This serves as our baseline by selecting new images for annotation completely at random from the unlabeled pool. It represents the performance gain without any intelligent selection.

  • Uncertainty Sampling: This improves upon the baseline by having the model query the images it is most confused about. We calculate an uncertainty score for each unlabeled image and prioritize labeling the ones with the highest uncertainty (my code crashed while running this part, so I only have data points up to the 4th cycle).

  • Random + Uncertainty + Diversity Sampling (Hybrid): This refines both of the above. It first identifies a pool of highly uncertain images and then randomly samples from this pool. This hybrid technique ensures we select informative examples while also promoting diversity, which prevents the model from focusing too narrowly on a single type of difficult case and helps improve its overall generalization.

  • The dotted line represents the baseline mAP value I obtained by fine-tuning the RetinaNet model on the custom dataset, without any active learning techniques.

[figure: validation mAP vs. number of labeled images for the three sampling strategies]

This is the notebook that I have worked on: https://www.kaggle.com/code/jiya1404/active-learning-african-wildlife
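For what it's worth, the hybrid selection step described above could be sketched like this (assuming an uncertainties dict mapping pool index to score; all names are illustrative):

import random

def hybrid_sample(uncertainties, k, pool_factor=3, seed=0):
    """Take the top k * pool_factor most uncertain pool indices, then draw
    k of them at random to add diversity to the selection."""
    ranked = sorted(uncertainties, key=uncertainties.get, reverse=True)
    candidates = ranked[:k * pool_factor]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))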

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis Can you please point me to the deepforest dataset I should try running my model on next, as you mentioned in the last meeting?

@jveitchmichaelis (Collaborator) commented Jul 28, 2025

Try the public release of MillionTrees: https://milliontrees.idtrees.org/en/latest/

The "TreeBoxes" dataset. You can use the library API to download it into your Notebook.

@naxatra2 (Contributor, Author) commented:

[screenshot: error output from the MillionTrees download attempt]

I am not able to download this dataset. I first thought I was doing something wrong, but I don't know what is going wrong: I copied the exact code from the getting-started portion of the docs, and it fails to run.

My notebook: https://www.kaggle.com/code/jiya1404/million
Docs: https://milliontrees.idtrees.org/en/latest/getting_started.html#installation

@jveitchmichaelis (Collaborator) commented Jul 30, 2025

@naxatra2 URL seems to be broken, try this: https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip

It's fine here: https://github.com/weecology/MillionTrees/blob/d0d3942e714abf4261c264e4cd5b49bd9a9f8a45/src/milliontrees/datasets/TreeBoxes.py#L64

If you can extract that, you should be good to start.
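If the library loader keeps failing, a plain download-and-extract fallback (using only the URL above) would be:

import pathlib
import urllib.request
import zipfile

url = "https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip"
dest = pathlib.Path("TreeBoxes_v0.2.zip")
if not dest.exists():
    urllib.request.urlretrieve(url, dest)   # large download, may take a while
with zipfile.ZipFile(dest) as zf:
    zf.extractall("TreeBoxes")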

@naxatra2 (Contributor, Author) commented:

I was not able to run the dataset loader or the TreeBoxesDataset, so I have created a custom class to load the locally downloaded dataset. I have also created the pool of labeled and unlabeled images from my dataset. I will run the training loop tonight; I think it will take a lot of time because of the large dataset.

I have attached the notebook that I am working on. It has the things that I mentioned here.

https://www.kaggle.com/code/jiya1404/million
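The custom loader amounts to something like this sketch (the CSV column names are assumptions based on the discussion; the actual release layout may differ):

import pandas as pd
import torch
import torchvision.transforms.functional as F
from PIL import Image
from torch.utils.data import Dataset

class TreeBoxesCSV(Dataset):
    """Minimal loader for a CSV of boxes plus a folder of images."""
    def __init__(self, csv_file, img_root):
        self.df = pd.read_csv(csv_file)
        self.img_root = img_root
        self.images = self.df["image_path"].unique().tolist()

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        name = self.images[idx]
        rows = self.df[self.df["image_path"] == name]
        img = Image.open(f"{self.img_root}/{name}").convert("RGB")
        boxes = torch.tensor(rows[["xmin", "ymin", "xmax", "ymax"]].values,
                             dtype=torch.float32)
        labels = torch.ones((len(rows),), dtype=torch.int64)  # single "Tree" class
        return F.to_tensor(img), {"boxes": boxes, "labels": labels}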

@jveitchmichaelis (Collaborator) commented:

Yes, you'll probably need to custom load it, but the format should be straightforward (CSV + folder full of images).

Can't access the notebook - you might need to enable sharing. @naxatra2

@naxatra2 (Contributor, Author) commented:

I accidentally shared with private access. I have now changed it to public @jveitchmichaelis

@naxatra2 (Contributor, Author) commented:

I have committed some changes in this PR and closed the other to avoid confusion. @jveitchmichaelis @bw4sz

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis @bw4sz Can you please tell me how I should add the Label Studio integration part? In the meeting we discussed that some functions were already made; or should I make them?

@naxatra2 (Contributor, Author) commented:

Could you also review my approach in the code, so I can refine it before our next meeting? Then I will also commit more sampling techniques in this PR.

@jveitchmichaelis (Collaborator) commented Aug 27, 2025

@naxatra2 use the official API as much as possible I think: https://labelstud.io/guide/sdk

Looks like it's coming along well :)

Review - my main comment is that you could try to use more of the existing functionality in DeepForest: for example, use the config system that already exists; you can set the training parameters there, as most will stay the same. I suggest having a separate section for the active-learning bits, e.g. the pool CSV, pool image root, sample method, etc. It's good that you're using the deepforest main.

I've left some inline comments. I'd also suggest focusing on integration tests. The current tests are quite hard to follow with all the mocks (if you asked an LLM to help with this, they really like adding monkey patches which isn't always good). We definitely want to check with real model + data calls.
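For reference, fetching annotations with the label-studio-sdk Client might look like this minimal sketch (URL, key, and project id are placeholders; newer SDK versions expose a different entry point, so check the current docs):

import os
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key=os.environ["LS_API_KEY"])
project = ls.get_project(1)       # existing annotation project, by id
tasks = project.get_tasks()       # tasks including completed annotations
print(len(tasks))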

@naxatra2 (Contributor, Author) commented Sep 4, 2025

[attachment: train.json]
I was able to fetch the task annotations from my locally hosted Label Studio projects; I have also attached the output I got from it. I'm committing the relevant code here.

I've added some Label Studio-related helper functions, but as of now a user may not be able to use them directly to interact with Label Studio, since there's no CLI or user-friendly interface yet; it's just a collection of helper functions for now. I have not changed the test cases yet, but I am also working on the config file, possibly removing the config class from my code. I have created a new active_learning.yml for this.
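As a sketch, the active-learning section of such a config might look like this (all key names here are hypothetical, not the final schema):

active_learning:
  pool_csv: data/pool_annotations.csv
  pool_image_root: data/pool_images
  sample_method: least_confidence   # random | least_confidence | hybrid
  images_per_cycle: 50
  cycles: 10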

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis Can you please look at my latest commit? I have added some improved test cases for src/deepforest/active_learning.py and they run fine locally, but they are not passing in GitHub Actions. Once I fix that, most of this PR will be complete.

