
Conversation

@naxatra2 (Contributor) commented Jul 3, 2025

I have created a new notebook which is not linked to deepforest in any way; I just wrote my code in the same directory for convenience. This notebook checks whether my model is learning the way I want.

I have used a custom dataset of daisy flowers with 131 annotated images in COCO format. To reproduce this code, you need a training image dataset with annotations in either .json or .csv format, plus a test dataset.

I have used a very light model for training to reduce time. If I replace it with a better model, the accuracy should improve.

Objective

Simulate how an object detector's accuracy (mAP) improves as I iteratively label more images. This example currently uses random sampling from the unlabeled pool, which I will take as the baseline; the next step is to use an active-learning-specific sampling technique.

How to reproduce this (without shipping the giant flower dataset)

  1. Prepare your own dataset in COCO format (or convert from VOC/Pascal/CSV into COCO).

    • You need a JSON with images, annotations, and categories keys. Each annotation must have image_id, bbox in [x,y,w,h], and category_id.
    • Put your image files in two folders (one for training + pool, one for the fixed test/val split).
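For reference, a minimal annotation file with those keys might look like this (file names, sizes, and ids are illustrative, not from the actual dataset):

{
  "images": [
    {"id": 1, "file_name": "daisy_001.jpg", "width": 640, "height": 480}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1, "bbox": [120, 80, 60, 45], "area": 2700, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "daisy"}
  ]
}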

The workflow has three main steps

  1. COCO ↔ CSV conversion

    • The parse_coco(json_file, img_dir) function reads a COCO-style annotation JSON and writes out a flat labels_raw.csv, with one row per bounding box (xmin,ymin,xmax,ymax,label,image_path).
    • The build_coco_gt(df, out_json) utility converts that CSV back into a minimal COCO JSON (images, annotations, categories) so that we can use it later as the "ground truth" for evaluation.
  2. Custom Dataset + DataLoader

    • The FlowerDataset class (subclassing torch.utils.data.Dataset): its __getitem__ loads an image, retrieves the boxes and labels from my CSV/COCO data, applies resizing, converts everything to tensors, and returns (image, target_dict) for TorchVision detection models.
  3. Active-Learning Loop

    • For each of ROUNDS cycles:
      1. Build DataLoaders for the current train_idx and test_idx.
      2. Train a fresh Faster R-CNN on the labeled subset.
      3. Evaluate on the validation set, record the mAP, and print it.
      4. Randomly sample POOL_BATCH new images from the pool to add to train_idx.
  • After all rounds finish, I plot the number of labeled images against mAP@0.5 to see how performance scales as I label more data; a minimal sketch of this loop follows below.
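To make the loop concrete, here is a minimal sketch of the random-sampling cycle, assuming the FlowerDataset from step 2 and hypothetical settings (ROUNDS, POOL_BATCH, split sizes, file paths); it illustrates the structure rather than reproducing the notebook's exact code.

import random
import torch
from torch.utils.data import DataLoader, Subset
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchmetrics.detection import MeanAveragePrecision

ROUNDS, POOL_BATCH, NUM_CLASSES = 5, 10, 2   # hypothetical settings
device = "cuda" if torch.cuda.is_available() else "cpu"

def collate(batch):
    # Detection batches are lists of (image, target) pairs of varying size
    return tuple(zip(*batch))

dataset = FlowerDataset("labels_raw.csv", "images/")  # Dataset class from step 2; paths illustrative
all_idx = list(range(len(dataset)))
random.shuffle(all_idx)
test_idx, pool_idx = all_idx[:30], all_idx[30:]       # fixed test split
train_idx = [pool_idx.pop() for _ in range(POOL_BATCH)]

history = []
for round_num in range(ROUNDS):
    train_loader = DataLoader(Subset(dataset, train_idx), batch_size=2,
                              shuffle=True, collate_fn=collate)
    test_loader = DataLoader(Subset(dataset, test_idx), batch_size=2,
                             collate_fn=collate)

    # Train a fresh detector on the current labeled subset
    model = fasterrcnn_resnet50_fpn(num_classes=NUM_CLASSES).to(device)
    optim = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
    model.train()
    for images, targets in train_loader:
        images = [im.to(device) for im in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss = sum(model(images, targets).values())
        optim.zero_grad(); loss.backward(); optim.step()

    # Evaluate on the fixed test split (boxes assumed xyxy, the metric default)
    metric = MeanAveragePrecision()
    model.eval()
    with torch.no_grad():
        for images, targets in test_loader:
            preds = model([im.to(device) for im in images])
            metric.update([{k: v.cpu() for k, v in p.items()} for p in preds],
                          list(targets))
    map50 = metric.compute()["map_50"].item()
    history.append((len(train_idx), map50))
    print(f"round {round_num}: {len(train_idx)} images, mAP@0.5 = {map50:.3f}")

    # Random-sampling baseline: move POOL_BATCH pool images into the train set
    train_idx += [pool_idx.pop() for _ in range(min(POOL_BATCH, len(pool_idx)))]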

@naxatra2 (Contributor, Author) commented Jul 3, 2025

@jveitchmichaelis @bw4sz Can you please look into this repo?

This is made in reference to this comment from #1069:

    Thanks @naxatra2, I think a good next step would be to demo a training + random sampling example, without label studio for now. I suggest you use images that we already have labels for (to avoid the human step). Let us know if you need any help putting that together.

    It would be interesting to start totally from scratch e.g. a RetinaNet model that has been trained on MS-COCO. Use that to select your first X images, train and repeat. A nice outcome from the project would be a graph of number_of_images vs model_performance with different sampling strategies. Once we have your loop ready, those experiments should be easy to run on the UF cluster.

@jveitchmichaelis added the "Google Summer of Code" label on Jul 3, 2025
@jveitchmichaelis (Collaborator) commented:

Thanks @naxatra2, some comments:

  • Good to see a self contained example.
  • My suggestion for these experiments is to always run a baseline training loop first, where you attempt to achieve a good score on the test set using the entire training dataset. Normally what you see in papers for active learning is that if you can use the entire dataset, you'll get the best/same results if you train long enough.
  • What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).
  • We usually always start with something trained on MS-COCO or a similar backbone, at least unless the training dataset is 1000s of images. Oddly enough it looks like there isn't a flower class in COCO, which surprised me (there is "potted plant").
  • The mAP in your plots is still basically zero, so it looks like we're a long way from convergence.
  • You're training for a single epoch? I'm not sure what the literature recommends here, but this is very few iterations if your dataset is only 1-2 images at a time.
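For reference, a sketch of the current torchvision weights API (the model choice here is just illustrative):

from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

# Old, deprecated style: fasterrcnn_resnet50_fpn(pretrained=True, pretrained_backbone=True)
# Current style: pass an explicit weights enum, or weights=None for random init
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)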

@bw4sz do you have a suggestion for which tree/box dataset we should start exploring to test this? Neon?

@naxatra2 (Contributor, Author) commented Jul 3, 2025

I was confused about the dataset part, so instead of choosing a heavier and better model (possibly pretrained), I opted for a very light model that I could run without a GPU. I mostly did this to check whether my implemented logic was working. I think this causes most of the issues in my notebook. For example, this part:

What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).

This is mostly because I was experimenting with multiple models, and I forgot to clean the code. I think this and the small dataset are the reasons behind my almost negligible mAP values.

Also, I initially thought of using the NEON dataset, but it was too big, and from GitHub I was only able to find the annotations, not the training images. So I just used a very basic custom dataset to structure my notebook.

@jveitchmichaelis (Collaborator) commented:

If you're running in a Notebook, I would recommend using Google Colab for free GPU access (Nvidia T4). Disk space should be plenty on there too.

@naxatra2 (Contributor, Author) commented Jul 5, 2025

Hi @bw4sz, you mentioned a demo code or structure related to my model training in the last meeting. Can you please provide it?

@bw4sz (Collaborator) commented Jul 7, 2025

So my general thoughts are to integrate within the deepforest models to give a reasonable starting point.

  1. Load a deepforest model
  2. Make a prediction on a set of images
  3. Select images to annotate based on sampling criteria
  4. Gather images and push to Label Studio
  5. Extract annotations from Label Studio
  6. Prepare for training
  7. Go back to step 2.

You could grab a small number of images from https://milliontrees.idtrees.org/en/latest/ and use the deepforest tree model.
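Steps 1-2 might look roughly like this with DeepForest (the image path is a placeholder; older releases use use_release() while newer ones expose load_model(), so check the docs for your version):

from deepforest import main

# Step 1: load the prebuilt DeepForest tree-crown detector
model = main.deepforest()
model.use_release()

# Step 2: predict on one image; returns a DataFrame of boxes with scores
boxes = model.predict_image(path="example_tile.png")
print(boxes[["xmin", "ymin", "xmax", "ymax", "score"]].head())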

@naxatra2 (Contributor, Author) commented Jul 8, 2025

I will update my progress here by tonight, based on these inputs.

@naxatra2 (Contributor, Author) commented Jul 8, 2025

Hi @jveitchmichaelis @bw4sz, can you please look into this Colab file that I created?

https://colab.research.google.com/drive/1C3fINy4rCsPWsWflx-FDtme0q49oRjKr?usp=sharing

My mAP values are still basically non-existent; the max mAP value I am getting is 0.06 after 30-40 epochs, which is not good. Can you please check this? I am not able to figure out what I am doing wrong here. I have also attached the dataset and the resources I referenced in the notebook itself. I am feeling stuck on properly training the baseline model.

@jveitchmichaelis (Collaborator) commented Jul 8, 2025 via email

@naxatra2 (Contributor, Author) commented:

Hi @jveitchmichaelis I’ve spent some more time trying to debug the training, but I’m still getting very low mAP values, and I haven’t been able to make any real progress. I’ve tried going through my setup and references again, but I can’t figure out where I’m going wrong.
Would really appreciate it if you could take a look when you get a chance. I'm feeling quite stuck at the moment.

@jveitchmichaelis (Collaborator) commented Jul 10, 2025

https://colab.research.google.com/drive/1V7NlByb1yBt5-XDAombVQYrzR1I_MZ0a?usp=sharing @naxatra2

The main (only?) change is that I fixed the number of training classes to be len(CLASSES) and made sure that CLASSES corresponds to your custom dataset. The model has two components: a backbone, which extracts features, and a "head", which takes those features and detects objects. The backbone should not change much, as it's already been trained on many images and its output shape stays the same. The head, however, must be changed to detect the new classes (i.e. from 90 in COCO to 4 in your custom one).
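The class fix described above amounts to swapping the detector's box-predictor head for one sized to the custom classes; a sketch (the class list is illustrative):

from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

CLASSES = ["__background__", "daisy", "rose", "tulip"]  # illustrative 4-class list

# Keep the COCO-pretrained backbone; replace only the classification head
model = fasterrcnn_resnet50_fpn(weights=FasterRCNN_ResNet50_FPN_Weights.COCO_V1)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=len(CLASSES))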

Val mAP is still terrible, so my guess is there's a bug in the validation code somewhere - check your box formats? Probably this line:

# Box format must be xywh, default is xyxy
metric = MeanAveragePrecision(box_format='xywh')

Potentially also lower the learning rate.

I also added a download cell at the top so the dataset is pulled if you're making a new environment (not saved to Drive).
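As a quick sanity check, the metric only behaves if predictions and ground truth use the same box convention; a minimal self-contained example (the values are made up):

import torch
from torchmetrics.detection import MeanAveragePrecision

# Both preds and targets below are [x, y, w, h], matching box_format="xywh"
metric = MeanAveragePrecision(box_format="xywh")
preds = [{"boxes": torch.tensor([[10.0, 20.0, 40.0, 30.0]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[10.0, 20.0, 40.0, 30.0]]),
            "labels": torch.tensor([1])}]
metric.update(preds, targets)
print(metric.compute()["map"])  # perfect overlap, so this reports 1.0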

@naxatra2 (Contributor, Author) commented:

I have made some progress, and I think in 1-2 more days I can produce a good plot that shows the improvements in mAP values from using active learning. I was able to fix the issues I was getting earlier, and after running my model for 60 epochs I was getting an almost constant ~0.72 mAP. I trained the model on the whole training dataset (1052 images). This is the baseline I will use later for benchmarks.

After this, I created another notebook for active learning. I thought I had made some progress because I was getting around 0.4-0.5 mAP after 5 cycles, but there were logical errors in my code that I didn't notice, so I have to redo that part. Still, I am not feeling stuck; I can see some progress.

My first step is to randomly pick 10% of the images from the training data and call them labelled; the rest of the dataset is treated as unlabeled. Then I train my model on the labelled images and slowly feed it more images, to see how high the mAP can get with a minimal number of images. My first 10 images are randomly picked; for the next cycles I use least-confidence sampling. The errors I didn't notice at first looked like this:


=== Cycle 1 — training on 210 images ===
  Epoch 1/5 — loss 0.6135
  Epoch 2/5 — loss 0.5633
  Epoch 3/5 — loss 0.5405
  Epoch 4/5 — loss 0.4992
  Epoch 5/5 — loss 0.4600
→ Validation mAP: 0.3623
→ Adding to seed: []

=== Cycle 2 — training on 260 images ===
  Epoch 1/5 — loss 0.4367
  Epoch 2/5 — loss 0.4456
  Epoch 3/5 — loss 0.4073
  Epoch 4/5 — loss 0.3893
  Epoch 5/5 — loss 0.4173
→ Validation mAP: 0.3752
→ Adding to seed: []

=== Cycle 3 — training on 260 images ===
  Epoch 1/5 — loss 0.3749
  Epoch 2/5 — loss 0.4055
  Epoch 3/5 — loss 0.3657
  Epoch 4/5 — loss 0.3213
  Epoch 5/5 — loss 0.3241
→ Validation mAP: 0.4241
→ Adding to seed: []

=== Cycle 4 — training on 260 images ===
  Epoch 1/5 — loss 0.3488
  Epoch 2/5 — loss 0.3353
  Epoch 3/5 — loss 0.3054
  Epoch 4/5 — loss 0.3495
  Epoch 5/5 — loss 0.2992
→ Validation mAP: 0.4452
→ Adding to seed: []


=== Cycle 5 — training on 260 images ===
  Epoch 1/5 — loss 0.3231

At first I only looked at the first two cycles and saw the dataset growing in size, but that stops being the case later: the number of images stays constant after cycle 2 (note the empty "Adding to seed: []" lines), which makes these results kind of useless. I am trying to fix this. Running these models takes 2-3 hours, so progress is being hindered a little by this issue.
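For reference, the selection step for least-confidence sampling might look like this sketch (a pool loader yielding (images, indices) pairs is an assumption here, not the notebook's actual code):

import torch

@torch.no_grad()
def least_confidence_indices(model, pool_loader, device, k):
    """Score each unlabeled image by 1 - max detection confidence and
    return the k least confident image indices."""
    model.eval()
    scored = []
    for images, indices in pool_loader:
        outputs = model([im.to(device) for im in images])
        for out, idx in zip(outputs, indices):
            top = out["scores"].max().item() if len(out["scores"]) else 0.0
            scored.append((1.0 - top, int(idx)))
    scored.sort(reverse=True)             # most uncertain first
    return [idx for _, idx in scored[:k]]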

I switched to Kaggle from Colab because more CPU and GPU resources are available there. The dataset is the same.

This is the complete notebook that I used for fine-tuning and downloading the weights for my custom dataset: https://www.kaggle.com/code/jiya1404/african-wildlife

This is the new notebook that I am currently working on for active learning pipeline: https://www.kaggle.com/code/jiya1404/al-african-wildlife

@jveitchmichaelis (Collaborator) commented Jul 16, 2025

Great, if you've got it working (or after you figure out what the current bug is) then I think it might be worth moving to a larger dataset with tree imagery. That might give you some more headroom to try different sampling strategies, and in the end we'd like to apply this to various aerial datasets.

You could use the current public release of Milliontrees to start? That would give you 3-4k images with a decent amount of diversity (different locations). Let me know if you need any help there or if you need us to run some longer experiments.

We should also try to get this working with the deepforest training pipeline.

@naxatra2 (Contributor, Author) commented:

As of now, my first step is to create a plot that shows mAP values vs. the number of images needed to obtain them. Once that part is done, my pipeline will be mostly ready, and I can add 1-2 extra sampling techniques (I already have the functions for different sampling techniques in my other PR, so this part should be easier). I am doing this for benchmarking and annotation-efficiency checks.

I want a figure like this (This is not mine, but the graph should represent similar information):
[figure: example plot of model performance vs. number of labeled images]

Once this gets done, I will try to use a larger dataset.

@jveitchmichaelis (Collaborator) commented Jul 17, 2025

Ok, that sounds sensible. I would add that the curves are almost certainly different for different datasets so I'd be curious to see whether aerial (our dataset) vs terrestrial (your example data) photos are more amenable to this.

@naxatra2 (Contributor, Author) commented:

I have used three sampling techniques to check how my model learns.

Graph explanation

  • Random Sampling (Baseline): This serves as our baseline by selecting new images for annotation completely at random from the unlabeled pool. It represents the performance gain without any intelligent selection.

  • Uncertainty Sampling: This improves upon the baseline by having the model query the images it is most confused about. We calculate an uncertainty score for each unlabeled image and prioritize labeling the ones with the highest uncertainty (my code crashed while running this part, so I only have data points up to the 4th cycle).

  • Random + Uncertainty + Diversity Sampling (Hybrid): This refines both of the above. It first identifies a pool of highly uncertain images and then randomly samples from this pool. This hybrid technique ensures we select informative examples while also promoting diversity, which prevents the model from focusing too narrowly on a single type of difficult case and helps improve its overall generalization.

  • The dotted line represents the baseline mAP value I obtained by fine-tuning the RetinaNet model on the custom dataset, without any active learning techniques.

[figure: validation mAP vs. number of labeled images for the three sampling strategies]

This is the notebook that I have worked on: https://www.kaggle.com/code/jiya1404/active-learning-african-wildlife
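For what it's worth, the hybrid selection step described above could be sketched like this (assuming an uncertainties dict mapping pool index to score; all names are illustrative):

import random

def hybrid_sample(uncertainties, k, pool_factor=3, seed=0):
    """Take the top k * pool_factor most uncertain pool indices, then draw
    k of them at random to add diversity to the selection."""
    ranked = sorted(uncertainties, key=uncertainties.get, reverse=True)
    candidates = ranked[:k * pool_factor]
    return random.Random(seed).sample(candidates, min(k, len(candidates)))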

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis Can you please point me to the deepforest dataset I should try running my model on next, as you mentioned in the last meeting?

@jveitchmichaelis (Collaborator) commented Jul 28, 2025

Try the public release of MillionTrees: https://milliontrees.idtrees.org/en/latest/

The "TreeBoxes" dataset. You can use the library API to download it into your Notebook.

@naxatra2 (Contributor, Author) commented:

[screenshot: error output from the MillionTrees download attempt]

I am not able to download this dataset. I first thought I was doing something wrong, but I don't know what is going wrong: I copied the exact code from the getting-started portion of the docs, and it fails to run.

My notebook: https://www.kaggle.com/code/jiya1404/million
Docs: https://milliontrees.idtrees.org/en/latest/getting_started.html#installation

@jveitchmichaelis (Collaborator) commented Jul 30, 2025

@naxatra2 URL seems to be broken, try this: https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip

It's fine here: https://github.com/weecology/MillionTrees/blob/d0d3942e714abf4261c264e4cd5b49bd9a9f8a45/src/milliontrees/datasets/TreeBoxes.py#L64

If you can extract that, you should be good to start.
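If the library loader keeps failing, a plain download-and-extract fallback (using only the URL above) would be:

import pathlib
import urllib.request
import zipfile

url = "https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip"
dest = pathlib.Path("TreeBoxes_v0.2.zip")
if not dest.exists():
    urllib.request.urlretrieve(url, dest)   # large download, may take a while
with zipfile.ZipFile(dest) as zf:
    zf.extractall("TreeBoxes")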

@naxatra2 (Contributor, Author) commented:

I was not able to run the dataset loader or the TreeBoxesDataset, so I have created a custom class to load the locally downloaded dataset. I have also created the pool of labeled and unlabeled images from my dataset. I will run the training loop tonight; I think it will take a lot of time because of the large dataset.

I have attached the notebook that I am working on. It has the things that I mentioned here.

https://www.kaggle.com/code/jiya1404/million
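The custom loader amounts to something like this sketch (the CSV column names are assumptions based on the discussion; the actual release layout may differ):

import pandas as pd
import torch
import torchvision.transforms.functional as F
from PIL import Image
from torch.utils.data import Dataset

class TreeBoxesCSV(Dataset):
    """Minimal loader for a CSV of boxes plus a folder of images."""
    def __init__(self, csv_file, img_root):
        self.df = pd.read_csv(csv_file)
        self.img_root = img_root
        self.images = self.df["image_path"].unique().tolist()

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        name = self.images[idx]
        rows = self.df[self.df["image_path"] == name]
        img = Image.open(f"{self.img_root}/{name}").convert("RGB")
        boxes = torch.tensor(rows[["xmin", "ymin", "xmax", "ymax"]].values,
                             dtype=torch.float32)
        labels = torch.ones((len(rows),), dtype=torch.int64)  # single "Tree" class
        return F.to_tensor(img), {"boxes": boxes, "labels": labels}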

@jveitchmichaelis (Collaborator) commented:

Yes, you'll probably need to custom load it, but the format should be straightforward (CSV + folder full of images).

Can't access the notebook - you might need to enable sharing. @naxatra2

@naxatra2 (Contributor, Author) commented:

I accidentally shared with private access. I have now changed it to public @jveitchmichaelis

@naxatra2 (Contributor, Author) commented:

I have committed some changes in this PR and closed the other to avoid confusion. @jveitchmichaelis @bw4sz

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis @bw4sz Can you please tell me how I should add the Label Studio integration part? In the meeting we discussed that some functions were already made; or should I make them?

@naxatra2 (Contributor, Author) commented:

Could you also review my approach in the code, so I can refine it before our next meeting? Then I will also commit more sampling techniques in this PR.

@jveitchmichaelis (Collaborator) commented Aug 27, 2025

@naxatra2 use the official API as much as possible I think: https://labelstud.io/guide/sdk

Looks like it's coming along well :)

Review - my main comment is that you could try to use more of the existing functionality in DeepForest: for example, use the config system that already exists; you can set the training parameters there, as most will stay the same. I suggest having a separate section for the active-learning bits, e.g. the pool CSV, pool image root, sample method, etc. It's good that you're using the deepforest main.

I've left some inline comments. I'd also suggest focusing on integration tests. The current tests are quite hard to follow with all the mocks (if you asked an LLM to help with this, they really like adding monkey patches which isn't always good). We definitely want to check with real model + data calls.
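For reference, fetching annotations with the label-studio-sdk Client might look like this minimal sketch (URL, key, and project id are placeholders; newer SDK versions expose a different entry point, so check the current docs):

import os
from label_studio_sdk import Client

ls = Client(url="http://localhost:8080", api_key=os.environ["LS_API_KEY"])
project = ls.get_project(1)       # existing annotation project, by id
tasks = project.get_tasks()       # tasks including completed annotations
print(len(tasks))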

@naxatra2 (Contributor, Author) commented Sep 4, 2025

[attachment: train.json]
I was able to fetch the task annotations from my locally hosted Label Studio projects; I have also attached the output I got from it. I'm committing the relevant code here.

I've added some Label Studio-related helper functions, but as of now a user may not be able to use them directly to interact with Label Studio, since there's no CLI or user-friendly interface yet; it's just a collection of helper functions for now. I have not changed the test cases yet, but I am also working on the config file, possibly removing the config class from my code. I have created a new active_learning.yml for this.
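As a sketch, the active-learning section of such a config might look like this (all key names here are hypothetical, not the final schema):

active_learning:
  pool_csv: data/pool_annotations.csv
  pool_image_root: data/pool_images
  sample_method: least_confidence   # random | least_confidence | hybrid
  images_per_cycle: 50
  cycles: 10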

@naxatra2 (Contributor, Author) commented:

@jveitchmichaelis Can you please look at my latest commit? I have added some improved test cases for src/deepforest/active_learning.py and they run fine locally, but they are not passing in GitHub Actions. Once I fix that, most of this PR will be complete.

