active learning loop workflow on custom dataset #1087
base: main
Conversation
@jveitchmichaelis @bw4sz can you please look into this repo? It was made in reference to this comment from #1069.
Thanks @naxatra2, some comments:
@bw4sz do you have a suggestion for which tree/box dataset we should start exploring to test this? NEON?
I was confused about the dataset part, so instead of choosing a heavier and better model (possibly pretrained), I opted for a very light model that I could run without a GPU. I mostly did this to check whether my implemented logic was working or not. I think this causes most of the issues in my notebook, for example this part:
This is mostly because I was experimenting with multiple models and forgot to clean the code. I think this, together with the small dataset, is the reason behind my almost negligible mAP values. Also, I initially thought of using the NEON dataset, but it was too big, and from GitHub I was only able to find the annotations, not the training dataset. So I just used a very basic custom dataset to structure my notebook.
If you're running in a Notebook, I would recommend using Google Colab for free GPU access (Nvidia T4). Disk space should be plenty on there too.
Hi @bw4sz, you mentioned something about demo code or structure related to my model training in the last meeting. Can you provide it, please?
So my general thoughts are to integrate within the deepforest models to give a reasonable starting point.
You could grab a small number of images from https://milliontrees.idtrees.org/en/latest/ and use the deepforest tree model.
I will update my progress here by tonight, based on these inputs.
Hi @jveitchmichaelis @bw4sz can you please look into this colab file that I created: https://colab.research.google.com/drive/1C3fINy4rCsPWsWflx-FDtme0q49oRjKr?usp=sharing My mAP values are still basically non-existent; the max mAP value I am getting is 0.06 after 30-40 epochs, which is not good. Can you please check this? I am not able to understand what I am doing wrong here. I have also attached the dataset and the resources that I referenced in the notebook itself. I am feeling stuck on properly training the baseline model.
Thanks for sharing - I'll have a look through the notebook and see what I can figure out.
Hi @jveitchmichaelis I’ve spent some more time trying to debug the training, but I’m still getting very low mAP values, and I haven’t been able to make any real progress. I’ve tried going through my setup and references again, but I can’t figure out where I’m going wrong.
https://colab.research.google.com/drive/1V7NlByb1yBt5-XDAombVQYrzR1I_MZ0a?usp=sharing @naxatra2 The main (only?) change is that I fixed the number of training classes. Val mAP is still terrible, so my guess is there's a bug in the validation code somewhere - check your box formats? Probably this line:

```python
# Box format must be xywh, default is xyxy
metric = MeanAveragePrecision(box_format='xywh')
```

Potentially also lower the learning rate. I also added a download cell at the top so the dataset is pulled if you're making a new environment (not saved to Drive).
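The box-format mismatch described above (ground truth in one convention, metric expecting another) is the classic cause of near-zero mAP. A minimal dependency-free sketch of the two conversions, with hypothetical helper names, for sanity-checking boxes before they reach the metric:

```python
def xyxy_to_xywh(box):
    """Convert [xmin, ymin, xmax, ymax] to [x, y, width, height]."""
    xmin, ymin, xmax, ymax = box
    return [xmin, ymin, xmax - xmin, ymax - ymin]

def xywh_to_xyxy(box):
    """Inverse conversion, useful for round-trip checks."""
    x, y, w, h = box
    return [x, y, x + w, y + h]
```

If boxes stored as xyxy are fed to a metric constructed with `box_format='xywh'`, the computed overlaps are wrong and mAP collapses toward zero even for a well-trained model.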
I have made some progress, and I think in 1-2 more days I can give a good plot that shows the improvement in mAP values from using active learning. I was able to fix the issues I was getting earlier, and after running my model for 60 epochs I was getting an almost constant ~0.72 mAP value. I trained my model on the whole training dataset (1052 images). This is the baseline case that I will use later for benchmarks.

After this, I created another notebook for active learning. I thought I had made some progress because I was getting around 0.4-0.5 mAP values after 5 cycles, but there were logical errors in my code that I didn't notice, so I have to redo that part again. Still, I am not feeling stuck; I can see some progress.

My first step is to pick 10% of the images randomly from the training data and call them labelled; the rest of my dataset is called unlabeled. Then I train my model on the labelled images and slowly feed more images to it, to see how high the mAP value can get with a minimal number of images. My first 10 images are picked randomly; for the next cycles I am using least-confidence sampling.

The errors I didn't notice at first were like this: I only looked at the first 2 cycles and saw that the dataset was changing in size, but that was not actually the case. This is logically wrong, because my number of images is constant after cycle 2, which makes all of these results kind of useless. I am trying to fix them. Running these models takes 2-3 hrs, so progress is being hindered a little by this issue. I switched from Colab to Kaggle because of the extra CPU and GPU resources available there. The dataset is the same.

This is the complete notebook that I used for fine-tuning and downloading the weights for my custom dataset: https://www.kaggle.com/code/jiya1404/african-wildlife
This is the new notebook that I am currently working on for the active learning pipeline: https://www.kaggle.com/code/jiya1404/al-african-wildlife
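The least-confidence step described above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code; the function name and the shape of `pool_scores` are assumptions:

```python
def least_confidence_sample(pool_scores, k):
    """Pick the k pool images whose most confident detection is lowest.

    pool_scores: dict mapping image id -> list of detection confidences.
    An empty list (model found nothing) is treated as maximally uncertain.
    Returns the k image ids to label next.
    """
    def uncertainty(item):
        _, scores = item
        top = max(scores) if scores else 0.0
        return 1.0 - top  # least confidence = 1 - best detection score
    ranked = sorted(pool_scores.items(), key=uncertainty, reverse=True)
    return [img_id for img_id, _ in ranked[:k]]
```

Images the model is already sure about contribute little new information, so each cycle labels the ones it is least sure about instead.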
Great, if you've got it working (or after you figure out what the current bug is) then I think it might be worth moving to a larger dataset with tree imagery. That might give you some more headroom to try different sampling strategies, and in the end we'd like to apply this to various aerial datasets. You could use the current public release of MillionTrees to start? That would give you 3-4k images with a decent amount of diversity (different locations). Let me know if you need any help there or if you need us to run some longer experiments. We should also try to get this working with the deepforest training pipeline.
Ok, that sounds sensible. I would add that the curves are almost certainly different for different datasets, so I'd be curious to see whether aerial (our dataset) vs terrestrial (your example data) photos are more amenable to this.
I have used 3 sampling techniques to check how my model learns. Graph explanation:
This is the notebook that I have worked on: https://www.kaggle.com/code/jiya1404/active-learning-african-wildlife
@jveitchmichaelis Can you please attach here the deepforest dataset on which I should try to run my model next, as you said in the last meeting?
Try the public release of MillionTrees: https://milliontrees.idtrees.org/en/latest/ The "TreeBoxes" dataset. You can use the library API to download it into your Notebook.
I am not able to download this dataset. I first thought I was doing something wrong, but I don't know what is going wrong. I copied the exact code from the getting-started portion of the docs, but it fails to run. My notebook: https://www.kaggle.com/code/jiya1404/million
@naxatra2 URL seems to be broken, try this: https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip It's fine here: https://github.com/weecology/MillionTrees/blob/d0d3942e714abf4261c264e4cd5b49bd9a9f8a45/src/milliontrees/datasets/TreeBoxes.py#L64 If you can extract that, you should be good to start.
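Fetching the archive directly is a few lines of stdlib code; a hedged sketch (the function name and destination path are placeholders, not part of the MillionTrees API):

```python
import urllib.request
import zipfile
from pathlib import Path

def fetch_and_extract(url, dest_dir):
    """Download a zip archive into dest_dir and unpack it there.

    Returns the path of the downloaded archive. Skips re-downloading
    if the archive already exists, so reruns in a notebook are cheap.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return archive
```

Usage would be something like `fetch_and_extract("https://data.rc.ufl.edu/pub/ewhite/MillionTrees/TreeBoxes_v0.2.zip", "data/")`, after which the CSV and image folder should be on disk.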
I was not able to run the dataset loader. I have attached the notebook that I am working on; it has the things that I mentioned here.
Yes, you'll probably need to custom load it, but the format should be straightforward (CSV + folder full of images). Can't access the notebook - you might need to enable sharing. @naxatra2
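Custom loading a "CSV + folder of images" dataset mostly means grouping the rows by image. A minimal stdlib sketch; the column names (`xmin,ymin,xmax,ymax,label,image_path`) are assumed from the box CSV used earlier in this thread, not the published MillionTrees schema:

```python
import csv
from collections import defaultdict

def load_boxes(csv_path):
    """Read an annotation CSV and group bounding boxes by image.

    Returns {image_path: [(xmin, ymin, xmax, ymax, label), ...]},
    ready to feed into a Dataset's __getitem__.
    """
    boxes = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            boxes[row["image_path"]].append((
                float(row["xmin"]), float(row["ymin"]),
                float(row["xmax"]), float(row["ymax"]),
                row["label"],
            ))
    return dict(boxes)
```

From there, `list(load_boxes(path))` gives the image pool to split into labelled/unlabeled sets.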
I accidentally shared it with private access. I have now changed it to public. @jveitchmichaelis
I have committed some changes in this PR and closed the other one to avoid confusion. @jveitchmichaelis @bw4sz
@jveitchmichaelis @bw4sz can you please tell me how I should add the Label Studio integration part? In the meeting we discussed that some functions were already made? Or should I make them?
Could you also review my approach in the code, so I can refine it before our next meeting? Then I will also commit more sampling techniques in this PR.
@naxatra2 use the official API as much as possible, I think: https://labelstud.io/guide/sdk Looks like it's coming along well :)

Review - my main comment is that you could try to use more of the existing functionality in DeepForest, for example use the config system that already exists; you can set the training parameters there, as most will stay the same. I suggest having a separate section for the active learning bits, e.g. the pool csv, pool image root, sample method, etc. It's good that you're using the deepforest main. I've left some inline comments.

I'd also suggest focusing on integration tests. The current tests are quite hard to follow with all the mocks (if you asked an LLM to help with this, they really like adding monkey patches, which isn't always good). We definitely want to check with real model + data calls.
I’ve added some Label Studio related helper functions, but as of now a user may not be able to use them directly to interact with Label Studio, since there’s no CLI or user-friendly interface available; it’s just a collection of helper functions for now. I have not changed the test cases yet, but I am also working on the config file, to possibly remove the config class from my code. I have created a new `train.json`.
@jveitchmichaelis Can you please look at my latest commit? I have added some improved test cases.
I have created a new notebook which is not linked to deepforest in any way; I just wrote my code in the same directory for convenience. This notebook tries to see whether my model is learning in the way that I want or not.
I have used a custom dataset of daisy flowers with 131 annotated images in COCO format (images plus annotations). To reproduce this code, we need a training image dataset with its annotations in either `.json` or `.csv` format, and a test dataset. I have used a very light model for training, to reduce time. If I replace it with a better model, the accuracy can improve.
Objective
Simulating how an object-detector’s accuracy (mAP) improves as I iteratively label more images. This example currently uses random sampling from the unlabeled pool (which I will take as the baseline); the next step is to use an active-learning-specific sampling technique.
How to reproduce this (without shipping the giant flower dataset)
1. Prepare your own dataset in COCO format (or convert from VOC/Pascal/CSV into COCO).
2. The annotation JSON needs `images`, `annotations`, and `categories` keys. Each annotation must have `image_id`, `bbox` in `[x, y, w, h]`, and `category_id`.

Majorly 3 steps in the workflow
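For reference, the required COCO keys can be checked with a tiny skeleton. All values below (file name, sizes, category) are invented examples, not data from the notebook:

```python
# Minimal COCO-format skeleton with the three required top-level keys
coco = {
    "images": [
        {"id": 1, "file_name": "daisy_001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 1, "image_id": 1,
         "bbox": [100, 150, 40, 60],  # [x, y, w, h], NOT xyxy
         "category_id": 1},
    ],
    "categories": [
        {"id": 1, "name": "daisy"},
    ],
}

# Every annotation must reference an existing image and category
image_ids = {img["id"] for img in coco["images"]}
category_ids = {cat["id"] for cat in coco["categories"]}
for ann in coco["annotations"]:
    assert ann["image_id"] in image_ids
    assert ann["category_id"] in category_ids
```

Running these checks before training catches dangling `image_id`/`category_id` references, a common cause of silent evaluation bugs.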
COCO ↔ CSV conversion
The `parse_coco(json_file, img_dir)` function reads a COCO-style annotation JSON and writes out a flat `labels_raw.csv`, with one row per bounding box (`xmin, ymin, xmax, ymax, label, image_path`). The `build_coco_gt(df, out_json)` utility takes that CSV back into a minimal COCO JSON (images, annotations, categories) so that we can use it later as the “ground truth” for evaluation.
Custom Dataset + DataLoader
A `FlowerDataset` class (subclassing `torch.utils.data.Dataset`) whose `__getitem__` loads an image, retrieves its boxes and labels from my CSV/COCO data, applies resizing, converts everything to tensors, and returns `(image, target_dict)` for TorchVision detection models.
Active-Learning Loop
Runs for `ROUNDS` cycles over the `train_idx` and `test_idx` splits; each cycle picks `POOL_BATCH` new images from the pool to add to `train_idx`.
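The loop above can be sketched end-to-end. This is a skeleton, not the notebook code: the train/evaluate calls are left as comments, and the names mirror the constants mentioned above (`ROUNDS`, `POOL_BATCH`):

```python
import random

def active_learning_loop(all_ids, rounds, pool_batch,
                         seed_frac=0.1, sample_fn=None, seed=0):
    """Skeleton of the active-learning loop; model calls are stubs.

    Starts from a random seed_frac fraction of labelled ids, then each
    round moves pool_batch ids from the unlabeled pool into the
    labelled set (randomly, or via sample_fn for e.g. least-confidence
    sampling). Returns the labelled-set size per round, so the growth
    can be asserted - the bug described earlier in the thread was the
    labelled set silently not growing after cycle 2.
    """
    rng = random.Random(seed)
    labelled = rng.sample(all_ids, max(1, int(len(all_ids) * seed_frac)))
    pool = [i for i in all_ids if i not in labelled]
    sizes = []
    for _ in range(rounds):
        # model = train(labelled)            # real training goes here
        # map_score = evaluate(model, test)  # real evaluation goes here
        picks = (sample_fn(pool, pool_batch) if sample_fn
                 else rng.sample(pool, min(pool_batch, len(pool))))
        labelled.extend(picks)
        pool = [i for i in pool if i not in picks]
        sizes.append(len(labelled))
    return sizes
```

Checking that the returned sizes strictly increase each round is a cheap guard against the constant-dataset-size bug, without waiting hours for full training runs.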