
Conversation

@mturk24
Contributor

@mturk24 mturk24 commented Dec 18, 2025

Summary

Fix the tutorials to read datasets from the HuggingFace Hub instead of from S3.

The relevant Dataset cards are:

The S3 datasets for each of these tutorials have been deleted.

I tested with Python 3.11 on my local machine.

mturk24 and others added 4 commits December 18, 2025 17:25
Replace S3 URLs with HuggingFace Hub dataset loading across all 4 tutorials:
- improving_ml_performance.ipynb: Use load_dataset("Cleanlab/student-grades")
- object_detection.ipynb: Use hf_hub_download for labels, predictions, and images
- segmentation.ipynb: Use hf_hub_download for given_masks and predicted_masks
- token_classification.ipynb: Use hf_hub_download for pred_probs

All tutorials now load data from HuggingFace Hub instead of S3, with proper
imports and dependencies added (datasets, huggingface_hub).

🤖 Generated with Claude Code
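
For reference, a rough sketch of the loading pattern this commit describes for improving_ml_performance.ipynb (the "train" split name and the pandas conversion here are assumptions, not copied from the tutorial):

from datasets import load_dataset

# Load the tutorial data from the HuggingFace Hub instead of S3.
dataset = load_dataset("Cleanlab/student-grades")

# Assumed split name; the tutorial may use a different split or format.
df = dataset["train"].to_pandas()
print(df.head())
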
The tutorial notebooks were failing with 404 errors because hf_hub_download()
defaults to looking for model repositories, not dataset repositories.

Fixed by adding the repo_type="dataset" parameter to all hf_hub_download() calls in:
- object_detection.ipynb (3 downloads: labels.pkl, predictions.pkl, example_images.zip)
- segmentation.ipynb (2 downloads: given_masks.npy, predicted_masks.npy)
- token_classification.ipynb (1 download: pred_probs.npz)

This ensures the downloads use the correct URL format:
https://huggingface.co/datasets/Cleanlab/... instead of
https://huggingface.co/Cleanlab/...

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
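
A minimal sketch of the corrected call, using the segmentation repo and file name that appear in the diff below (the before/after URLs match the ones listed in this commit):

from huggingface_hub import hf_hub_download

# The default repo_type is "model", which resolves to
# https://huggingface.co/Cleanlab/... and returns 404 for these dataset files.
# repo_type="dataset" resolves to https://huggingface.co/datasets/Cleanlab/... instead.
given_masks_path = hf_hub_download(
    "Cleanlab/segmentation-tutorial",
    "given_masks.npy",
    repo_type="dataset",
)
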
@mturk24 mturk24 changed the title from "[WIP]: Migrate tutorials to hf hub" to "Migrate CLOS tutorials to hf hub" on Dec 19, 2025
@mturk24 mturk24 requested review from elisno and jwmueller December 19, 2025 14:59
"source": [
"# Package installation (hidden on docs website).\n",
"dependencies = [\"cleanlab\", \"matplotlib\"]\n",
"dependencies = [\"cleanlab\", \"matplotlib\", \"huggingface_hub\"]\n",
Member

Could you also cap the version of huggingface_hub in our developer/build dependencies at the latest version (assuming you tested with that latest version)?

It's currently pinned to an old version:

huggingface_hub==0.25.2 # TODO: uncap version

"%%capture\n",
"!wget -nc 'https://cleanlab-public.s3.amazonaws.com/ImageSegmentation/predicted_masks.npy' "
]
"source": "from huggingface_hub import hf_hub_download\n\n# Download from HuggingFace Hub\ngiven_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'given_masks.npy', repo_type=\"dataset\")\npredicted_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'predicted_masks.npy', repo_type=\"dataset\")"
Member

In case this happens to print a lot of output (which we don't want to show on our live docs site), keep the %%capture statement at the top of the cell (same for all other tutorials).
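
For example, the segmentation cell from the diff above could keep the magic at the top of the cell, something like:

%%capture
from huggingface_hub import hf_hub_download

# %%capture suppresses any download/progress output so it doesn't show up on the docs site.
given_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'given_masks.npy', repo_type="dataset")
predicted_masks_path = hf_hub_download('Cleanlab/segmentation-tutorial', 'predicted_masks.npy', repo_type="dataset")
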
