---
title: 'training with less data'
tags: 'journal'
date: 'Oct 18, 2025'
---

i was wondering if there's a way to know which data actually matters before you even train. like, can you look at your dataset and say "these 2k examples are worth more than those 10k"?

does more data always = better model?

[research](https://arxiv.org/abs/2001.08361) shows it's a power law, not linear:

```
100 samples → loss = 10
1,000 samples (10x more) → loss = 5 (not 1)
10,000 samples (10x more) → loss = 2.5 (not 0.5)
```
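
to make the shape concrete, here's a tiny sketch of the power law implied by those toy numbers (the exponent below comes from "10x data halves the loss", not from the paper's actual fits, which are smaller):

```python
import math

# toy power law: loss(N) = C * N^(-alpha)
# "10x more data halves the loss" implies alpha = log10(2) ≈ 0.30
alpha = math.log10(2)
C = 10 * 100**alpha  # pick C so that loss(100) = 10

for n in [100, 1_000, 10_000, 100_000]:
    loss = C * n ** (-alpha)
    print(f"{n:>7} samples -> loss ≈ {loss:.2f}")
```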

the diminishing returns [hold across seven orders of magnitude](https://www.pnas.org/doi/10.1073/pnas.2311878121).

what about finetuning? since the base model already knows a lot and you're just teaching it something specific, does the same rule apply?

yes, but you might only need 20-50% of your data to get 95% of the performance. so which 20-50%?

j morris showed that models have a [capacity limit](https://arxiv.org/abs/2505.24832). GPT-style models memorize ~3.6 bits per parameter.

this means a 1B parameter model can only memorize ~450MB of information. that's your budget.
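
the arithmetic checks out as a back-of-the-envelope estimate:

```python
params = 1_000_000_000   # 1B parameter model
bits_per_param = 3.6     # capacity estimate from the paper
capacity_bytes = params * bits_per_param / 8
print(f"~{capacity_bytes / 1e6:.0f} MB of memorizable information")  # ~450 MB
```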

training on more data doesn't increase the budget. it just spreads it thinner.

when you exceed capacity, the model is forced to generalize instead of memorize. this explains grokking - that moment when performance suddenly jumps.

so the question becomes: which data fills the budget?

if you have lots of data, keep hard examples. easy ones are redundant.

if you have little data, keep easy examples. hard ones might just be noise.
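
a minimal sketch of that rule, assuming you already have a per-example difficulty score (say, loss under a reference model) - the scoring and the keep fraction are stand-ins, not a specific paper's method:

```python
import numpy as np

def select(scores: np.ndarray, keep_frac: float, data_rich: bool) -> np.ndarray:
    """Pick which examples to keep given per-example difficulty scores.

    scores: higher = harder (e.g., loss under a reference model)
    data_rich: True  -> keep the hardest examples (easy ones are redundant)
               False -> keep the easiest (hard ones may be noise / mislabeled)
    """
    k = max(1, int(keep_frac * len(scores)))
    order = np.argsort(scores)  # sorted easy -> hard
    return order[-k:] if data_rich else order[:k]

# hypothetical usage: 10k scored examples, keep 30%
scores = np.random.rand(10_000)
kept_indices = select(scores, keep_frac=0.3, data_rich=True)
```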

[someone showed](https://arxiv.org/abs/2206.14486) you can discard 20% of ImageNet without hurting performance, and that good pruning metrics could beat power law scaling entirely.

how do you actually do this though?

there's [information bottleneck](https://adityashrm21.github.io/Information-Theory-In-Deep-Learning/) theory - find the maximally compressed mapping of the input that still preserves information about the output. keep only the data that tells you something useful.

practical methods exist:
- [coreset selection](https://arxiv.org/abs/1907.04018) - finds a small weighted subset that approximates the full dataset
- geometry-based pruning - preserve feature space structure
- uncertainty-based - keep what the model is uncertain about
- error-based - keep high-loss examples
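
as a taste of the geometry-based flavor, here's a k-center greedy sketch (a common coreset heuristic) over precomputed embeddings - the embeddings and the budget are placeholders, not any particular paper's recipe:

```python
import numpy as np

def k_center_greedy(emb: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedily pick the point farthest from the current selection,
    so the kept subset covers the embedding space."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(emb.shape[0]))]           # arbitrary first pick
    dists = np.linalg.norm(emb - emb[selected[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dists))                        # farthest remaining point
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[idx], axis=1))
    return selected

# hypothetical usage: 10k examples embedded in 512-d, keep 2k of them
emb = np.random.randn(10_000, 512).astype(np.float32)
subset = k_center_greedy(emb, budget=2_000)
```

(note the cost: budget passes of distance computations over the whole set - a small example of the "expensive to compute" problem below.)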

problem: most of these don't scale well, and the best ones are expensive to compute.

there's also this idea of [four scaling regimes](https://www.pnas.org/doi/10.1073/pnas.2311878121). basically asking two questions:

1. is the bottleneck your data or your model?
2. is the problem noise or lack of detail?

the second question distinguishes two limitations:

- **variance-limited:** error comes from noise in limited samples (like photos in a dark room)
- **resolution-limited:** can't capture fine-grained patterns (like a pixelated image)

knowing which regime you're in tells you whether more data helps or whether you need something else.
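
one crude way to probe this (my own framing, not the paper's recipe): train at a few dataset sizes and fit the log-log slope. if the fitted exponent is still healthy, more data keeps paying; if it has flattened out, the bottleneck is probably somewhere else:

```python
import numpy as np

# hypothetical (dataset size, validation loss) pairs from a few small runs
sizes = np.array([1_000, 4_000, 16_000, 64_000])
losses = np.array([3.10, 2.40, 1.95, 1.72])

# fit loss ≈ C * N^(-alpha) in log-log space
slope, _ = np.polyfit(np.log(sizes), np.log(losses), 1)
print(f"fitted data exponent alpha ≈ {-slope:.2f}")
```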

j morris has also shown that [embeddings](https://arxiv.org/abs/2505.12540) from different models converge to similar representation geometries.

if there's a universal geometry, maybe there's an optimal compression of training data that fills that structure efficiently.

there's also a ton of research on synthetic data that fits into this equation as well. a rabbit hole that i would love to dive into some other time.