Optimized small model training is not only important for availability but also for the scientific study of LLMs. It’s like the use of simple organisms like yeast for biological studies - we also need to study the simplest possible transformers that exhibit behaviors of interest from the larger models if we hope to ever understand LLMs and have more control over their behavior.
willvarfar 34 minutes ago [-]
(there are also lots of private company datasets like e.g. user purchase history that can be used with small models to solve real business problems. All the advances in 'large' language models can be leveraged and applied to small problems if the input sequences can be represented as a special custom language.)
ai-christianson 21 minutes ago [-]
I'm interested in one that can run fast on a laptop, but training can take a few days (maybe even longer) on the same laptop.
smeeth 28 minutes ago [-]
I've been annoyed for a while that people don't use a common parameter-count/compute budget for benchmarking papers.
That said, it does make it easier to claim progress...
arethuza 20 minutes ago [-]
Thanks - that's one of the most interesting comments I've seen about LLMs.
Makes me want to try training a model to sing "Daisy, Daisy..."
biophysboy 2 hours ago [-]
It’s a fun analogy because the data “environment” of the model being trained matters a great deal
jebarker 56 minutes ago [-]
Exactly. YOLO runs of frontier models with a single random seed/data shuffle are pretty limited for trying to study the “molecular biology”. I actually like to think of LLM understanding as being like biology in the 1850s. There's lots of inspiration to be found in how biology has advanced since then and the types of experiments we might run to better understand LLMs.
chasd00 14 minutes ago [-]
AI is a broad term, the zero-to-hero series by Karpathy trains one in a Jupyter notebook. You can make some pretty powerful networks to de-duplicate database rows right on your laptop too. Data de-duplication and general MDM is pretty useful in large businesses.
zarzavat 4 hours ago [-]
Instead of time it should be energy. What is the best model you can train with a given budget in Joules. Then the MBP and the H100 are on a more even footing.
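For rough scale (the wattages below are assumptions, not measurements; H100 SXM is nominally 700 W, and a MacBook-class package under training load is somewhere in the tens of watts):

    # Back-of-envelope Joule budget: 5 minutes of laptop training vs. the same
    # energy spent on an H100. Both power figures are rough assumptions.
    laptop_watts, h100_watts = 60, 700
    budget_joules = laptop_watts * 5 * 60          # 5 minutes on the laptop
    h100_seconds = budget_joules / h100_watts      # same energy on the H100
    print(f"{budget_joules} J buys about {h100_seconds:.0f} s of H100 time")  # ~26 s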
NooneAtAll3 3 hours ago [-]
it's not about efficiency - it's about availability
H100 is not an everyday product. Laptop is
Sharlin 44 minutes ago [-]
H100s are almost-instantly available to anyone with a credit card and access to the internet. Without even having to lift their butt from the seat. And you get plenty more than five minutes of compute for the price of an M4.
jsperson 38 minutes ago [-]
For the orgs where I've worked, the important thing isn't availability of compute, it's security. Using what we have on our local network is much easier from a governance and approval standpoint than whatever is available on the internet.
potatolicious 22 minutes ago [-]
And yet just about any intro-to-programming tutorial gets something running on your local machine, and local machine development continues to be the default for most people, even though devving on a cloud machine is eminently reasonable.
"Pull out credit card, sign up for some thing and pay a bit of money" is a non-trivial bit of friction! Extremely non-trivial!
Especially in a corporate context - you have to get the expense approved. It's not clear if you can put company data onto the machine. Whereas generally running local things on corporate laptops is far less controversial.
"Download this tool and run it." is still an extremely powerful pitch. Pretty much the only thing that beats it is "go to this website which you can use without any signup or payment".
nickpsecurity 25 minutes ago [-]
Also, my laptop running Linux and its outputs are probably mine and private. If I use cloud GPU's, I need to be a lawyer to be sure what they can or can't do with my data or models.
There's also no overages or hidden charges with a laptop. Past simply breaking it. You know the replacement cost ahead of time, though.
Der_Einzige 2 hours ago [-]
At this point, given how many H100s there are in existence, it’s basically an everyday product.
logicchains 2 hours ago [-]
I envy you if $25k is an everyday product cost.
falcor84 2 hours ago [-]
Maybe not to buy one, but to rent one. Like how barista-made coffee is an everyday product even though most people can't afford a fancy professional coffee machine.
jeroenhd 2 hours ago [-]
For what it's worth, most of the world can't afford an M4 Macbook either.
wongarsu 1 hours ago [-]
And renting an H100 for an hour is a lot easier than renting an M4 MacBook for an hour.
KeplerBoy 3 hours ago [-]
Still, I don't think the m4 is going to be far off from the h100 in terms of energy efficiency.
edit: fixed typo
menaerus 3 hours ago [-]
What efficiency did you have in mind? Bandwidth-wise M4 is ~10x to ~30x lower.
KeplerBoy 3 hours ago [-]
ah, i mistyped. I meant energy efficiency, not memory efficiency.
giancarlostoro 3 hours ago [-]
The Mac is more competitive on power consumption though, since it's never pulling as much as an Nvidia GPU, is my understanding.
On that note, you can rent an H100 for an hour for under $10, which might make for a slightly more interesting test: what's the best model outcome you can train in under an hour?
dtnewman 2 hours ago [-]
> you can rent an H100 for an hour for under $10
Far cheaper these days. More like $2-3 for a consumer to do this. For bulk deals, pricing is often < $2.
bigyabai 1 hours ago [-]
It depends. If you're bottlenecked by memory speed, the Mac typically comes out on top.
In terms of compute efficiency though, Nvidia still has Apple beat. Nvidia wouldn't have the datacenter market on a leash if Apple was putting up a real fight.
netcan 2 hours ago [-]
They're all good. Being somewhat arbitrary isn't a bad thing.
jvanderbot 1 hours ago [-]
Bro, why not both?
We can / should benchmark and optimize this to death on all axes
aniijbod 2 hours ago [-]
Let the AI efficiency olympics begin!
On a laptop, on a desktop, on a phone?
Train for 5 minutes, an hour, a day, a week?
On a boat? With a goat?
yojo 32 minutes ago [-]
> With a goat?
I think you meant Llama.
The rhymes are admittedly more limited, unless you have a Boston accent.
Nevermark 1 hours ago [-]
On a maxxxed out Mac Studio M3 Ultra 512GB.
That boat will float your goat!
visarga 1 hours ago [-]
goats have too many parameters, they are like GPT-4
rPlayer6554 1 hours ago [-]
I’d pay for GoatLM
lifestyleguru 1 hours ago [-]
Honestly AI is a trick to make us buy new expensive computers. I'm writing this from an over-10-year-old one, and the computers offered in a leaflet from the nearby electronics store aren't much better.
voidUpdate 25 minutes ago [-]
I mean, gaming is the big pusher of new hardware these days, and web is basically the reason you can use a 90s computer in the modern day. I happily survived on roughly 10 year old components all the way through university because I wasn't playing AAA games
LorenDB 3 hours ago [-]
> Paris, France is a city in North Carolina. It is the capital of North Carolina, which is officially major people in Bhugh and Pennhy. The American Council Mastlandan, is the city of Retrea. There are different islands, and the city of Hawkeler: Law is the most famous city in The Confederate. The country is Guate.
I love the phrase "officially major people"! I wonder how it could be put to use in everyday speech?
api 2 hours ago [-]
[flagged]
emeril 2 hours ago [-]
[flagged]
chias 2 hours ago [-]
This is not true. I watched the clip. She referred to AI as AI. When she said A1 she was very clearly referring to America First.
quaristice 1 hours ago [-]
Snopes confirmed that McMahon began by referring correctly to “AI development,” but in the same response, twice said “A1 teaching,” clearly meaning artificial intelligence. Not steak sauce. Multiple outlets including Gizmodo, Newser, Yahoo News, Mediaite, and Cybernews all reported the slip-up as genuine: she erroneously said “A1” when she meant “AI”.
chias 51 minutes ago [-]
Did you watch the clip yourself? I assume not, so here you go:
https://www.youtube.com/live/lxrg28zBv94?t=7562s
She was the chair of the board of the America First Policy Institute. She's not talking about AI, she's talking about pumping ultra-nationalist, Nazi-adjacent propaganda into Red state education.
tootyskooty 3 hours ago [-]
I suspect one can go a lot further by adopting some tweaks from the GPT-2 speedrun effort [0], at minimum Muon, better init and carefully tuning learning rate.
[0]: https://github.com/KellerJordan/modded-nanogpt
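For anyone curious, here is a rough sketch of the Newton-Schulz orthogonalization step at the core of Muon, loosely following the public modded-nanogpt code. The quintic coefficients are the ones from that repo; details like the aspect-ratio learning-rate scaling are omitted, so treat this as an illustration rather than a drop-in optimizer.

    import torch

    def newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
        # Approximately orthogonalize a 2D momentum/gradient matrix with a
        # quintic Newton-Schulz iteration (the core of the Muon optimizer).
        assert G.ndim == 2
        a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from modded-nanogpt
        X = G.float()
        X = X / (X.norm() + eps)            # Frobenius norm bound keeps singular values <= 1
        transposed = X.size(0) > X.size(1)
        if transposed:
            X = X.T
        for _ in range(steps):
            A = X @ X.T
            B = b * A + c * (A @ A)
            X = a * X + B @ X
        return (X.T if transposed else X).to(G.dtype)

    def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
        # One simplified Muon-style update: momentum on the raw gradient, then
        # replace the update direction with its orthogonalized version.
        momentum_buf.mul_(beta).add_(grad)
        update = newtonschulz5(grad + beta * momentum_buf)  # Nesterov-style lookahead
        W.data.add_(update, alpha=-lr)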
At what point is a simple Markov chain the same or better?
Nevermark 56 minutes ago [-]
It is the other way around.
Neural-type models have long passed the point where markov chains made any sense by many orders of magnitude.
Markov models fail by being too opinionated about the style of compute.
In contrast, a linear tensor + non-linear function has incredible flexibility to transform the topology of information. Given large enough tensors, two such layers, with recurrence, can learn any mapping, static or dynamical. No priors (other than massive compute) needed.
All other neural architectures, then, are simply sparser arrangements that bring compute demands down, where the sparseness is fit to the type of problem.
Sparseness can mean deeper but narrower information flows (thus “deep” learning), or fewer weights per weight application (i.e. shared weights, like convolutions).
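To put rough numbers on the "orders of magnitude" point (the vocabulary size and transformer parameter count below are purely illustrative):

    # Dense context table for an order-n model over a GPT-2-sized vocabulary,
    # vs. the parameter count of a small transformer. Real n-gram models only
    # store observed counts, but the context space they must cover still
    # explodes combinatorially.
    vocab = 50_257
    for order in (2, 3):
        cells = vocab ** (order + 1)
        print(f"order-{order} model: {cells:.2e} context -> next-token cells")
    print(f"small transformer: {10_000_000:.2e} parameters")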
visarga 1 hours ago [-]
The output text turns into word salad after every few words. You can't scale n-gram counting enough to make it work.
I suppose if you only have 5 minutes this is probably about the level you'd get.
nottorp 3 hours ago [-]
But supposing you have a real specific need to train, is the training speed still relevant? Or do the resources spent on gathering and validating the data set dwarf the actual CPU/GPU usage?
wongarsu 1 hours ago [-]
If training is trivially fast, that allows you to iterate on architecture choices, hyperparameters, choices of which data to include, etc.
Of course that only works if the trial runs are representative of what your full scale model will look like. But within those constraints optimising training time seems very valuable
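A sketch of what that iteration loop can look like when every trial gets a fixed wall-clock budget. `train_and_eval` is a hypothetical helper standing in for the actual training code; it is assumed to train until the budget expires and return a validation loss.

    import itertools, time

    def sweep(train_and_eval, budget_s=300):
        # Tiny grid sweep where each configuration gets the same time budget.
        grid = {"lr": [3e-4, 1e-3, 3e-3], "n_layer": [4, 8]}
        results = []
        for values in itertools.product(*grid.values()):
            config = dict(zip(grid.keys(), values))
            start = time.time()
            val_loss = train_and_eval(config, budget_s)
            results.append((val_loss, config, time.time() - start))
        return sorted(results, key=lambda r: r[0])   # best (lowest loss) first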
yalogin 1 hours ago [-]
The bigger question, or maybe even realization, is that with this architecture there is no way to build a capable model that runs on a laptop or phone, which means there will never be local compute and servers become ever more important. In general, thinking about how ML itself works, reducing model size while retaining capability will just never happen.
simonw 1 hours ago [-]
This post is about training, not inference.
The lesson here is that you can't use a laptop to train a useful model - at least not without running that training for probably decades.
That doesn't mean you can't run a useful model on a laptop that was trained on larger hardware. I do that all the time - local models got really good this year.
> reducing model size while retaining capability will just never happen.
Tell that to Qwen3-4B! Those models are remarkably capable.
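To put rough numbers on the "decades" point above (the 6*N*D FLOPs-per-token rule of thumb is standard, but the throughput figure for a laptop GPU is a ballpark assumption):

    # Compute needed to pretrain a modest model vs. what a laptop-class GPU
    # can plausibly deliver. All figures are rough assumptions.
    n_params = 7e9              # 7B parameter model
    n_tokens = 140e9            # ~Chinchilla-optimal (20 tokens per parameter)
    flops_needed = 6 * n_params * n_tokens
    laptop_flops_per_s = 30e12  # assumed sustained throughput
    seconds = flops_needed / laptop_flops_per_s
    print(f"{seconds / 86400 / 365:.1f} years at full utilization")  # ~6 years

Realistic utilization, restarts, and anything bigger than 7B push that figure up fast, which is where the "decades" intuition comes from.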
grim_io 1 hours ago [-]
It's always a question of "compared to what?"
Local models are nowhere near as capable as the big frontier models.
While a small model might be fine for your use case, it can not replace Sonnet-4 for me.
simonw 19 minutes ago [-]
Sure, Qwen-3-4B - a 4GB download - is nowhere near as capable as Claude Sonnet 4.
But it is massively more capable than the 4GB models we had last year.
Meanwhile, recent models that are within the same ballpark of capabilities as Claude Sonnet 4 - like GLM 4.5 and Kimi K2 and the largest of the Qwen 3 models - can just about fit on a $10,000 Mac Studio with 512GB of RAM. That's a very notable trend.
sdenton4 51 minutes ago [-]
It depends, actually...
The data and training time requirements seem to increase exponentially for linear gains in performance. As a result, you can often trade a 10x reduction in training time for a model with 90+% of the real deal. And as we accumulate more architecture and efficiency tricks, the ceiling on what you can do locally goes up commensurately.
There's also a whole world of data curation to improve training, which is likely to be great for small models and seems still underexplored.
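One way to make the diminishing-returns picture concrete is a Chinchilla-style loss curve. The constants below are roughly the fitted values reported by Hoffmann et al. (2022) and are used purely for illustration, not prediction:

    # Chinchilla-style loss curve L(N, D) = E + A / N**alpha + B / D**beta,
    # with ballpark constants in the spirit of Hoffmann et al. (2022).
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(n_params, n_tokens):
        return E + A / n_params**alpha + B / n_tokens**beta

    for scale in (1e8, 1e9, 1e10, 1e11):
        print(f"N = D = {scale:.0e}: loss ~ {loss(scale, scale):.2f}")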
highfrequency 2 hours ago [-]
This is awesome - thanks for sharing. Appreciate the small-scale but comprehensive studies testing out different architectures, model sizes and datasets.
Would be curious to see a version of your model size comparison chart but letting the training continue until perplexity plateaus / begins to overfit. For example: are your larger models performing worse because they are overfitting to a small dataset, or because you are comparing model sizes at a fixed 5 minute computation time - so that the large models just don't get to learn very much in that time.
(Also interesting would be learning curve comparisons between architecture/param count)
l5870uoo9y 3 hours ago [-]
The most powerful Macbook Pro currently has 16 CPU cores, 40 GPU cores, and 128 GB of RAM (and a 16-core “neural engine” specifically designed to accelerate machine learning). Technically, it is a laptop, but it could just as well be a computer optimized for AI.
Apple M3 Ultra (GPU - 80 cores) scores 7235.31
NVIDIA GeForce RTX 5090 Laptop GPU scores 7931.31
Note the memory constraints of NVIDIA are not like Apple silicon which tends to also be less i/o constrained. YMMV
https://www.youtube.com/watch?v=d8yS-2OyJhw
https://www.youtube.com/watch?v=Ju0ndy2kwlw
Apple m3/m4 silicon is certainly good in some ways, but the bottleneck is often a lack of CUDA software support and price (could buy >4 times the GPU raw performance on a dual rtx 5090 desktop.) =3
hodgehog11 3 hours ago [-]
I love seeing explorations like this, which highlight that easily accessible hardware can do better than most people think with modern architectures. For many novel scientific tasks, you really don't need an H100 to make progress using deep learning over classical methods.
faangguyindia 55 minutes ago [-]
The best LLMs on the planet right now are Gemini Pro 2.5 and Gemini Flash 2.5; nothing comes close to these.
Once you set up a good system prompt on these, nothing really compares.
Most of the models you see with high benchmarks are not even comparable on real tasks.
Qwen3 or DeepSeek R1 aren't even 1/10 as good as Gemini Pro 2.5.
dvrj101 50 minutes ago [-]
> not even comparable on real tasks.
Care to elaborate on how Gemini completed this task successfully and how other models fumbled?
faangguyindia 44 minutes ago [-]
I am using AI to write full projects with complete code generation, and I haven't found any model that comes close to Gemini Pro 2.5 in code reasoning and generation.
While other models like Qwen3 and GLM promise big in real code writing, they fail badly and get stuck in loops.
The only problem I run into with Gemini right now is that I get throttled every now and then with empty responses, especially around this time.
howmayiannoyyou 51 minutes ago [-]
Then they are not the best. Most users aren't prompt engineers and grew up expecting to enter search terms into Google and get a result. If it's the case that OpenAI or Anthropic are best able to interpret user intent, there's a good argument to be made that they are the best.
faangguyindia 33 minutes ago [-]
this is something people do not understand.
If the model trusts the user, and the user is dumb, the model will “weigh” the user's input much more heavily and end up with flawed code.
If the model is more independent, it will find the right solution. If you just want a dumb model that says yes to everything and follows you even when you're not smart enough, then you'll never end up with a good solution except by luck.
wowczarek 3 hours ago [-]
Not the point of the exercise obviously, but at five minutes' training I wonder how this would compare to a Markov chain bot.
pilooch 1 hours ago [-]
I'd be interested in which implementation of D3PM was used (and failed). Diffusion models are more data efficient than their AR LLM counterparts but less compute efficient at training time, so it'd be interesting to know whether, given more time to converge, the diffusion approach does succeed. I guess I'll try :)
mhogers 3 hours ago [-]
Any reason to upgrade an M2 16GB MacBook to an M4 ..GB (or 2026 M5) for local LLMs? I'm due an upgrade soon, and perhaps it would be educational to be able to run these things more easily locally?
sandreas 2 hours ago [-]
For LLMs, VRAM is requirement number one. Since MacBooks have unified RAM, you can use up to 75% of it for the LLM, so a higher-RAM model would open more possibilities, but those are much more expensive (of course).
As an alternative you might consider a Ryzen Pro 395+ like in the Framework desktop or HP Zbook G1a but the 128GB versions are still extremely expensive. The Asus Flow Z13 is a tablet with ryzen 395+ but hardly available with 128GB
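A rough way to sanity-check whether a quantized model fits in unified memory under that ~75% rule. The parameter counts, bytes per parameter, and overhead term below are ballpark assumptions for illustration only:

    # Rough fit check for a quantized model in unified memory, assuming ~75%
    # of RAM is usable by the GPU. All numbers are ballpark assumptions.
    def fits(params_billion, bytes_per_param, ram_gb, usable_frac=0.75, overhead_gb=4):
        weights_gb = params_billion * bytes_per_param   # billions of params -> GB
        return weights_gb + overhead_gb <= ram_gb * usable_frac

    print(fits(8, 0.55, 16))     # ~8B model at ~4.4 bits/param on 16 GB  -> True
    print(fits(70, 0.55, 32))    # ~70B model on 32 GB                    -> False
    print(fits(70, 0.55, 128))   # ~70B model on 128 GB                   -> True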
ionwake 3 hours ago [-]
I did just that, got the 32GB RAM one so I could run Qwen.
Might still be early days. I'm trying to use the model to sort my local notes, but I don't know, man - it seems only a little faster yet still unusable, and I downloaded the lighter Qwen model as recommended.
Again, it's early days and maybe I'm being an idiot, but I did manage to get it to parse one note after about 15 mins.
schaefer 2 hours ago [-]
You could train an unbeatable tic-tac-toe ai on your laptop in five minutes. It doesn’t get any stronger than that.
—
I know, I know. I’m intentionally misinterpreting the OP’s clear intent (the stuff of comedy). And normally a small joke like this wouldn’t be worth the downvotes…
But, I think there’s a deeper double meaning in this brave new world of prompt engineering. Most chat isn’t all that precise without some level of assumed shared context:
These days the meaning of the phrase ai has changed from the classical definition (all algorithms welcome), and now ai usually means LLMs and their derivatives.
silverlake 1 hours ago [-]
I’m actually working on just this. What’s the smallest training data set required to learn tic-tac-toe? A 5yo doesn’t need much training to learn a new game, but a transformer needs millions of samples.
rkomorn 1 hours ago [-]
> A 5yo doesn’t need much training to learn a new game
A 5yo also has... 5 years of cumulative real world training. I'm a bit of an AI naysayer but I'd say the comparison doesn't seem quite accurate.
silverlake 1 hours ago [-]
It’s a glib analogy, but the goal remains the same. Today’s training sets are immense. Is there an architecture that can learn something with tiny training sets?
rkomorn 31 minutes ago [-]
I'm certainly not challenging anything you're writing, because I only have a very distant understanding of deep learning, but I do find the question interesting.
Isn't there a bit of a defining line between something like tic-tac-toe, which has a finite (and, for a computer, pretty limited) set of possible combinations, where it seems like you shouldn't need a training set larger than that set of combinations, and something more open-ended, where the size of your training set mainly impacts accuracy?
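For concreteness, the whole game really is small enough to enumerate outright; a quick brute-force count of positions reachable in play (the commonly cited figure is 5,478):

    # Count tic-tac-toe positions reachable in play (search stops at a win or
    # a full board), to make the "finite and pretty limited" point concrete.
    def winner(b):
        lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
        return next((b[i] for i, j, k in lines if b[i] != " " and b[i] == b[j] == b[k]), None)

    seen = set()
    def explore(board, player):
        seen.add(board)
        if winner(board) or " " not in board:
            return
        for i, c in enumerate(board):
            if c == " ":
                explore(board[:i] + player + board[i+1:], "O" if player == "X" else "X")

    explore(" " * 9, "X")
    print(len(seen))   # 5478 positions, including the empty board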
Daltonagray 1 hours ago [-]
This sounds super interesting. Will you be sharing your work anywhere? :)
pjmlp 2 hours ago [-]
Which laptop, though?
yunusabd 2 hours ago [-]
Now imagine what you could do in 6 minutes!
But honestly I really like the short turnaround times. Makes it easy to experiment with different parameters and develop an intuition for what they do.