Right. Local models haven't quite hit that level yet. The biggest open models, which you need tens of thousands of dollars of hardware to run at reasonable speed, have pretty much hit that level of capability, but most models you can reasonably run at home aren't quite there yet. But given the gap, if local models keep improving, you'd expect to maybe see that level by this November.
My understanding is that we could in fact run the largest models on "reasonable" home hardware by focusing on throughput rather than raw speed and having them do unattended inference in large batches. The big proprietary suppliers have no interest in this because their own incentive is to fill all the physical space available with top-performing hardware and doing huge amounts of inference as quickly as possible. A home user with limited hardware investment has very different constraints.
That should do pretty well. Memory bandwidth is the biggest bottleneck for token generation, so at 644 GB/s you should be able to do pretty well on a 9070. Prompt processing is more compute bound, and Nvidia tends to have the edge there.
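A rough way to see why bandwidth dominates token generation: each generated token has to read every active weight once, so bandwidth divided by bytes of active weights gives a ceiling on decode speed. This is only a back-of-the-envelope sketch. The 4.5 bits per weight is an assumed average quant, and the estimate ignores KV cache reads and all other overhead, so real numbers will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_b: float,
                          bits_per_weight: float) -> float:
    """Ceiling on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# 644 GB/s is the figure quoted above; 4.5 bits/weight is an assumption.
dense_31b = max_tokens_per_second(644, 31, 4.5)   # dense model, all 31B params active
moe_a3b = max_tokens_per_second(644, 3, 4.5)      # MoE with about 3B active params

print(f"dense 31B ceiling: {dense_31b:.0f} tok/s")
print(f"MoE A3B ceiling:   {moe_a3b:.0f} tok/s")
```

The gap between the two ceilings is the reason a small-active-parameter MoE feels so much faster on the same card.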
16 GiB won't fit much, so you'd probably want at least 2x, and preferably 3x, of those cards, and then you need a motherboard, power supply, etc. that can handle that.
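To put a number on "won't fit much", here is a small sketch that estimates weight size and the number of 16 GiB cards needed. The 4.5 bits per weight and the 4 GiB per-card reserve for KV cache and buffers are assumptions for illustration, not measured values.

```python
import math

GIB = 1024 ** 3


def weight_gib(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return total_params_b * 1e9 * bits_per_weight / 8 / GIB


def cards_needed(total_params_b: float, bits_per_weight: float,
                 card_gib: float = 16.0, reserve_gib: float = 4.0) -> int:
    """Cards required if reserve_gib per card is kept for KV cache and buffers."""
    usable = card_gib - reserve_gib
    return math.ceil(weight_gib(total_params_b, bits_per_weight) / usable)


print(f"35B weights: {weight_gib(35, 4.5):.1f} GiB")
print(f"cards for 35B: {cards_needed(35, 4.5)}")
```

Note that for an MoE model all experts have to be resident, so it is the total parameter count, not the active count, that decides whether it fits.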
If you believe the benchmarks, Qwen 3.6 35B-A3B already outperforms Claude 4 Opus.
Now, there's a bit of a degree to which some of the open source models do some benchmaxxing, and bigger models with more params may always feel like they have more depth. But anyhow, right now you have something that is arguably comparable to Claude 4 Opus on your laptop. I can't really compare myself because I never used it. It looks like Claude 4 Opus is still available on OpenRouter, so you could try it out and compare yourself if you're interested.
It will likely always be the case that there are proprietary cloud models that are more powerful than what you can run on a laptop. You can just do a whole lot more with terabytes of VRAM on multi-GPU clusters than you can do on a laptop. So for folks who must have the most capable, you're probably not going to want to leave Anthropic.
But right now, the models you can run on your laptop are comparable to the cloud models that were popular when vibecoding and Claude Code first took off.
You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.
Which Opus? They certainly outperform Claude 3 Opus.
Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.
There's a guy on YouTube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.
I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.
Qwen 3.6 produced far more working functionality than Claude 4 Opus did.
Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.
I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.
Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.
Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.
However, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less than $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.
It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.
Agreed, but at their current prices Deepseek + GLM are clear winners in my book. This weekend I spent $5 between the two, whereas I'd probably have to pay $20-30 to Anthropic (and that's still with the massive VC subsidies).
For web development (or anything else with an extreme amount of training data) it's number one for sure. You can't beat it at its costs. US companies will not be able to compete on a competitive market, which is why they rely on so much US government protection + corporate welfare.
There is no Claude 4 Opus model... It's a series of models, of which the strongest is Opus 4.8, and Qwen 3.6 35B-A3B gets 51.5% on SWE-bench Pro to Opus 4.8's 69.2%
But it is still available on Google Vertex according to OpenRouter (though it's possible that info is just out of date, it's currently quoting 3tps which is unusably slow): https://openrouter.ai/anthropic/claude-opus-4
The thing is, to do a proper fix it would really need all of the context (maybe the tool call that failed was for an edit to a file that was last touched way at the beginning of the context), so you'd need to either keep that smaller model running doing prompt processing all the time, or have a very long wait while it does prompt processing on your whole session.
And then also, sometimes the tool call errors are because of something like a file was changed out from under it; the larger model is probably going to do a better job of figuring that out and fixing it up.
Finally, in Pi, you can always just use the /tree command to skip back to before a series of failed tool calls, with a summary if you want to let the model know what happened. The Pi /tree command is pretty powerful in managing your context
An illustrative example I've seen a lot is creating Jira tickets in projects with custom fields marked as mandatory. It tries to create the ticket without the field and the tool call fails. The LLM needs access to the full context so that it can generate text to put in the "Why couldn't this meeting be an email?" field.
This is very similar to my setup. Pi in a container (I do let it have network access, just no access to creds or anything, only the one directory that I'm working on at the time and my ~/.pi directory), talking to llama.cpp in another container. I'm on a Strix Halo 128 GiB unified memory laptop.
I've never used the frontier models in earnest, I don't believe in using proprietary tools for my programming, so I can't really compare.
And I'm still a AI skeptic, so I'm doing more testing and kicking the tires than I am actually using it. That means I spend a lot of time trying to break various models, probe them for strengths and weaknesses, etc.
But I find that when I do try to use it for real for agentic coding, Qwen 3.6 35B-A3B is definitely the one I reach for the most often.
For other chat tasks and translation, I'll frequently use Gemma 4 31B.
For audio, I'll use Gemma 4 12B.
I keep a bunch of other models around to try out every once in a while (Qwen 3.5 122B-A10B, Qwen 3.6 27B, Nemotron 3 Super 122B-A12B, Step 3.7 Flash and Minimax M2.7 both at somewhat more aggressive quants, and GPT-OSS 120B if I want super fast but not terribly smart), but so far Qwen 3.6 35B-A3B is really the sweet spot for coding on a setup like this.
Hopefully this isn't off-topic, but your setup sounds just like mine, Strix Halo and (I'm assuming) llama.cpp on ROCm, and I'm finding that the Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
I use Vulkan mostly instead of ROCm. Vulkan is actually a bit faster, paradoxically. I do switch out and try them both out, and it's not a huge difference, but I've been mostly staying on Vulkan.
The re-processing context every turn problem is definitely something I've hit. Some of the causes have been solved upstream in llama.cpp; make sure you're up to date.
But another cause of the issue that has a big effect is that older Qwen models didn't support preserving thinking. This means that each time you had a long sequence of tool calls with interleaved thinking, as soon as you had your next turn in the chat, it would have to re-process all of that, as it would drop all of the reasoning.
Qwen 3.6, however, now supports preserving thinking. This can use a bit more context, because you're not dropping the thinking every turn, but it re-uses the cache better, so you don't have to reprocess a whole turn at a time each time.
In my models.ini, I have this for the Qwen3.6 models:
Thanks for sharing. I have been running ROCm primarily with Qwen 3.6 and Qwen Coder. On the "runs much better" statement: is that a stability, performance, or other capability difference you're experiencing?
I'm a little surprised that preserve_thinking would matter here for cache purposes. for actual capabilities/intelligence, yes, I'd imagine it helps to have past reasoning traces in multi-turn setups.
but for caching, all you are doing is leaving off a fraction of the most recent assistant message generation, which will have little/no impact on cache hit rate.
> all you are doing is leaving off a fraction of the most recent assistant message generation
True, but not a tiny fraction, qwen is very verbose in its thinking traces. And it basically means that for every (nonthinking) generated token you have to compute the KV twice (once as tg, the second one as pp).
I was able to solve this for my setup, 7900XTX and llama.cpp on ROCm in the oh-my-pi fork of the pi.dev harness. I documented my setup on GitHub, check under my username/omp-config, but the important thing is making sure the context is strictly append-only, and starting llama.cpp with
If you're hitting this you have a bug, this is not related to the model. Either your harness is editing the messages between turns incorrectly (i.e. it is not append-only), or sometimes this is because of llama.cpp bugs, but bet on the former. Setting up something like Tailscale's Aperture will let you capture the requests and then you can diff them.
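Once you have two consecutive requests captured, the append-only check is easy to script. This is a minimal sketch assuming the requests are OpenAI-style message lists. The function name and the sample messages are made up for illustration, and the mutated system prompt is just one hypothetical way a harness could break the prefix.

```python
def first_divergence(prev_messages, next_messages):
    """Index of the first message that changed, or None if next only appends."""
    for index, old in enumerate(prev_messages):
        if index >= len(next_messages) or next_messages[index] != old:
            return index
    return None


turn_1 = [
    {"role": "system", "content": "You are a coding agent."},
    {"role": "user", "content": "Fix the failing test."},
]
# A well-behaved harness only appends to what it sent before.
turn_2_ok = turn_1 + [
    {"role": "assistant", "content": "Done."},
    {"role": "user", "content": "Now add a changelog entry."},
]
# A harness that rewrites the system prompt invalidates the whole cached prefix.
turn_2_mutated = [
    {"role": "system", "content": "You are a coding agent. Time: 12:01"}
] + turn_2_ok[1:]

print(first_divergence(turn_1, turn_2_ok))
print(first_divergence(turn_1, turn_2_mutated))
```

This only checks what the harness sends. The chat template can still change the rendered text after the fact, so a clean result here does not rule out a template-level cause.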
> Qwen hybrid models don't handle prompt caching and instead re-process the context in full on every turn. I'm wondering if you were able to solve this and how?
Isn't this the nature of how LLMs work? Or do you mean that it recalculates the entire KV cache instead of saving the old KV cache, in which case the problem is likely in your executor (llama.cpp, vllm, e.g.) configuration or capabilities?
So, one of the ways that this problem manifests is that most local models aren't trained on preserving the full reasoning between turns. Every turn, they skip passing the reasoning trace from previous turns to the LLM. So if on one turn you have a long interleaved chain of reasoning and tool calls, then it responds to you, and then you give a new prompt to fix something, it has to re-process all of those tool calls now with the reasoning stripped out.
Qwen 3.6 has finally been trained both with and without preserving thinking, so you can optionally enable preserving thinking. This will use up a bit more context, but it will avoid having to do this re-processing of long agentic turns, and also the preserved thinking can avoid having to re-do some of the same reasoning over again in later turns.
Besides that, modern LLMs don't only use full attention (apparently, attention is not all you need). Full attention is very expensive to compute and store (O(n^2)). But additionally, full attention is actually bad at certain kinds of reasoning; keeping track of some value that gets replaced over the course of time, for example. So most models these days use various forms of local attention which is fixed length and gets updated as you go; sliding window attention, Mamba-2 state space models, etc.
But one advantage of attention is that you can go back and reprocess by truncating the KV cache and starting over. You can't do that with other forms of local attention; you've lost the state earlier in the sequence.
So to allow you to go back without fully recomputing the cache all over again, your engine will save snapshots of the local attention state at various times, so if you need to go back to recompute the cache, you can start from the last snapshot. However, these snapshots can get large, you can't keep too many of these, so sometimes you need to go back quite far to get to one, or they're all past the point you need to go back to and you need to start over again from the beginning.
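The snapshot behaviour can be sketched as a lookup: find the latest snapshot at or before the point where the new prompt diverges from the cache, and reprocess from there. This is an illustrative model only, not how llama.cpp actually stores or selects its checkpoints. The function names and token positions are hypothetical.

```python
import bisect


def resume_point(snapshot_positions, divergence):
    """Latest snapshot at or before the divergence; 0 means start from scratch.

    snapshot_positions must be sorted token positions at which the
    recurrent / sliding-window state was saved.
    """
    i = bisect.bisect_right(snapshot_positions, divergence)
    return snapshot_positions[i - 1] if i else 0


def tokens_to_reprocess(snapshot_positions, divergence, prompt_len):
    """How many prompt tokens must be run again before generation can start."""
    return prompt_len - resume_point(snapshot_positions, divergence)


snapshots = [4000, 12000, 20000]

# Divergence at token 15000 of a 24000 token prompt: resume from 12000.
print(tokens_to_reprocess(snapshots, 15000, 24000))
# Divergence before the first snapshot: everything is reprocessed.
print(tokens_to_reprocess(snapshots, 3000, 24000))
```

With a pure full-attention KV cache the first case would only cost 9000 tokens, since the cache could be truncated at exactly 15000. The extra 3000 is the price of only being able to restart from a snapshot.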
There have been particular bugs in llama.cpp that have caused this to be triggered more often than it should; for instance, it wouldn't take snapshots before turns that included images at one point, so if you had an image heavy agentic workflow, that issue plus the lack of preserving thinking would mean you would frequently have to go back and start over from scratch.
Some of these issues have been fixed, some are addressed by preserving thinking. There are still some issues sometimes; for instance, one that's hard to fix is that the tokens generated autoregressively don't always parse the same when doing prefill. For instance, you could generate something as two tokens "pre" and "fill", but it turns out that "prefill" is also a single token so the tokenizer will use that, so when you send that back again on the next turn, it will see a divergence and have to recompute from that point. It might be possible to ignore that and use the not fully greedy tokenization that's in the cache, but I've definitely seen llama.cpp have to do some cache recomputation due to that.
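Here is a toy version of the "pre" + "fill" case. The vocabulary and the longest-match tokenizer are stand-ins made up for illustration. Real tokenizers use BPE merges, but the effect on the cache is the same: the re-tokenized text stops matching the cached token sequence at the first differing token.

```python
VOCAB = {"pre", "fill", "prefill", " ", "the", "cache"}


def greedy_tokenize(text, vocab=VOCAB):
    """Longest-match-first tokenizer, a toy stand-in for a real tokenizer."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens


def common_prefix_len(a, b):
    """Number of leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# What the model happened to sample, token by token (this is what is cached).
generated = ["the", " ", "pre", "fill", " ", "cache"]
# What the tokenizer produces when the same text comes back next turn.
retokenized = greedy_tokenize("".join(generated))

print(retokenized)
print(common_prefix_len(generated, retokenized))
```

Only the first two tokens are reusable in this example, even though the text is byte-for-byte identical.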
Not a harness issue. The harness (pi in my case) passes back the cot for all previous turns.
The jinja template is what renders the openai-format request sent by the harness, into the actual string of text that will be tokenized and fed to the model. For models without preserve thinking support, the jinja template drops the reasoning from all but the current turn.
{#- Render reasoning/reasoning_content as thinking channel -#}
{%- set thinking_text = message.get('reasoning') or message.get('reasoning_content') -%}
{%- if thinking_text and loop.index0 > ns_turn.last_user_idx and message.get('tool_calls') -%}
{{- '<|channel>thought\n' + thinking_text + '\n<channel|>' -}}
{%- endif -%}
You see that it only preserves the thinking for indexes that are later than the last user message; thinking is only preserved for a single turn (which can include a lot of interleaved thinking and tool calls), once it goes back to the user and the user replies, it will replay the tool calls but not the thinking between them.
{%- if (preserve_thinking is defined and preserve_thinking is true) or (loop.index0 > ns.last_query_index) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content + '\n</think>\n\n' + content }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
It additionally has a preserve_thinking flag that you can set. If that's set, it will include all turns' thinking in the text passed to the model. But you do have to set that, it's not the default.
It's possible to modify the jinja file that you're using with a model. Some people do that with models that haven't been specifically trained for it, and report good results; but some report that because it wasn't trained for that, they get worse results if they include thinking from previous turns.
So for models like Gemma, you would have to modify the default jinja to enable this. For Qwen, you can just set the preserve_thinking flag to get this behavior; and apparently they have trained in this mode so you get better results than models that have not trained this way.
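To make the cache effect concrete, here is a plain Python imitation of the Qwen-style condition in the template excerpt above. It is a simplified sketch, not the real template: it handles only role, content, and reasoning_content, and the messages are invented. The point is just to compare the rendered text for one turn against the rendered text for the next.

```python
def render(messages, preserve_thinking=False):
    """Mimic the template rule: keep reasoning only after the last user message,
    unless preserve_thinking is set."""
    last_user = max(i for i, m in enumerate(messages) if m["role"] == "user")
    parts = []
    for i, m in enumerate(messages):
        text = m["content"]
        reasoning = m.get("reasoning_content")
        if reasoning and (preserve_thinking or i > last_user):
            text = f"<think>\n{reasoning}\n</think>\n\n{text}"
        parts.append(f"<|im_start|>{m['role']}\n{text}\n")
    return "".join(parts)


turn_1 = [
    {"role": "user", "content": "Fix the bug."},
    {"role": "assistant", "content": "Fixed.",
     "reasoning_content": "The off-by-one is in the loop bound."},
]
turn_2 = turn_1 + [{"role": "user", "content": "Add a test."}]

# Default: the old reasoning vanishes from the rendered prompt on the next turn,
# so the new prompt is no longer an extension of what was cached.
print(render(turn_2).startswith(render(turn_1)))
# With preserve_thinking the next prompt is a strict extension of the last one.
print(render(turn_2, True).startswith(render(turn_1, True)))
```

The harness sends the same reasoning_content in both cases. Only the rendering rule differs, which is why this shows up as a template issue rather than a harness issue.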
What harness are you using? Some of them (e.g. OpenCode) mutate the system prompt every turn, and therefore can't work with a KV cache.
I've had the best luck with Pi so far, but it comes without some bells and whistles you might be used to (e.g. plan mode, subagents, MCP client support)
I'm a housekeeper skeptic. While I concede that a professional housekeeper would probably do a better job than me on most domestic tasks, I still think everyone should clean their own home, cook their own dinner, and write their own code.
For me the distinction is that your rice only needs to be edible once, while your code may need to last for decades. Using AI to code anything I could comfortably throw away if needed is a lot less fraught than letting it make choices that I and anybody who inherits the code is gonna have to live with, especially if by outsourcing those choices I reduce my understanding of the implications of those choices.
I don't let the AI make any choices. I have a lot of instructions and sample code for it to follow. It is basically a glorified code generator at that point.
I'm not getting it. OP said they are wary of letting the agent make choices for them, and outsourcing those choices lessens their understanding of them. They could interrogate the agent on why those choices were made until they have sufficient understanding, and they can also change the solution if they want to.
I think the idea that code should last decades is now questionable, if not problematic. If we can now produce code at 10x the rate, that means we can have 10x more code (probably not desirable) or we can have 10x as many revisions. Whoever inherits the code can have it rewritten to their liking and understanding. Nothing helps better in understanding a system than to rebuild it, even if just by handholding an LLM.
Exactly, but if I start from working code with a lot of tests I don't need to remember the requirements. I just need to know my current requirement and figure out the ones I'm changing with my new requirement. It doesn't catch everything, but in most cases if I break some other requirement I find out about it and can figure out just that one more requirement and not the millions of others that still work.
The thing about this is that you can choose how high level you go.
For example you can just tell it to make a website for a business with a webshop and it'll just generate thousands of lines of code and you have no control over anything. Or you can spend hours/days writing the specification and then have it generate it.
Or you can do what I do and work iteratively one feature at a time making sure everything is exactly the way you want it. I generally solve the problem myself then tell it what to do, or if I'm not sure what the best solution is I might discuss with the AI until we agree on a plan and then have it execute it. Often this leads to me learning useful things, like it will suggest a tool/feature that I didn't know about that's perfect for my usecase or it will identify a problem in my plan that I wouldn't have found until after spending hours on the implementation.
I've always been very detail oriented and I care a lot about code quality, I want my solutions to be clean, consistent and as simple as possible while solving the problem. To me, AI tools let me do that more quickly and better, it's not a compromise it's just flat out better in every dimension. It's about how you use it.
A lot of people seem to think that it's a binary choice, either hand craft a high quality bespoke solution or just vibe code a pile of trash. There's a whole spectrum in between those two, and I think there's a sweet spot where you still maintain control and understanding, it's just much faster and the result is actually better because it's not just you and the knowledge in your brain it's also the AI that practically knows everything - it will teach you things and suggest solutions you wouldn't have thought about, it makes you a better developer. It's a force multiplier and the smarter you are the better you will be at using it.
It's not a replacement it's an enhancement. It's like imagine a developer with Google vs one without, obviously the one with Google will be better because they have access to more information. The AI is like automatic google that just googles everything all the time, things you wouldn't have even thought to Google or things you couldn't possibly formulate a good search term for. With AI you can just show it a screenshot or describe an issue in detail and get a really solid answer a lot of the time. It's like having an expert on standby all the time, sure it's sometimes wrong but most of the time it's not and if you're smart you'll recognize when it isn't.
I'd say anyone who isn't using AI today isn't using their full potential. I don't see how anyone could possibly perform better without this tool than with it. I do see how someone who doesn't care could produce a lot of slop, but the people who refuse to use it aren't that guy. That guy has been using it to produce slop for years already. You can use it to produce top quality code if you choose to.
It means that even if it works for certain tasks, I think that the problems caused by use of LLMs outweigh their benefits. I think it's a bad idea to generate large piles of code that you don't understand, but due to competitive pressures, it's too tempting for people to pass up, leading to a world in which software is getting worse by the day, while pumping CO2 into the atmosphere and boiling scarce water supplies to do so, DDOSing websites to scrape the data, and polluting the internet with mountains of slop.
This isn't about using rice cookers or not, that's a personal choice for how you cook your food, and choosing to do so or not really only affects the person cooking and cleaning. A rice cooker probably uses a similar amount of energy as cooking it by hand, possibly even less.
But when people using LLMs are causing active harm, and are making it more difficult to collaborate on a team, it's a lot harder to accept that it's just a personal preference.
If you wanted to use the rice cooker analogy, imagine if rice cookers let you cook rice in just one minute. Faster, don't have to wait for the rice to be done, great! But in order to do so, you have to cook 50 pounds of rice, but throw out the majority of it, and use a thousand kilowatt hours of energy to do so. You'd better believe I'm going to be skeptical of everyone deciding that they suddenly have to use these 1-minute rice cookers that burn so much energy and generate so much waste.
Much more complex than that. Even if it does give you a speedup at certain tasks, is it worth the cost and risks? You go faster, but now you have more code that you don't understand and so won't be as good at maintaining. There's the energy use, the water use, the scrapers destroying the internet, the massive piles of slop, the hallucinations and bullshit, etc.
Haven't used for actual coding but was testing locally - for example running some swebench instances - whether qwen-3.6-35b-a3b@Q8 was better than qwen-3.5-122b-a10b@Q4. With MTP the former runs at around 55t/s and the latter at around 30t/s meaning the latter is also usable. It looked like qwen-3.5-122b-a10b@Q4 performed a bit better.
Open-source data coverage: The released datasets cover an estimated 8–10T tokens
(~40–50% of the internal 25T blend). Missing categories include code (~14% of blend),
nemotron-cc-code (~2%), crawl++ (~2%), and academic text (~2%). Users should
supplement with their own data for these categories and adjust train_iters
accordingly.
Nemotron is the strongest model (on most benchmarks) that has its full training pipeline and most of the data open. Olmo 3 from AllenAI, and K2 Think V2 from Mohamed bin Zayed University of Artificial Intelligence are both fully open, but not as capable as the Nemotron family. Granite has much of the training pipeline and data open, but is missing some of each.
I tried this with the original comment in the thread. Guaranteed to not be in the corpus, references a few terms that also wouldn't be in the corpus (Claude Fable), and long enough to be more than a sentence or two while short enough to compare in a discussion like this.
I did this with entirely local models I have sitting around on my laptop. Minimax M2.7 at a 3 bit quant with 8 bit quantized KV cache for English -> French, Gemma 4 31B QAT (4 bit quant) MTP for French -> English.
It's perfectly readable, but there are a few places where the phrasing is a bit more awkward after the double translation ("auditing" to "revision" in particular is a bit off). Gemma did comment on not knowing what Claude Fable was in its thought process: "The author compares Ellsworth's translation with one produced by "Claude Fable" (likely a misspelling of "Claude" or a specific version of Claude)."
Here's the double translation:
"I have no doubt that a writer is better at translating than AI, but I must say that AI translation has become so good that I'm not sure how much longer the profession of translation will exist—or rather, it may become more a matter of revision.
"For example, I just read Lawrence Ellsworth's translation of The Three Musketeers, which I enjoyed immensely. I neither speak nor read French, but from what I understand, Ellsworth's translation is considered one of the most faithful translations of the work.
"Out of curiosity, I asked Claude Fable to translate the original French version of The Three Musketeers; I asked it to translate faithfully, but also to try to maintain the same playful tone as the original and to censor nothing.
"Once it was finished, I didn't read the entire result, but I compared a few individual chapters between Ellsworth's translation and Fable's.
"They were honestly remarkably similar. As far as I can tell, nothing was substantially different between Ellsworth's translation and Fable's. I think the prose in Ellsworth's translation was slightly better, but Fable's was actually perfectly readable. Again, I don't speak French, so I can't say for certain, but I don't believe I would have had a significantly different experience if I had read Fable's version instead of Ellsworth's.
"It is possible (and probable) that this is partly a self-fulfilling prophecy; Fable may have been trained using Ellsworth's translation and can therefore draw directly from it. Unfortunately, since I don't speak any language other than English, there is a sort of vicious circle: the only way to compare the fidelity of a translation is to compare it to other translations, but if other translations already exist, that will likely influence the results, and if a translation doesn't exist yet, I have no way of verifying it.
"I am going to continue reading Ellsworth's translations for the following stories simply because it feels more canonical to me, and as I said, I think the prose was slightly better."
Olmo 3 claims that if they paid market rates for their training, it would have cost $2.75m. It was trained by a non-profit and probably had some of the compute donated, hence why they have to estimate.
From the blog, it looks like there hasn't been much progress for a few months, but if you check their HF it looks like they have a series of 32B models trained on top of Qwen3 32B with different numbers of training examples that they've uploaded a few days ago: https://huggingface.co/collections/open-thoughts/openthinker...
So looks a little bit more research oriented than intended for production use, but still neat to see this effort.
You're right, there are probably lots of sites misconfigured to not respect language headers, but we don't notice because English is the default.
However, the right solution is still to use the language header. I send that to them, they should use it to give me the right one by default.
One of the funny things is that this whole site is in an iframe, which breaks both Google Translate and the Firefox translate feature. If you check, the outer iframe seems to indicate `lang="en"` and loads the iframe with `src="/coder/index.html?lang=en"`, but the inner iframe still gets a `lang="zh-CN"` by default until you use the toggle.
If you go to the eventual redirect source of the page with `lang=en` parameter, you get a `lang="en"` attribute, but it's still in Chinese until you toggle it with the menu: https://mimo.xiaomi.com/coder?lang=en
Anyhow, yeah, lots of pages are probably broken this way but we don't notice. But still, it has that info from your request, it should use it.
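For reference, honoring the language header on the server side is not much code. This is a minimal sketch of picking a language from an Accept-Language value. It handles q-values and falls back from a regional tag to its primary language, but it ignores the `*` wildcard and other details of the full matching rules. The function name, the supported list, and the default are all hypothetical and say nothing about how that site is actually built.

```python
def pick_language(accept_language, supported, default="en"):
    """Choose the best supported language for an Accept-Language header value."""
    ranked = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        ranked.append((quality, tag.strip().lower()))
    # Stable sort keeps header order among equal q-values.
    ranked.sort(key=lambda item: -item[0])

    by_lower = {s.lower(): s for s in supported}
    for quality, tag in ranked:
        if quality <= 0:
            continue
        if tag in by_lower:
            return by_lower[tag]
        primary = tag.split("-")[0]
        for low, original in by_lower.items():
            if low.split("-")[0] == primary:
                return original
    return default


print(pick_language("en-US,en;q=0.9,zh-CN;q=0.5", ["zh-CN", "en"]))
print(pick_language("fr", ["zh-CN", "en"], default="zh-CN"))
```

An explicit `?lang=` parameter or a user toggle can still override the result. The header just gives a sensible default.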
Anthropic has been releasing models named Opus since 2024 with Claude 3 Opus.
Opus has gotten vastly more capable since then.
Local models far surpass Opus 3. They even surpass Opus 4 on most benchmarks.
Sure, if you compare to the latest Opus 4.8 or even 4.6, they're not there yet. But there's a huge difference in performance between 4 and 4.8.