Hacker News | teiferer's comments

Agreed. But those things are not mutually exclusive.

> Code review doesn't scale to prolific humans

If that's genuinely your attitude then your org has a problem.

Code review is slow and less fun for the average sw eng. But for high quality work it's indispensable. So treat code reviews as a scarce resource. Optimize for code reviewer time and attention. Are your PRs the right size? Are they well described? Do you give context? Do they fit in the bigger story? Do you mix in unrelated drive-by fixes? How easy is it to deal with you once you have received comments? Do you address them promptly? Do you give your reviewers credit (if not praise) for their help? Do you give back by doing code reviews yourself with high quality feedback? There are a lot of things you can do to streamline things and give code reviews the place in a team's workflow that they deserve.


It's clear they consider code review a personal activity rather than a team activity, in the sense that they think "code review is a gate before my code can be merged" rather than "code review is a process where the team discusses, understands and improves the code".

And that's not rare in teams. Lots of teams and developers do code review wrong.

I even hear other people complain that I "block" their code review. I mean, if there are issues in your code, of course I am going to flag them, what do you think the purpose of code review is?


> Lots of teams and developers do code review wrong

In this sense, I'm not sure I've ever seen a team that does code review "right".

In the before times, most PR feedback was stylistic, with the occasional bug identified. Now that we have ubiquitous auto-formatters/linters/CI, most PR review falls into either "you misunderstood the spec", or "I disagree with your architectural choices" - and my personal feeling is that your process ought to catch both of those well before the PR stage.


> most PR feedback was stylistic, with the occasional bug identified.

I think that only speaks for your own experience. I have definitely seen more than a few PRs that needed significant work.


Yeah, that's fair. I have spent most of my career on high-pressure teams within FAANG, where we aggressively managed-out anyone who wasn't making the grade. And now in the startup world, we apply a very aggressive hiring bar.

I'm not sure how much I'd enjoy working on teams who were routinely producing PRs that were in bad shape.


This is such a weird take. From my 5 years at Amazon, the only people I saw "managed out" were engineers who were good, if not great, at the code part of their job, but trash at working with the team. Our hiring bar was notoriously high, and it wasn't uncommon for engineers who were leads at their startup to get hired at L5.

When I was Bar Raising for promotions, I didn't review their PRs, I reviewed their Reviews. I reviewed the PRs that mentioned those reviews to see what slipped by. I looked at non-crunch time to verify they were reviewing at least as much code as their teammates.

If I saw someone 4x-ing the amount of code, they had better be 4x-ing the reviews too... if all they were leaving was stylistic formatting comments, they'd never make it to L6, unless the only thing they were reviewing was L6 code.


How many teams did you see?

On your original claim, I have seen engineers put up 5x more PRs simply because they paid less attention to the quality or put less thought into each one of them.

I have seen people put up 5x more quality PRs too. But as long as they follow the good practice of doing a code review for every PR they put up (or 2 if you require 2 per PR), they got their stuff through quickly as well.

> your process ought to catch both of those well before the PR stage

We have multiple points where mistakes of any sort can be caught, and code review is one of them.

Yes, most architectural issues should be caught earlier, but some will only become evident in code: some by the dev themselves, others by reviewers.

This is only a problem if you mostly catch architecture issues at code review phase.


Not my experience. Especially for juniors, reviews were an excellent tool to learn and get mentored.

> Are your PRs the right size?

I’ve noticed that large PRs aren’t just a problem for human reviewers: they’re a problem for AI reviewers too.

If I submit a 100 line PR I’m likely to get some useful comments back from both humans and LLMs. In fact the LLM is likely to come back with so much feedback it gets down to the nitpicky/annoying level.

If I submit 1000+ lines in my PR, the humans either don’t have time and/or get scrolling blindness, and the AI reviewer is likely to give me a response that amounts to, “<<slaps roof>> Looks good to me bro: ship it!”

I guess they have a limited token budget for reviews so you can bamboozle them simply by blowing most or all of that budget.


The flip side of this tends to be that if 1,000 lines of code need to happen, filling the review queue up with 10 PRs of 100 lines each isn't exactly great either. The author spends a bunch of extra effort producing a raft of atomic PRs, and the reviewers get to context-switch a whole bunch (and may not end up with a clear picture of the feature end-to-end).

I think the ultimate answer to this is a stacked PR workflow (which we had at Meta), where I can cheaply maintain/review a 1,000 line PR as a stack of 10 incremental PRs. But unfortunately GitHub et al are still not quite there on this one.


I agree about how you can reciprocate for a good code review, but I'd just add that for me, code review is also fun — when done for a fellow human who I might be teaching.

It is definitely very grunt-like for an LLM.


Most orgs have a problem with quality unless it is enforced by government requirements for certifications and such.

Code reviews, documentation, static analysis, only retrieving deps from internal repos, unit tests, integration tests, ....

Especially in domains where shipping software is not the main product, and a plain cost center to the main business of physical goods.


> But for high quality work [a code review is] indispensable.

The argument here is that all code reviews are done with attention and care, but quality of a code review is highly dependent on the reviewer and the team’s review process, and in the real world the quality of reviews pretty much follows the same distribution curve as, say, agile project management: For the time invested in reviewing, a handful of teams get excellent utility from them, most teams get little benefit, and a sad few actually cause harm.

If most code reviews provide only a little benefit at base for most teams, recommending that most teams should also delay shipping quality work is going to sound a lot like bad advice.


I think your conclusion is upside down. Air safety is based on the "Swiss cheese" model. Multiple layers of safety nets are in place to compensate for issues in one layer. In particular, technical safeguards are there to prevent disasters if the human in the loop makes a mistake, which will eventually happen. Any weakening of any technical safeguard makes the system less safe. No matter if the human ultimately made a mistake -- the technical system failing contributed to the accident just as much.
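
To put rough numbers on the Swiss cheese point, here is a minimal sketch. It assumes the layers fail independently, which real systems only approximate, and the per-layer miss rates are invented purely for illustration.

```python
from math import prod

def p_accident(miss_rates):
    """Chance a fault slips through every layer, assuming independent layers."""
    return prod(miss_rates)

# Hypothetical miss rates for three layers (human, automation, oversight).
baseline = [0.1, 0.05, 0.02]
weakened = [0.1, 0.5, 0.02]  # one technical safeguard degraded tenfold

print(p_accident(baseline))
print(p_accident(weakened))
```

Under that independence assumption, degrading any single layer by some factor raises the overall accident probability by the same factor, regardless of which layer it was.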

My favorite example is the introduction of a speed limit on some accident-ridden stretch of the Autobahn north of Berlin. After introducing the speed limit, the accident numbers went down dramatically. What did the local administration decide? Remove the speed limit again -- because there were no accidents anymore!

Trouble is that it is not always true either. You can legitimately overreact and in hindsight it can be hard to distinguish between these two things.

Plus, even if you did overreact, that can still be the better side to have erred on, in moderation.


That should really be a cautionary tale for everybody accusing everyone of posting LLM-manufactured texts. Many people write like that. The self-censoring nowadays to try to avoid sounding like an LLM is really something we need to grow out of.

> wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.
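
A minimal sketch of what such a fuzzer does, assuming a toy structure (a hypothetical `SortedBag`, not the data structure from the article) checked against a trivially correct reference model. It shows why a passing run is evidence rather than proof: only the sampled operation sequences are ever checked.

```python
import bisect
import random

class SortedBag:
    """Hypothetical structure under test: a multiset kept in sorted order."""
    def __init__(self):
        self._items = []

    def add(self, x):
        bisect.insort(self._items, x)

    def remove(self, x):
        """Remove one occurrence of x; return False if x is absent."""
        i = bisect.bisect_left(self._items, x)
        if i < len(self._items) and self._items[i] == x:
            del self._items[i]
            return True
        return False

    def to_list(self):
        return list(self._items)

def fuzz(rounds=200, ops_per_round=50, seed=0):
    """Differential fuzzing: apply random ops to the bag and to a plain list."""
    rng = random.Random(seed)
    for _ in range(rounds):
        bag, model = SortedBag(), []
        for _ in range(ops_per_round):
            x = rng.randint(0, 9)
            if rng.random() < 0.6:
                bag.add(x)
                model.append(x)
            else:
                removed = bag.remove(x)
                assert removed == (x in model)
                if removed:
                    model.remove(x)
            # The bag must always match the sorted reference model.
            assert bag.to_list() == sorted(model)
    return rounds * ops_per_round  # number of operations checked
```

With the defaults this checks 10,000 operations, which is a vanishing fraction of all possible operation sequences. A machine-checked proof covers all of them.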


I’ll ask it for a formal proof when I get home and see how it goes.

I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.


Oh I actually mean machine checked. Indeed, formal pen-and-paper proofs can have flaws, since they are essentially code without test coverage.

In the real world, many of us don't have the time to create formal proofs. But our instinct in testing where edge cases may exist in code that we wrote is a type of refactoring that happens in our brains during the coding process. Hand the coding off to a machine and you have no idea where to start looking for the flaws.

> Hand the coding off to a machine and you have no idea where to start looking for the flaws.

I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.


Yes but it takes much longer to trace them. Because the LLM code almost always gravitates toward data blobs and highly dynamic objects and spaghetti that takes a ton of cognitive load to understand what their failure modes are. Even when it does document them.

> In the real world, many of us don't have the time to create formal proofs

Of course not. That's why they are so rare. But I thought we live in an AI era now where this kind of stuff can be done by a machine.


> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.


Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.


> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.


> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which no lab is doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam as one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.


Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

What does this have to do with what I’m saying? Of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know it is being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.


Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...


Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...

I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.

Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWEBench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment but idk alignment research that well to know if this is a big deal or not.

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.


It's not a limit scenario, is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, and benchmarks are not always static either; they can oftentimes be living benchmarks that update over time.

You are making a technical point. I am pointing out that while this is _technically_ possible for _some_ benchmarks, it's not possible for plenty of other benchmarks, and those all agree with the rest.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.


> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.


They don't have control over measurement. Consider also that it's easy to figure this out and it creates a scandal. Like I said, consider Llama 4, which a lot of people pointed out used a custom model in LMArena to inflate their scores; it's never clear what the true underlying story is, but regardless that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.


Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

Imagine unironically starting your comment with "Um" in 2026.

As opposed to your incredibly useful contribution to this thread, thanks!

You don't have to imagine!

ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
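
A rough sketch of that idea, with everything named here being hypothetical: `models` maps a name to any callable that takes a prompt and returns output (a stand-in for a real API call), and `score` is whatever grader you trust. The prompt is picked deterministically from the date so every model sees the same one on a given day.

```python
import hashlib
from datetime import date

def prompt_of_the_day(prompts, day):
    """Pick one prompt per calendar day, identically for every model."""
    digest = hashlib.sha256(day.isoformat().encode()).digest()
    return prompts[int.from_bytes(digest[:8], "big") % len(prompts)]

def compare(models, prompts, score, day=None):
    """Run the day's prompt through every model and return {name: score}."""
    day = day or date.today()
    prompt = prompt_of_the_day(prompts, day)
    return {name: score(prompt, run(prompt)) for name, run in models.items()}

# Stub models standing in for real API calls.
models = {"model_a": str.upper, "model_b": lambda p: p[:3]}
prompts = ["write a parser", "fix this bug", "explain this diff"]
results = compare(models, prompts, score=lambda p, out: len(out),
                  day=date(2026, 1, 1))
```

Rotating through a fixed pool only delays optimization against it, so in practice the pool would need fresh prompts added over time.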


You are literally describing a benchmark

100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!

Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.


> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.


> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration)

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)


It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.


Ah apologies, that's what I get for skim reading and kneejerk replying. I completely agree with you, undergrads are highly unlikely to know more about a subject than their professor (obviously there can always be exceptions).

A grad student is evaluated on how well they follow scientific procedures and communicate their results, and on whether they have a sufficiently broad knowledge foundation. All that can easily be verified by a professor in a related field since they are very experienced in all those things. They don't actually need to be experts in the specific narrow topic the student has become the world expert in.

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??


> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, and they are vulnerable to benchmaxxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.

But the point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.


Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
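
Even a personal benchmark can handle the blinding and ordering questions cheaply. A minimal sketch, assuming you already have one output per model for the same task; the function names and the `candidate_N` labels are made up for illustration.

```python
import random

def blind_trial(outputs, rng):
    """Shuffle model outputs and hide their names behind neutral labels.

    Returns (labeled, key): `labeled` is what the judge sees, `key` maps
    each label back to a model name and stays sealed until scoring is done.
    """
    names = list(outputs)
    rng.shuffle(names)  # also randomizes which output is judged first
    labels = [f"candidate_{i + 1}" for i in range(len(names))]
    labeled = {label: outputs[name] for label, name in zip(labels, names)}
    key = dict(zip(labels, names))
    return labeled, key

def unblind(scores, key):
    """Translate the judge's per-label scores back to model names."""
    return {key[label]: s for label, s in scores.items()}
```

The judge scores `labeled` without seeing `key`, and `unblind` is only called once all scores are recorded.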

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?


I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)


The only one I see that thinks it is claude other than claude itself is the GLM series.

I have screenshots of Deepseek V4 doing this too - in a non-Claude-Code harness.

Also MiMo...

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.


Don't you think this applies to LLMs too?

> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.


So .. where can we read about the results?

ugghh, benchmarks?

Benchmarks about the superior programming language?

You mean benchmarks about the programming language that produces the fastest code?

That is not really the same.


There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...


Yes, these are gut feelings. That said, I have lots of experience with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.


"Check your work for mistakes after the first draft" maybe :)

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

No, relative performance between Python and Java can absolutely be measured.

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"


How do you measure the performance of people? This is subjective and biased every time.

I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

ymmv


Yes, words matter.

My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

AV professionals always say "timecode" - timestamp is a programming term.

Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".


fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

1. It's exponentially better

2. yet, somehow, hand coding still isn't dead, at least for me


How many $ do you guys spend when your session runs for 30min? What's the total budget?

I just have a regular Claude subscription and keep within its usage limits

But isn't running Claude models for 30min expensive? Or is Claude Code not expensive?

I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though


It's included free with subscription plans until June 22. I get about 2 hours a day of usage through Claude Code until I hit my usage limit. I just use it for 2 hours then wait for the next day.

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.


I think (related to the threads below) properly running evals on the state-of-the-art models is likely outside the budget for most individuals. It's undoubtedly the right thing.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.


IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get the feels of which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than other, but I wouldn't write backend code with it.


> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
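A minimal sketch of how those three metrics could be tallied, assuming you already have a log of runs with a pass/fail against your own definition of done. The `Run` record, the field names, and the model names are hypothetical, and the tie-breaking order (completion first, then cost, then time) is just one reasonable reading of the rules above:

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One attempt by one model at one task (hypothetical record layout)."""
    model: str
    task: str
    finished: bool    # did the output meet the definition of done?
    cost_usd: float   # what the run cost
    seconds: float    # wall-clock time of the run


def summarize(runs):
    """Aggregate per-model completion rate, mean cost, and mean duration."""
    by_model = {}
    for run in runs:
        by_model.setdefault(run.model, []).append(run)

    summary = {}
    for model, model_runs in by_model.items():
        n = len(model_runs)
        summary[model] = {
            "completion_rate": sum(r.finished for r in model_runs) / n,
            "mean_cost_usd": sum(r.cost_usd for r in model_runs) / n,
            "mean_seconds": sum(r.seconds for r in model_runs) / n,
        }
    return summary


def rank(summary):
    """Order models: higher completion rate, then lower cost, then less time."""
    return sorted(
        summary,
        key=lambda m: (
            -summary[m]["completion_rate"],
            summary[m]["mean_cost_usd"],
            summary[m]["mean_seconds"],
        ),
    )
```

Feed it the same X tasks for every model, otherwise the averages are not comparable.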

The ugly truth is that since the GPT4.1 days, new model releases have shown diminishing returns. Context windows were increased, and reasoning steps help improve the usefulness of a user's prompt. That's it. Even those are UX improvements rather than huge breakthroughs.


"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?


> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.


The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5


The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.


> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.


Anthropic didn't give me early access to this model, shouldn't that bias me against it?

You kinda proved the point...

How?

If you're that easily biased then why trust your assessment?

Where did I say I was biased?

the hypothetical you presented above

It was a hypothetical. How does presenting a hypothetical equate to proving anyone's point here?

you implied that not being given early access could bias you in the other direction. Which in my opinion would demonstrate that you are easily biased. Which would then call into question any opinion you share about the subject.

Someone accused me of being biased in favor of model providers who give me early access, after I praised Fable's performance.

I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"

I was explicitly pointing out that their failure to give me early access had not, in this case, led to me reviewing their model poorly.

I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.


Don't feed the trolls Simon.

This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison


[flagged]


Sorry, this post gets me irrationally irritated and makes me want to shake you and shout.

That website is 95% not you, it's AI, and I feel that's causing you to way over-represent the value of it in your response here, or you're completely misunderstanding what the person you're responding to is asking. If you put all of your effort into that site, without AI, it would be infinitely more valuable and useful.

The person you responded to asked for specific things, including:

- objective, unbiased measurements, but all that page has is side by side visual comparison of outputs.

- their different generations, but all you included was the outputs

- details on the prompts and little things people are adding because they feel they need to, but you didn't include any of that

This is slop, it's the exact sort of self confirming fluffy AI stuff that other either inexperienced or over-invested-in-AI engineers will look at briefly, skim, see quick visual validation, and nod, noting down how much better Fable must be without getting any actual data.

Sorry, it's early, and maybe this is a misplaced rant, but the person you responded to specifically asked for precise, quantitative things precisely because everything else is fluffy slop like this, and people don't even recognise they're doing it any more.


Check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice and access to models.

Fable just got announced and I did a rush out article because people are curious. I released the post mere hours afterwards and it takes time to create the output, slice into videos, make a wordpress article on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix.

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.


I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.


There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast still possible.

My good lord Tezza. You still have a calm and composed response after that sort of insult being thrown at you. Haven't seen one this bad for quite some time on HN. I hope you have a great day.

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.


I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.

AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.

It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.


I found the website you ranted about interesting, comparing the quality of the visualization between the different models.

I don't think it was "a huge waste of time" or needed your rant.

You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.

What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.


This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.

I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.

I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.

The issue here I feel is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.


Oh boy. I see this so much.

> It reads like an unhinged rant about AI

> if one cannot express themselves civilly

It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?

> they have permission to insult someone's competence and work

If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.


You think it was civil when the comment started with:

> this post gets me irrationally irritated and makes me want to shake you and shout

Yes, criticism of my work would not generally be a personal insult.

However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.


> You think it was civil when the comment started with:

>> this post gets me irrationally irritated and makes me want to shake you and shout

Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?

> However, if you were to call my work 'slop'

And, as previously established, if you use AI, it's not your work.

> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level

...and those are still criticisms of your work, not yourself.

The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.


How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?

simonw's pelicans probably wouldn't get posted in response to a request for a more quantitative analysis.

You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.


It feels like hand written software will now be "bespoke"

artisanal, hand-crafted software.

Exactly. Sadly, it gets overlooked how much subsidies nuclear and even oil+gas have received over the years.

Nuclear energy wouldn't even be a thing without heavy govt subsidies. And it keeps needing subsidies. No nuclear plant is economical without subsidies. (The operators admit this themselves.) In contrast, the solar and wind industry is eventually carrying itself without subsidies. In many parts of the world that's already the case since tech and market have matured.


Not an exaggeration to say that oil and gas is the most subsidized enterprise in human history.

The total cost of the French nuclear program since the beginning was estimated at 228 billion euros at 2012 prices, including both research and construction costs.

By that time Germany had cumulatively poured around a trillion euros into green energy and still had coal power plants and 2x the CO2 per capita compared to France.

As of 2026, in Germany 22.5% of electricity still comes from coal and CO2 per capita is still 1.7x of France.

The hard numbers so far are extremely favorable towards nuclear. Roughly speaking you get 1.7x better results at 1/4 of the cost.
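For anyone who wants to check the "1/4 of the cost" part, here is the back-of-the-envelope arithmetic using only the figures quoted above. It is a rough sketch: the trillion is a round number, and the two totals are taken as stated without adjusting for what exactly each one covers.

```python
# Figures as quoted in this comment; both are rough, rounded totals.
FRANCE_NUCLEAR_COST_EUR = 228e9   # French nuclear program, 2012 prices
GERMANY_GREEN_COST_EUR = 1e12     # "around a trillion euros"
CO2_PER_CAPITA_RATIO = 1.7        # Germany relative to France, as quoted

# Share of Germany's outlay that France spent.
cost_ratio = FRANCE_NUCLEAR_COST_EUR / GERMANY_GREEN_COST_EUR

print(f"France spent {cost_ratio:.1%} of Germany's outlay")
print(f"Germany emits {CO2_PER_CAPITA_RATIO}x the CO2 per capita of France")
```

So the cost share comes out a bit under a quarter, which is where the rounded 1/4 comes from.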


Using German coal dependence as a baseline statistic to prove that nuclear is more economical than green energy is a remarkable stunt.

What's the stunt? France built reactors for modest cost and phased out coal.

Germany wasted enormous amounts of money on green energy and made very little progress.

Which country is more "green"? Which technology is more "green"?


I'd say you are comparing different things in different eras. When France built the majority of its nuclear plants, 70s-90s, Germany didn't do anything renewable to speak of but sunk billions into the dying coal industry. Unions and other worker movements put lots of pressure on policy makers to keep subsidizing coal long after it was remotely meaningful to do so, mainly to avoid coal workers getting unemployed and whole regions dying out. Would have been cheaper to just give all those workers a sizeable pension and kick-start other tech, nuclear or not.

When Germany started serious bets on wind and solar, after 2000, France didn't really add much nuclear anymore.

So, the comparison you are making just doesn't work well, no matter if one likes nuclear or not.


Well if France could build reactors in the 70s-90s, then Germany could have done it in 2000s, right? It's not an alien technology.

Instead they chose to pour money into wind and solar and ended up with:

* higher CO2 emissions

* higher consumer electricity prices

* all that for much higher implementation cost

The gap is massive. I think it's directly comparable: these are two neighboring EU countries, comparable in size, population and GDP. They made different choices which led to significantly different outcomes.


> Pretty sure it's all tax funded.

That's too simple of a statement. Sure, govt grants are involved in subsidies for installation and the loan interest. But that thing is then generating electricity, which is what saves them the money.

So it's not "all" tax funded. Some of it is the sun's energy, and that was the whole point.

