I found it tripped in the most laughable situations by mere words that could be related in some way to hacking but are in common use in programming. I would have to go back, examine my prompt for a word that could be used in another context, and replace it with a synonym.
I got downgraded from Opus to Fable for asking why MDMA was not addictive in the same way Cocaine is, so yeah, the "guardrails" are clearly vibe-coded.
Just speculating, but I "feel" 4.7 was post-trained using more synthetic techniques. The way it writes, for one thing: its "personality" is less human and more like fatiguing AI slop.
You don't need to fry a model with RLAIF to get that "slop feel". The first iterations of "AI slop" were raw SFT+RLHF: all human input, all inhuman output.
That said, I completely agree that 4.7 was a pronounced "model personality" regression. Closer to ChatGPT, and I mean that as an insult. Yet to check whether 4.8 is better.
grok-4.1-fast is the number 2 model on this benchmark.
~~If you've used this model in real life to do any sort of programming, and have seen its output, you would know that there is something VERY wrong with your benchmark.~~
Edit: Oh sorry, I looked at the questions, I see this is also for SQL specifically. Interesting. Maybe they tuned that grok model for SQL. Cool site. I bookmarked it.
If we learned anything from the code leak, it is that they essentially do not know what is inside the black box of that 500k-line mass of code. So that's plausible.