
Different network layer, no centralization, no authorities, DNS has nothing to do with making p2p connections, it's like the ballpark is not even in the same country

I'm so disappointed in this comment thread https://en.wikipedia.org/wiki/OSI_model

I've just learned about it, but my understanding is that Iroh is L7, compared to e.g. tailscale which is L3


That is correct. Iroh connections are at L6, individual protocols such as blobs or gossip are at L7.

From the OSI point of view, QUIC itself is a bit of a layering violation. It covers transport (L4: reliable ordered delivery, stream multiplexing, congestion control, ...), session (L5: connection establishment and lifecycle, path migration, ...) and presentation (L6: encryption).

And of course below that we have the ability to provide custom transports.

This was done intentionally in QUIC to provide more control. The application layer doesn't have to care about what goes on below, but for some advanced use cases it can know what's going on and even influence which path is being used.

QUIC/TLS being such a comprehensive and well tested package allows us to delegate a lot of the work and just add a tiny bit of logic to make it peer to peer.

Although delegate is not exactly right, since we ended up having to write our own QUIC implementation, noq, to support QUIC multipath...


Prompt matters. Obviously if you want another model's opinion you must generate from scratch using the same prompt and then you can try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities, and these then go through a panel of judges; only issues passing a certain threshold are fixed in the original response.

I'll share a revelation which vastly improved my results: tell judges to evaluate the truth and usefulness/should-be-fixed axes separately. Because inevitably, with a prompt that forces the model to find issues, you will end up with nitpicks. Plus the truth axis lets you better evaluate the issue-finder models for your use case.
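A minimal sketch of that two-axis filtering, in case it helps. Everything here is hypothetical (the `Verdict` and `Issue` types, the function names, the thresholds, the example issues); it only illustrates the idea of scoring truth and usefulness separately, not the actual pipeline.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Verdict:
    truth: float       # 0..1: is the reported issue factually correct?
    usefulness: float  # 0..1: is it worth fixing in the response?


@dataclass
class Issue:
    description: str
    severity: str            # assigned by the issue-finder model
    verdicts: list[Verdict]  # one verdict per judge on the panel


def issues_to_fix(issues, truth_min=0.7, usefulness_min=0.6):
    """Keep only issues that the panel finds both true and worth fixing."""
    selected = []
    for issue in issues:
        if not issue.verdicts:
            continue  # nothing to base a decision on
        truth = mean(v.truth for v in issue.verdicts)
        usefulness = mean(v.usefulness for v in issue.verdicts)
        if truth >= truth_min and usefulness >= usefulness_min:
            selected.append(issue)
    return selected


def finder_precision(issues, truth_min=0.7):
    """Share of judged issues that are true, regardless of usefulness.

    This is the part that lets you rate the issue-finder itself:
    a true nitpick still counts in its favour here.
    """
    judged = [i for i in issues if i.verdicts]
    if not judged:
        return 0.0
    true_count = sum(
        1 for i in judged if mean(v.truth for v in i.verdicts) >= truth_min
    )
    return true_count / len(judged)


# Made-up example: one real problem, one true nitpick, one hallucinated issue.
issues = [
    Issue("wrong etymology claim", "high", [Verdict(0.9, 0.9), Verdict(0.8, 1.0)]),
    Issue("comma style nitpick", "low", [Verdict(0.9, 0.1), Verdict(1.0, 0.2)]),
    Issue("hallucinated problem", "medium", [Verdict(0.1, 0.8), Verdict(0.2, 0.9)]),
]
for issue in issues_to_fix(issues):
    print(issue.description)
```

With a single combined score the nitpick and the hallucination tend to blur together; with two axes the nitpick is simply "true but not worth fixing".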

That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.

Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR, :exacto is not enough to get good repeatable results especially for open-weights models.


They cannot do it. Apart from all the practical, technical and talent reasons, it would still be exporting forbidden stuff.

The signal is clear enough though for the next Anthropic..


It's simple, marketing dominates everything. With attention being very expensive, appearance is what matters.

It doesn't matter if you write a fantastic library; nobody is gonna use it because they won't know about it. The one with a gif of the terminal (ffs) and a good page describing what it does will win (and, being the most popular one, it can even become better than your library because of the following, but that's not the point here).

It's everywhere: products, hiring, services. We have no network of trust (sigh), so we need to trust some heuristics based on shallow information. If somebody focuses on the shallow, he wins, because nobody can ever dive into everything.


But it sounds like FableFool so it has that going for it.

There is now a "Switch models when a message is flagged" option in /config which can be set to false, but I haven't had a chance to see what happens then: does it just stop, or what?

Session paused

Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more

   1. Switch to Opus 4.8
   2. Edit prompt and retry with Fable 5

Biology? Why?

they're worried about people creating bioweapons

"by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

In my fable adventures I caught it hallucinating something and stating it as a fact in the CLI twice. And it was something that I did not see opus do in such a way. Opus obviously stated many times things that it did not verify but guessed, but fable said something like "the probe showed that ..." when there was no probe, and it was not about some past events, it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other English model. Gemini used to be the king, but fable clearly was trained on a decent amount of it. It has a deep cultural understanding.


If you have some spare time, I'd be interested in knowing what kind of questions you use to test models on understanding of Chinese culture.

I'm creating hanzirama.com

I generate explanations for characters and words like so: https://hanzirama.com/character/%E6%9D%A5#explain

But I don't want to mislead learners and want to provide some cultural depth, so I have a whole sophisticated pipeline: multiple models generate the explanation, then multiple models look for issues in the explanation, each issue goes through the panel of judges (basically trying to squash down any hallucinations), it's fixed, and it goes through such cycles a few times over.

I've been at it for some months now, so I have dozens of different probes that I needed to evaluate prompts and method changes. Plus on some items I generated so many explanations through different means that I can tell a lot about a given model just by looking at one.

Plus I'm doing some statistics, so I see how e.g. when working as judges of issues some models correlate heavily with some others... Fun fact: during some testing runs, basically just testing providers, I stumbled upon qwen introducing itself as made by Google. And also Anthropic's Sonnet saying that it was made by OpenAI :)
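For the curious, a rough sketch of what "judges correlate with each other" could mean in practice: plain Pearson correlation between the scores two judges gave to the same list of issues. The judge names and scores are made up, and the real statistics may well be different.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a judge that always answers the same carries no signal
    return cov / sqrt(var_x * var_y)


def judge_correlations(scores):
    """Correlation for every pair of judges, keyed by (judge, judge)."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(sorted(scores), 2)
    }


# Hypothetical verdicts (1 = issue accepted, 0 = rejected) on five issues.
scores = {
    "judge_a": [1, 0, 1, 1, 0],
    "judge_b": [1, 0, 1, 1, 0],  # always agrees with judge_a
    "judge_c": [0, 1, 0, 0, 1],  # always disagrees with judge_a
}
for pair, r in judge_correlations(scores).items():
    print(pair, round(r, 2))
```

Two judges that correlate near 1.0 add little as separate votes on a panel, which is one reason to look at this at all.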

At this point all my evaluation frameworks and pipeline stuff is much bigger than the site itself. I'm having lots of fun though.


Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.
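To make the setup concrete, here is a small sketch of what such a registry could look like as data. The field names, the rendering format and the placeholder values are all assumptions for illustration; the actual registry in the repo may be structured differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailedAttempt:
    number: int
    mechanism: str       # the exact mechanism that failed
    regressed_test: str  # the test that regressed
    revert_sha: str      # commit the work was reverted to
    restart_from: str    # instruction to start from that frontier


def context_for_attempt(registry, attempt_number):
    """Render every earlier failure as text for the next attempt's context."""
    earlier = sorted(
        (a for a in registry if a.number < attempt_number),
        key=lambda a: a.number,
    )
    lines = []
    for a in earlier:
        lines.append(f"Attempt {a.number}: {a.mechanism}")
        lines.append(f"  regressed test: {a.regressed_test}")
        lines.append(f"  reverted to: {a.revert_sha}")
        lines.append(f"  next step: {a.restart_from}")
    return "\n".join(lines)


# Placeholder entries only; none of these values come from a real repo.
registry = [
    FailedAttempt(2, "<mechanism 2>", "<test 2>", "<sha 2>", "<instruction 2>"),
    FailedAttempt(1, "<mechanism 1>", "<test 1>", "<sha 1>", "<instruction 1>"),
    FailedAttempt(3, "<mechanism 3>", "<test 3>", "<sha 3>", "<instruction 3>"),
]
print(context_for_attempt(registry, 3))
```

Under this shape, attempt N always starts with entries 1 to N-1 in front of it, which is the sense in which both models had the same corpus.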

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked). Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure, but there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.


I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new opus sessions would almost always actually create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.


I kind of enjoy exploring black boxes, seeing how different inputs map to differences in outputs. It's kind of like hacking. The problem is, they keep altering the box.

The box is stochastic by design, and has an untraceable amount of complexity between its context and output by nature.

It may be fun to look at inputs and outputs, but it's not hackable and trying to map one into the other is more like astrology than a science.


It's copromancy. Picking through the clanker's doings in an attempt to predict the future.

Thanks, you taught me a new word today! https://en.wikipedia.org/wiki/Scatomancy

It feels like Greek mythology should have some metaphor for "apparently simple structure that is so complex it leads anybody that studies it into madness". But I can't think of any name to put there.

Maybe the idea of complexity is too modern.


No but you see, I have a system! /s

(I spent too long by the horse racing track)


The third sentence got to what my objection was going to be. It's fun trying to make the thing do what you want it to do! That's why many of us like computers. It's the randomness that sucks and makes the process unsatisfying.

That's just a slot machine

ROTFL
