> We already live in the world where hackers are pwning refrigerators, I can't wait for prompt injection attacks on animatronic cartoon characters.
It's not necessarily AI controlling the communication. Disney has long had 'puppet' characters whose communication is controlled by a human behind the scenes.
They're already using similar tech for the Mickey meet and greets and the Galaxy's Edge stormtroopers. The details aren't public, but it seems to be a mix of complex dialogue trees with interrupts or context switches, controlled in real time by the actor or operator.
It's not even complex, just some pre-recorded lines that the character can trigger via finger movements. Once you watch them do it, it becomes very obvious.
That's interesting; if you're doing human in the loop, I would have thought it'd be easier to just do voice swapping. Or did the technology not quite line up?
Someone linked the Defunctland video essay on these characters elsewhere in this thread; I highly recommend watching it, since it goes into this in detail.
But the main reason is that there's a lot of brand image on the line with these interactions; someone putting on a voice, or using a voice changer, could make a mistake. Disney instead has a conversation tree with pre-recorded voice lines that a remote operator can control. Much harder to mess up.
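To illustrate the idea, here's a minimal sketch of an operator-driven dialogue tree. This is purely hypothetical (the node names, clip files, and button mapping are made up, not Disney's actual system); the point is that the operator only ever selects from pre-approved recordings, so nothing off-script can happen:

```python
# Hypothetical sketch: every utterance is a pre-recorded, pre-approved clip;
# the hidden operator just picks which branch to play next.

DIALOGUE_TREE = {
    "greeting": {
        "clip": "greeting_01.wav",
        "next": {"button_1": "ask_name", "button_2": "joke", "button_3": "goodbye"},
    },
    "ask_name": {
        "clip": "whats_your_name.wav",
        "next": {"button_1": "joke", "button_2": "goodbye"},
    },
    "joke": {
        "clip": "joke_03.wav",
        "next": {"button_1": "goodbye"},
    },
    "goodbye": {"clip": "see_ya_real_soon.wav", "next": {}},
}

def run(play_clip, read_operator_input):
    """Walk the tree, playing clips chosen by the operator behind the scenes."""
    node = "greeting"
    while node:
        play_clip(DIALOGUE_TREE[node]["clip"])
        options = DIALOGUE_TREE[node]["next"]
        node = options.get(read_operator_input(options)) if options else None
```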
Maybe I'm not late enough in my career to understand what you're saying, but what kind of problems are you helping the business solve with code that hasn't been proven to work?
Sorry I wrote that hastily and my wording seems to have caused much confusion. Here's a rewrite:
> The job is to help the business solve a problem, not just to ship code. In cases where delivering code actually makes sense, then yeah you should absolutely be able to prove it works and meets the requirements like the OP says. But there are plenty of cases where writing code at all is the wrong solution, and that’s an important distinction I didn’t really understand until later in my career.
Although funnily enough, the meaning you interpreted also has its own merit. Like other commenters have mentioned, there's always a cost tradeoff to evaluate. Some projects can absolutely cut corners to, say, ship faster to validate some result or gain users.
Getting a big customer to pay for a product that your sales team said could do X, Y, and Z, when Y wasn't actually part of the product, and now you need some plausible semblance of Y added so that you can send an invoice. If it doesn't work, that can be addressed later.
My pet theory is that OpenAI screwed up the image normalization calculation and was stuck with the mistake since that's something that can't be worked around.
At the least, it's not present in these new images.
There's still something off in the grading, and I suspect they worked around it.
(although I get what you mean, not easily since you already trained)
I'm guessing when they get a clean slate we'll have Image 2 instead of 1.5. In LMArena it was immediately apparent it was an OpenAI model based on visuals.
There's a possibility that any automatic correction could have false positives (since the yellow tint doesn't happen 100% of the time), which creates different problems where an image could have an even weirder hue.
Yeah, though I can imagine a conversation like this:
SWE: "Seriously? import PIL \ read file \ == (c + 10%, m = m, y = y, k = k) \ save file done!"
Exec: "Yeah, and the first blogger that gets a hold of image #1 they generate starts saying 'Hey! This thing's been color corrected w/o AI! lol lame'"
Or not, no idea. I haven't understood the choice either, especially since fairly intelligent AI-driven auto-touch-up for lighting/color correction has been a thing for a while. It's just that, of the head-scratcher decisions I do eventually find an answer for, maybe 25% turn out to have a reasonable, if non-intuitive, explanation. Here? I haven't been able to figure one out yet, or find a reason/mention by someone who appears to have an inside line on it.
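For what it's worth, the "trivial fix" version really would only be a few lines. Here's a rough sketch with Pillow; the direction and amount of the shift are guesses (nobody outside OpenAI knows the actual bias), and a real fix would grade per image in a proper color space rather than applying a blanket offset:

```python
# Rough sketch of a naive post-hoc tint correction with Pillow.
# The channel offsets are illustrative guesses, not measured values.
from PIL import Image

def reduce_yellow_cast(path_in: str, path_out: str, strength: int = 10) -> None:
    img = Image.open(path_in).convert("RGB")
    r, g, b = img.split()
    # A yellow cast is roughly "too little blue", so nudge blue back up
    # and trim red slightly. Purely illustrative numbers.
    b = b.point(lambda v: min(255, v + strength))
    r = r.point(lambda v: max(0, v - strength // 2))
    Image.merge("RGB", (r, g, b)).save(path_out)
```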
Meta's codec avatars all have a green cast because they spent millions on the rig to capture whole bodies and even more on rolling it out to get loads of real data.
They forgot to calibrate the cameras, so everything had a green tint.
Meanwhile all the other teams had a billion Macbeth charts lying around just in case.
Also, you'd be shocked at how few developers know anything at all about sRGB (or any other gamut/encoding), other than perhaps the name. Even people working in graphics, writing 3D game engines, working on colorist or graphics artist tools and libraries.
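For reference, the part most people miss is that sRGB is not linear. A minimal sketch of the standard transfer function (constants from the sRGB spec, IEC 61966-2-1):

```python
# sRGB <-> linear-light conversion for a single channel value in [0, 1].
# Averaging, blending, or resizing should happen on linear values,
# not on the encoded sRGB values most code treats as "the color".

def srgb_to_linear(c: float) -> float:
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def linear_to_srgb(c: float) -> float:
    return c * 12.92 if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055
```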
Not really, but there's a number of theories. The simplest one is that they "style tuned" the AI on human preference data, and this introduced a subtle bias for yellow.
And I say "subtle" - but because that model would always "regenerate" an image when editing, it would introduce more and more of this yellow tint with each tweak or edit. Which has a way of making a "subtle" bias anything but.
That seems unlikely, as we didn't see anything like that with DALL-E, unless the autoregressive nature of gpt-image was somehow more influenced by it.
The Studio Ghibli craze started with the initial release of images in ChatGPT, and the yellow filter has always existed, even at that time. They did not make changes to the model as a result of RL (until potentially today, with a new model).
Yeah, money machine go brrrr is a great sign of "footprint"; let's just ignore millennia of inventions, technology, and other things coming from Europe before the US was even a colony. Texas GDP was $x millions last year, clearly a larger footprint on the world :)
It's actually pretty fun and interesting the different bubbles we all live in, for better or worse.
> After the war ended, Cecil Powell, a British physicist, continued the research in England using similar methods with more sensitive plates, detecting a new particle and winning him the Nobel Prize in 1950. Chowdhury and Bose’s work was acknowledged in his book, but their recognition quickly faded.
This post is a mess. The best advice is clear and specific, and this is neither.
The examples are at best loosely related to the points they're supposed to illustrate.
It's honestly so bad that I cynically suspect this post was created solely as a way to promote click3 in the first bullet, and then 4 more bullets were generated to make it a "whole" post.
1. Don't re-send info you've already sent (be resourceful)
2. Play to model strengths (e.g. generating an image of text vs generating text in an image, or coding/executing a string counter rather than counting a search string directly; see the sketch after this list)
3. Stay aware of declining accuracy as context window fills up
4. Don't ask for things it doesn't know (e.g. obscure topics or things outside the cutoff window)
5. Careful with vibe coding
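For point 2, the "write code instead of counting directly" trick just means asking the model to produce and run something like this, rather than answering from token-level intuition (a sketch, not from the original post):

```python
# Counting letters is trivial for code and surprisingly hard for an LLM
# doing it "in its head", since it sees tokens rather than characters.
def count_char(text: str, target: str) -> int:
    return sum(1 for ch in text.lower() if ch == target.lower())

print(count_char("strawberry", "r"))  # 3
```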
Yeah, there's not a lot that's actionable here... it mostly boils down to "try a lot of stuff yourself and find out what LLMs are good and bad at." Rs in strawberry, generating some specific text in Nano Banana, what it actually knows, etc. Don't do those specific things because obviously (?) models are bad at them.
Why? They just closed a $13B funding round. Entirely possible that they're selling below-cost to gain marketshare; on their current usage the cloud computing costs shouldn't be too bad, while the benefits of showing continued growth on their frontier models is great. Hell, for all we know they may have priced Opus 4.1 above cost to show positive unit economics to investors, and then drop the price of Opus 4.5 to spur growth so their market position looks better at the next round of funding.
Nobody subsidizes LLM APIs. There is a reason to subsidize free consumer offerings: those users are very sticky, and won't switch unless the alternative is much better.
There might be a reason to subsidize subscriptions, but only if your value is in the app rather than the model.
But for API use, the models are easily substituted, so market share is fleeting. The LLM interface being unstructured plain text makes it simpler to upgrade to a smarter model than it used to be to swap a library or upgrade to a new version of the JVM.
And there is no customer loyalty. Both the users and the middlemen will chase after the best price and performance. The only choice is at the Pareto frontier.
Likewise there is no other long-term gain from getting a short-term API user. You can't train or tune on their inputs, so there is no classic Search network effect either.
And it's not even just about the cost. Any compute they allocate to inference is compute they aren't allocating to training. There is a real opportunity cost there.
I guess your theory of Opus 4.1 having massive margins while Opus 4.5 has slim ones could work. But given how horrible Anthropic's capacity issues have been for much of the year, that seems unlikely as well. Unless the new Opus is actually cheaper to run, where are they getting the compute from for the massive usage spike that seems inevitable?
LLM APIs are more sticky than many other computing APIs. Much of the eng work is in the prompt engineering, and the prompt engineering is pretty specific to the particular LLM you're using. If you randomly swap out the API calls, you'll find you get significantly worse results, because you tuned your prompts to the particular LLM you were using.
It's much more akin to a programming language or platform than a typical data-access API, because the choice of LLM vendor means you build a lot of your future product development on the idiosyncrasies of their platform. When you switch, you have to redo much of that work.
No, LLMs really are not more sticky than traditional APIs. Normal APIs are unforgiving in their inputs and rigid in their outputs. No matter how hard you try, Hyrum's Law will get you over and over again. Every migration is an exercise in pain. LLMs are the ultimate adaptable, malleable tool. It doesn't matter if you'd carefully tuned your prompt against a specific six-month-old model. The new model of today is sufficiently smarter that it'll do a better job despite not having been tuned on those specific prompts.
This isn't even theory, we can observe the swings in practice on Openrouter.
If the value was in prompt engineering, people would stick to specific old versions of models, because a new version of a given model might as well be a totally different model. It will behave differently, and will need to be qualified again. But of course only few people stick with the obsolete models. How many applications do you think still use a model released a year ago?
A full migration is not always required these days.
It is possible to write adapters to API interfaces. Many proprietary APIs become de facto standards when competitors start shipping those compatibility layers out of the box to convince you theirs is a drop-in replacement. The S3 API is a good example: every major (and most minor) provider, with the glaring exception of Azure, supports it out of the box now. The PostgreSQL wire protocol is another; many databases support it these days.
In the LLM inference world, the OpenAI API spec is becoming that kind of de facto standard.
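As a concrete illustration, swapping an OpenAI-compatible provider is often little more than a different base URL and key. This is only a sketch: the endpoint and model id below are placeholders, not real services.

```python
# Sketch: pointing the OpenAI client at a different OpenAI-compatible vendor.
# "example-provider.com" and the model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # swap vendors here
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="some-provider-model",  # placeholder model id
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```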
There are always caveats of course, and switches rarely go without bumps. It depends on what you are using: stick to a few popular, widely supported features and you'll mostly be fine, but lean on some niche feature of the API that a given provider hasn't implemented properly and you will hit bugs.
In most cases, bugs in the traditional API world are relatively easy to solve, since they can be replicated and logged as exceptions.
In the LLM world there are few "right" answers for inference outputs, so it's a lot harder to catch and replicate bugs, and to fix them without breaking something else. You end up retuning all your workflows for the new model.
> But for API use, the models are easily substituted, so market share is fleeting. The LLM interface being unstructured plain text makes it simpler to upgrade to a smarter model than it used to be to swap a library or upgrade to a new version of the JVM.
Agree that the plain text interface (which enables extremely fast user adoption) also makes the product less sticky. I wonder if this is part of the incentive to push for specialized tool calling interfaces / MCP stuff: to engineer more lock-in by increasing the model-specific surface area.
Eh, I'm testing it now and it seems a bit too fast to be the same size: almost 2x the tokens per second and much lower time to first token.
There are other valid reasons why it might be faster, but being faster even while everyone's rushing to try it at launch, plus a cost decrease, leaves me inclined to believe it's a smaller model than past Opus models.
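For anyone who wants to eyeball this themselves, measuring time-to-first-token and throughput with a streaming call only takes a few lines. A sketch (the model id is a placeholder, and counting stream chunks only approximates tokens):

```python
# Sketch: rough TTFT / throughput measurement over a streaming
# OpenAI-compatible API. Chunk count only approximates token count.
import time
from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
first_token_time = None
chunks = 0

stream = client.chat.completions.create(
    model="some-model-id",  # placeholder model id
    messages=[{"role": "user", "content": "Write a short paragraph about otters."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        chunks += 1
end = time.perf_counter()

if first_token_time is not None:
    print(f"TTFT: {first_token_time - start:.2f}s")
    print(f"~{chunks / (end - first_token_time):.1f} chunks/s after first token")
```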
We already know distillation works pretty well, so it would definitely make sense for Opus 4.5 to be effectively smaller (like someone else said, it could be via MoE or some other technique too).
We know the big labs are chasing efficiency gains where they can.
It is considered valuable and worthwhile for a society to educate all of its children/citizens. This means we have to develop systems and techniques to educate all kinds of people, not just the ones who can be dropped off by themselves at a library when they turn five, and picked up again in fifteen years with a PhD.
Sure. People who are self-motivated are the ones who will benefit earliest. If a society values ensuring every single citizen gets a baseline education, it can figure out how to get an AI to persuade or trick people into learning better than a human could.
Sure, but the point is that if 5% of students are using AI then you have to assume that any work done outside classroom has used AI, because otherwise you're giving a massive advantage to the 5% of students who used AI, right?
> The students remain motivated to learn how to solve problems without AI because they know they will be evaluated without it in class later.
Learning how to prepare for in-class tests and writing exercises is a very particular skillset which I haven't really exercised a lot since I graduated.
Never mind teaching the humanities, for which I think this is a genuine crisis; in-class programming exams are basically the same thing as leetcode job interviews, and we all know what a bad proxy those are for "real" development work.
> in-class programming exams are basically the same thing as leetcode job interviews, and we all know what a bad proxy those are for "real" development work.
Confusing university learning for "real industry work" is a mistake and we've known it's a mistake for a while. We can have classes which teach what life in industry is like, but assuming that the role of university is to teach people how to fit directly into industry is mistaking the purpose of university and K-12 education as a whole.
Writing long-form prose and essays isn't something I've done in a long time, but I wouldn't say it was wasted effort. Long-form prose forces you to do things that you don't always do when writing emails and powerpoints, and I rely on those skills every day.
There's no mistake there for all the students looking at job listings that treat having a college degree as a hard prerequisite for even being employable.
Only if it's a proper heist. I don't need more guys just walking in and taking something like they're shoplifting a candy bar. I need guys meticulously planning and executing a theft that dodges the very latest in alarm and anti-theft technology.
Taking the discussion seriously, a case study of a well-planned heist that culminated in someone walking in at the right time and just taking the thing could actually be pretty interesting.
Right, all of these amateurs wanting to spend all this money on special glass cutting tools, rappelling equipment, bypassing alarms, or even some Ocean's 11 EMP ridiculousness when you just need a ~$10 tool and a big pair of brass ones to pull it off.