

LLMs have made it really clear that some older concepts actually grouped together things that were distinct. Not so long ago, chess was thought to be uniquely human, until it wasn’t, and language was thought to imply intelligence behind it, until it wasn’t.
So let’s separate out some concerns and ask what exactly we mean by engineering. To me, engineering means solving a problem: for someone, for myself, for theory, whatever. Why we want to solve the problem, what the solution should be, and how we build it have often blurred together. Now, AI can supply the how in abundance. Too much abundance, even. So humans should move up the stack: focus on what problem to solve and why we want to solve it, then go into detail describing what that solution looks like, for example by mocking up a UI in Figma or writing a few sentences on how a user would actually do the thing. Then hand that off to the AI once you think it’s sufficiently defined.
The author misses an important step in the engineering loop, though. Plans almost always involve hidden assumptions and undefined or underdefined behavior that implementation uncovers. Even more so with AI: you can’t just throw a plan over the wall and expect good results. The humans need to come back, figure out what was underdefined or not actually what they wanted, and update the plan. People can ‘imagine’ rotating an apple in their head, but most of them will fail utterly if asked to draw it; they’re holding the idea of rotating an apple, not actually rotating the apple, and application forces them to confront the difference.
The last 6 to 12 months of open models have pretty clearly shown that you can get substantially better results at the same model size, or the same results at a smaller size: e.g. Llama 3.1 405B being basically equal to Llama 3.3 70B, or R1-0528 being substantially better than R1. The little information available about GPT-5 suggests it uses a mixture of experts and dynamic routing to different models, both of which can reduce computation cost dramatically. Additionally, simplifying the model catalogue from 9ish(?) models to 3, combined with their enormous traffic, will mean higher utilization of batched runs. Fuller batches run more efficiently on a per-query basis.
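To make those two cost levers concrete, here’s a back-of-envelope sketch in Python. Every number in it is invented for illustration; nothing is publicly known about GPT-5’s actual expert counts, overheads, or batch sizes:

    # Illustrative arithmetic only; all constants below are assumptions,
    # not anything known about GPT-5.

    # Mixture of experts: each token is routed to a few experts, so only
    # a fraction of the total weights do work per token.
    total_experts = 64    # experts per MoE layer (assumed)
    active_experts = 4    # top-k experts routed per token (assumed)
    print(f"MoE compute per token: "
          f"{active_experts / total_experts:.1%} of an equally large dense model")

    # Batching: fixed per-batch overhead (e.g. streaming weights from HBM)
    # is amortized across every query in the batch.
    fixed_cost_per_batch = 100.0  # arbitrary units (assumed)
    per_query_compute = 1.0       # the actual token math for one query (assumed)
    for batch_size in (1, 8, 64):
        cost = fixed_cost_per_batch / batch_size + per_query_compute
        print(f"batch={batch_size:3d}: cost per query = {cost:7.2f}")

Same hardware, same model, but the per-query cost collapses as the batch fills, which is exactly the advantage a consolidated model catalogue with enormous traffic gets.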
Basically, they can’t know for sure.