In 2008, investment banks saw tremendous losses caused by defaults on mortgages. It wasn't just the mortgages themselves: investment companies had bundled and repackaged mortgage loans into securities and sold those securities to other investors. Demand for these mortgage-backed securities was high (they paid good interest), and that demand spurred demand for mortgages, which in turn pushed banks to offer (and originate) mortgages to a large number of people, including many to whom they would normally not lend. The problem came when interest rates rose: payments on the many adjustable-rate mortgages increased, and many borrowers could not afford the higher payments. They defaulted on the loans, which triggered failures through the entire chain of investments.
The end products, the mortgage-backed securities, were supposedly top quality. The mortgages upon which they were based were not; the investment bankers had convinced themselves that a combination of mixed-grade mortgages could support a top-grade investment product.
It was a system that worked, until it didn't.
What does this have to do with AI? Keep in mind the notion of building top-grade products from a composite of mixed-grade products.
AI -- at least AI for programming -- works by building a large dataset of programs and then using that dataset to generate requested programs. The results are, in a sense, averages of certain selected items in the provided data (the "training data").
The quality of the output depends on the quality of the input. If I train an AI model on a large set of incorrect programs, the results will match those flawed programs. By training on large sets of programs, AI providers are betting on the "knowledge of the masses"; they assume that a very large collection of programs will be mostly correct. Scanning open source repositories is a common way to build such datasets. Companies with large datasets of their own (such as Microsoft) can use those private datasets for training an AI model.
I think that averaging to correctness works for most requests, but not necessarily for all requests.
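That "knowledge of the masses" intuition can be sketched with a toy example (not a real AI model; the functions and the majority-vote consensus here are purely illustrative): when most candidate implementations in a collection are correct, aggregating across them recovers the right behavior, but with only one or two samples there is no "mass" to average over.

```python
from collections import Counter

def add_ok(a, b):          # a correct implementation
    return a + b

def add_off_by_one(a, b):  # a flawed implementation
    return a + b + 1

def consensus(implementations, a, b):
    """Return the most common output across a set of implementations."""
    outputs = Counter(impl(a, b) for impl in implementations)
    return outputs.most_common(1)[0][0]

many = [add_ok, add_ok, add_ok, add_off_by_one]  # mostly correct: flaw averages out
few  = [add_off_by_one]                          # a lone, flawed sample: nothing to average against

print(consensus(many, 2, 3))  # 5 -- the correct answer wins the vote
print(consensus(few, 2, 3))   # 6 -- the single flawed sample wins by default
```

Real models blend training examples far more subtly than a vote, of course, but the dependence on having many mostly-correct samples is the same.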
I expect that simpler code is more plentiful in code repositories, and complex, domain-specific code is less common. We can see lots and lots of "hello, world" programs, in almost any programming language. We can see lots of simple classes for a customer address (again, in almost any programming language).
We don't see lots of code for obscure applications, or very large applications. There are few publicly available applications to run oil rigs, for example. Or large, multinational accounting systems. Or perhaps even control software for a consumer-grade microwave oven.
There may be a few large, complex programs available in AI training data. But a few (or one) does not draw on "the knowledge of the masses". It is not averaging a large set of mostly-right code into correct code.
Here we can see the parallel between AI for coding and the mortgage securities industry. The latter built (what it thought were) top-grade investment products from mixed-grade mortgages. The former is building (what it and its users think is) quality code from mixed-grade existing code.
But I won't be surprised to learn that AI coding models work for small, simple code and fail for large, complex code.
In other words, AI coding works -- until it doesn't.