davideisinger.com/static/archive/softwarecrisis-dev-7c7z9g.txt

   #[1]Out of the Software Crisis (Newsletter) [2]Out of the Software
   Crisis (Newsletter)

   [3]Out of the Software Crisis

   Bird flying logo [4]Newsletter [5]Book [6]AI Book [7]Archive [8]Author

Modern software quality, or why I think using language models for programming
is a bad idea

   By Baldur Bjarnason,
   May 30th, 2023

   This essay is based on a talk I gave at [9]Hakkavélin, a hackerspace in
   Reykjavík. I had a wonderful time presenting to a lovely crowd, full of
   inquisitive and critically-minded people. Their questions and the
   discussion afterwards led to a number of improvements and
   clarifications as I turned my notes into this letter. This resulted in
   a substantial expansion of this essay. Many of the expanded points,
   such as the ones surrounding language model security, come directly
   from these discussions.

   Many thanks to all of those who attended. The references for the
   presentation are also the references for this essay, which you can find
   all the way down in the footnotes section.

   The best way to support this newsletter or my blog is to buy one of my
   books, [10]The Intelligence Illusion: a practical guide to the business
   risks of Generative AI or [11]Out of the Software Crisis. Or, you can
   buy them both [12]as a bundle.

The software industry is very bad at software

   Here’s a true story. Names withheld to protect the innocent.

   A chain of stores here in Iceland recently upgraded their point-of-sale
   terminals to use new software.

   Disaster, obviously, ensued. The barcode scanner stopped working
   properly, leading customer to be either overcharged or undercharged.
   Everything was extremely slow. The terminals started to lock up
   regularly. The new invoice printer sucked. A process that had been
   working smoothly was now harder and took more time.

   The store, where my “informant” is a manager, deals with a lot of
   businesses, many of them stores. When they explain to their customers
   why everything is taking so long, their answer is generally the same:

   “Ah, software upgrade. The same happened to us when we upgraded our
   terminals.”

   This is the norm.

   The new software is worse in every way than what it’s replacing.
   Despite having a more cluttered UI, it seems to have omitted a bunch of
   important features. Despite being new and “optimised”, it’s
   considerably slower than what it’s replacing.

   This is also the norm.

   Switching costs are, more often than not, massive for business
   software, and purchases are not decided by anybody who actually uses
   it. The quality of the software disconnects from sales performance very
   quickly in a growing software company. The company ends up “owning” the
   customer and no longer has any incentive to improve the software. In
   fact, because adding features is a key marketing and sales tactic, the
   software development cycle becomes an act of intentional, controlled
   deterioration.

   Enormous engineering resources go into finding new ways to minimise the
   deterioration—witness Microsoft’s “ribbon menu”, a widget invented
   entirely to manage the feature escalation mandated by marketing.

   This is the norm.

   This has always been the norm, from the early days of software.

   The software industry is bad at software. Great at shipping features
   and selling software. Bad at the software itself.

Why I started researching “AI” for programming

   In most sectors of the software industry, sales performance and product
   quality are disconnected.

   By its nature software has enormous margins which further cushion it
   from the effect of delivering bad products.

   The objective impact of poor software quality on the bottom lines of
   companies like Microsoft, Google, Apple, Facebook, or the retail side
   of Amazon is a rounding error. The rest only need to deliver usable
   early versions, but once you have an established customer base and an
   experienced sales team, you can coast for a long, long time without
   improving your product in any meaningful way.

   You only need to show change. Improvements don’t sell, it’s freshness
   that moves product. It’s like store tomatoes. Needs to look good and be
   fresh. They’re only going to taste it after they’ve paid, so who cares
   about the actual quality.

   Uptime reliability is the only quality measurement with a real impact
   on ad revenue or the success of enterprise contracts, so that’s the
   only quality measurement that ultimately matters to them.

   Bugs, shoddy UX, poor accessibility—even when accessibility is required
   by law—are non-factors in modern software management, especially at
   larger software companies.

   The rest of us in the industry then copy their practices, and we mostly
   get away with it. Our margins may not be as enormous as Google’s, but
   they are still quite good compared to non-software industries.

   We have an industry that’s largely disconnected from the consequences
   of making bad products, which means that we have a lot of successful
   but bad products.

  The software crisis

   Research bears this out. I pointed out in my 2021 essay [13]Software
   Crisis 2.0 that very few non-trivial software projects are successful,
   even when your benchmarks are fundamentally conservative and short
   term.

   For example, the following table is from [14]a 2015 report by the
   Standish Group on their long term study in software project success:

            SUCCESSFUL CHALLENGED FAILED TOTAL
    Grand   6%         51%        43%    100%
    Large   11%        59%        30%    100%
    Medium  12%        62%        26%    100%
   Moderate 24%        64%        12%    100%
    Small   61%        32%        7%     100%

   The Chaos Report 2015 resolution by project size

   This is based on data that’s collected and anonymised from a number of
   organisations in a variety of industries. You’ll note that very few
   projects outright succeed. Most of them go over budget or don’t deliver
   the functionality they were supposed to. A frightening number of large
   projects outright fail to ship anything usable.

   In my book [15]Out of the Software Crisis, I expanded on this by
   pointing out that there are many classes and types of bugs and defects
   that we don’t measure at all, many of them catastrophic, which means
   that these estimates are conservative. Software project failure is
   substantially higher than commonly estimated, and success if much rarer
   than the numbers would indicate.

   The true percentage of large software projects that are genuinely
   successful in the long term—that don’t have any catastrophic bugs,
   don’t suffer from UX deterioration, don’t end up having core issues
   that degrade their business value—is probably closer to 1–3%.

  The management crisis

   We also have a management crisis.

   The methods of top-down-control taught to managers are
   counterproductive for software development.
     * Managers think design is about decoration when it’s the key to
       making software that generates value.
     * Trying to prevent projects that are likely to fail is harmful for
       your career, even if the potential failure is wide-ranging and
       potentially catastrophic.
     * When projects fail, it’s the critics who tried to prevent disaster
       who are blamed, not the people who ran it into the ground.
     * Supporting a project that is guaranteed to fail is likely to
       benefit your career, establish you as a “team player”, and protects
       you from harmful consequences when the project crashes.
     * Teams and staff management in the software industry commonly
       ignores every innovation and discovery in organisational
       psychology, management, and systems-thinking since the early
       sixties and operate mostly on management ideas that Henry Ford
       considered outdated in the 1920s.

   We are a mismanaged industry that habitually fails to deliver usable
   software that actually solves the problems it’s supposed to.

   Thus, [16]Weinberg’s Law:

     If builders built buildings the way programmers wrote programs, then
     the first woodpecker that came along would destroy civilization.

   It’s into this environment that “AI” software development tools appear.

   The punditry presented it as a revolutionary improvement in how we make
   software. It’s supposed to fix everything.

   —This time the silver bullet will work!

   Because, of course, we have had such a great track record with
   [17]silver bullets.

   So, I had to dive into it, research it, and figure out how it really
   worked. I needed to understand how generative AI works, as a system. I
   haven’t researched any single topic to this degree since I finished my
   PhD in 2006.

   This research led me to write my book [18]The Intelligence Illusion: a
   practical guide to the business risks of Generative AI. In it, I take a
   broader view and go over the risks I discovered that come with business
   use of generative AI.

   But, ultimately, all that work was to answer the one question that I
   was ultimately interested in:

  Is generative AI good or bad for software development?

   To even have a hope of answering this, we first need to define our
   terms, because the conclusion is likely to vary a lot depending on how
   you define “AI” or even "software development.

  A theory of software development as an inclusive system

   Software development is the entire system of creating, delivering, and
   using a software project, from idea to end-user.

   That includes the entire process on the development side—the idea,
   planning, management, design, collaboration, programming, testing,
   prototyping—as well as the value created by the system when it has been
   shipped and is being used.

   My model is that of [19]theory-building. From my essay on
   theory-building, which itself is an excerpt from [20]Out of the
   Software Crisis:

     Beyond that, software is a theory. It’s a theory about a particular
     solution to a problem. Like the proverbial garden, it is composed of
     a microscopic ecosystem of artefacts, each of whom has to be treated
     like a living thing. The gardener develops a sense of how the parts
     connect and affect each other, what makes them thrive, what kills
     them off, and how you prompt them to grow. The software project and
     its programmers are an indivisible and organic entity that our
     industry treats like a toy model made of easily replaceable lego
     blocks. They believe a software project and its developers can be
     broken apart and reassembled without dying.

     What keeps the software alive are the programmers who have an
     accurate mental model (theory) of how it is built and works. That
     mental model can only be learned by having worked on the project
     while it grew or by working alongside somebody who did, who can help
     you absorb the theory. Replace enough of the programmers, and their
     mental models become disconnected from the reality of the code, and
     the code dies. That dead code can only be replaced by new code that
     has been ‘grown’ by the current programmers.

   Design and user research is an integral part of the mental model the
   programmer needs to build, because none of the software components
   ultimately make sense without the end-user.

   But, design is also vital because it is, to reuse Donald G.
   Reinertsen’s definition from Managing the Design Factory (p. 11),
   design is economically useful information that generally only becomes
   useful information through validation of some sort. Otherwise it’s just
   a guess.

   The economic part usually comes from the end-user in some way.

   This systemic view is inclusive by design as you can’t accurately
   measure the productivity or quality of a software project unless you
   look at it end to end, from idea to end-user.
     * If it doesn’t work for the end-user, then it’s a failure.
     * If the management is dysfunctional, then the entire system is
       dysfunctional.
     * If you keep starting projects based on unworkable ideas, then your
       programmer productivity doesn’t matter.

   Lines of code isn’t software development. Working software,
   productively used, understood by the developers, is software
   development.

A high-level crash course in language models

   Language models, small or large, are today either used as autocomplete
   copilots or as chatbots. Some of these language model tools would be
   used by the developer, some by the manager or other staff.

   I’m treating generative media and image models as a separate topic,
   even when they’re used by people in the software industry to generate
   icons, graphics, or even UIs. They matter as well, but don’t have the
   same direct impact on software quality.

   To understand the role these systems could play in software
   development, we need a little bit more detail on what language models
   are, how they are made, and how they work.

   Most modern machine learning models are layered networks of parameters,
   each representing its connection to its neighbouring parameters. In a
   modern transformer-based language model most of these parameters are
   floating point numbers—weights—that describe the connection. Positive
   numbers are an excitatory connection. Negative numbers are inhibitory.

   These models are built by feeding data through a tokeniser that breaks
   text into tokens—often one word per token—that are ultimately fed into
   an algorithm. That algorithm constructs the network, node by node,
   layer by layer, based on the relationships it calculates between the
   tokens/words. This is done in several runs and, usually, the developer
   of the model will evaluate after each run that the model is progressing
   in the right direction, with some doing more thorough evaluation at
   specific checkpoints.

   The network is, in a very fundamental way, a mathematical derivation of
   the language in the data.

   A language model is constructed from the data. The transformer code
   regulates and guides the process, but the distributions within the data
   set are what defines the network.

   This process takes time—both collecting and managing the data set and
   the build process itself—which inevitably introduces a cut-off point
   for the data set. For OpenAI and Anthropic, that cut-off point is in
   2021. For Google’s PaLM2 it’s early 2023.
     __________________________________________________________________

  Aside: not a brain

   This is very, very different from how a biological neural network
   interacts with data. A biological brain is modified by input and
   data—its environment—but its construction is derived from nutrition,
   its chemical environment, and genetics.

   The data set, conversely, is a deep and fundamental part of the
   language model. The algorithm’s code provides the process while the
   weights themselves are derived from the data, and the model itself is
   dead and static during input and output.

   The construction process of a neural network is called “training”,
   which is yet another incredibly inaccurate term used by the industry.
     * A pregnant mother isn’t “training” the fetus.
     * A language model isn’t “trained” from the data, but constructed.

   This is nonsense.

   But this is the term that the AI industry uses, so we’re stuck with it.

   A language model is a mathematical model built as a derivation of its
   training data. There is no actual training, only construction.

   This is also why it’s inaccurate to say that these systems are inspired
   by their training data. Even though genes and nutrition make an
   artist’s mind they are not in what any reasonable person would call
   “their inspiration”. Even when they are sought out for study and
   genuine inspiration, it’s our representations of our understanding of
   the genes that are the true source of inspiration. Nobody sticks their
   hand in a gelatinous puddle of DNA and spontaneously gets inspired by
   the data it encodes.

   Training data are construction materials for a language models. A
   language model can never be inspired. It is itself a cultural artefact
   derived from other cultural artefacts.

   The machine learning process is loosely based on decades-old grossly
   simplified models of how brains work.

   A biological neuron is a complex system in its own right—one of the
   more complex cells in an animal’s body. In a living brain, a biological
   neuron will use electricity, multiple different classes of
   neurotransmitters, and timing to accomplish its function in ways that
   we still don’t fully understand. It even has its own [21]built-in
   engine for chemical energy.

   The brain as a whole is composed of not just a massive neural network,
   but also layers of hormonal chemical networks that dynamically modify
   its function, both granularly and as a whole.

   The digital neuron—a single signed floating point number—is to a
   biological neuron what a flat-head screwdriver is to a Tesla.

   They both contain metal and that’s about the extent of their
   similarity.

   The human brain contains roughly 100 billion neuron cells, a layered
   chemical network, and a cerebrovascular system that all integrate as a
   whole to create a functioning, self-aware system capable of general
   reasoning and autonomous behaviour. This system is multiple orders of
   magnitude more complex than even the largest language model to date,
   both in terms of individual neuron structure, and taken as a whole.

   It’s important to remember this so that we don’t fall for marketing
   claims that constantly imply that these tools are fully functioning
   assistants.
     __________________________________________________________________

  The prompt

   After all of this, we have a data set which can be used to generate
   text in response to prompts.

   Prompts such as:

     Who was the first man on the moon?

   The input phrase, or prompt, has no structure beyond the linguistic.
   It’s just a blob of text. You can’t give the model commands or
   parameters separately from other input. Because of this, if your model
   lets a third party enter text, an attacker will always be able to
   bypass whatever restrictions you put on it. Control prompts or prefixes
   will be discovered and countermanded. Delimiters don’t work.
   Fine-tuning the model only limits the harm, but doesn’t prevent it.

   This is called a prompt injection and what it means is that model input
   can’t be secured. You have to assume that anybody that can send text to
   the model has full access to it.

   Language models need to be treated like an unsecured client and only
   very carefully integrated into other systems.

  The response

   What you’re likely to get back from that prompt would be something
   like:

     On July 20, 1969, Neil Armstrong became the first human to step on
     the moon.

   This is NASA’s own phrasing. Most answers on the web are likely to be
   variations on this, so the answer from a language model is likely to be
   so too.
     * The moon landing happens to be a fact, but the language model only
       knows it as a text.

   The prompt we provided is strongly associated in the training data set
   with other sentences that are all variations of NASA’s phrasing of the
   answer. The model won’t answer with just “Neil Armstrong” because it
   isn’t actually answering the question, it’s responding with the text
   that correlates with the question. It doesn’t “know” anything.
     * The language model is fabricating a mathematically plausible
       response, based on word distributions in the training data.
     * There are no facts in a language model or its output. Only
       memorised text.

   It only fabricates. It’s all “hallucinations” all the way down.

   Occasionally those fabrications correlate with facts, but that is a
   mathematical quirk resulting from the fact that, on average, what
   people write roughly correlates with their understanding of a factual
   reality, which in turn roughly correlates with a factual reality.

  A knowledge system?

   To be able to answer that question and pass as a knowledge system, the
   model needs to memorise the answer, or at least parts of the phrase.

   Because “AI” vendors are performing a sleight-of-hand here and
   presenting statistical language synthesis engines as knowledge
   retrieval systems, their focus in training and testing is on “facts”
   and minimising “falsehoods”. The model has no notion of either, as it’s
   entirely a language model, so the only way to square this circle is for
   the model to memorise it all.
     * To be able to answer a question factually, not “hallucinate”, and
       pass as a knowledge system, the model needs to memorise the answer.
     * The model doesn’t know facts, only text.
     * If you want a fact from it, the model will need to memorise text
       that correlates with that fact.

  “Dr. AI”?

   Vendors then compound this by using human exams as benchmarks for
   reasoning performance. The problem is that bar exams, medical exams,
   and diagnosis tests are specifically designed to mostly test rote
   memorisation. That’s what they’re for.

   The human brain is bad at rote memorisation and generally it only
   happens with intensive work and practice. If you want to design a test
   that’s specifically intended to verify that somebody has spent a large
   amount of time studying a subject, you test for rote memorisation.

   Many other benchmarks they use, such as those related to programming
   languages also require memorisation, otherwise the systems would just
   constantly make up APIs.
     * Vendors use human exams as benchmarks.
     * These are specifically designed to test rote memorisation, because
       that’s hard for humans.
     * Programming benchmarks also require memorisation. Otherwise, you’d
       only get pseudocode.

   Between the tailoring of these systems for knowledge retrieval, and the
   use of rote memorisation exams and code generation as benchmarks, the
   tech industry has created systems where memorisation is a core part of
   how they function. In all research to date, memorisation has been key
   to language model performance in a range of benchmarks.^[22][1]

   If you’re familiar with storytelling devices, this here would be a
   [23]Chekhov’s gun. Observe! The gun is above the mantelpiece:

     👉🏻👉🏻 memorisation!

   Make a note of it, because those finger guns are going to be fired
   later.

  Biases

   Beyond question and answer, these systems are great at generating the
   averagely plausible text for a given prompt. In prose, current system
   output smells vaguely of sweaty-but-quiet LinkedIn desperation and
   over-enthusiastic social media. The general style will vary, but it’s
   always going to be the most plausible style and response based on the
   training data.

   One consequence of how these systems are made is that they are
   constantly backwards-facing. Where brains are focused on the present,
   often to their detriment, “AI” models are built using historical data.

   The training data encompasses thousands of diverse voices, styles,
   structures, and tones, but some word distributions will be more common
   in the set than others and those will end up dominating the output. As
   a result, language models tend to lean towards the “racist grandpa who
   has learned to speak fluent LinkedIn” end of the spectrum.^[24][2]

   This has implications for a whole host of use cases:
     * Generated text is going to skew conservative in content and
       marketing copy in structure and vocabulary. (Bigoted, prejudiced,
       but polite and inoffensively phrased.)
     * Even when the cut-off date for the data set is recent, it’s still
       going to skew historical because what’s new is also comparatively
       smaller than the old.
     * Language models will always skew towards the more common, middling,
       mediocre, and predictable.
     * Because most of these models are trained on the web, much of which
       is unhinged, violent, pornographic, and abusive, some of that
       language will be represented in the output.

  Modify, summarise, and “reason”

   The superpower that these systems provide is conversion or
   modification. They can, generally, take text and convert it to another
   style or structure. Take this note and turn it into a formal prose, and
   it will! That’s amazing. I don’t think that’s a trillion-dollar
   industry, but it’s a neat feature that will definitely be useful.

   They can summarise text too, but that’s much less reliable than you’d
   expect. It unsurprisingly works best with text that already provides
   its own summary, such as a newspaper article (first paragraphs always
   summarise the story), academic paper (the abstract), or corporate
   writing (executive summary). Anything that’s a mix of styles, voices,
   or has an unusual structure won’t work as well.

   What little reasoning they do is entirely based on finding through
   correlation and re-enacting prior textual descriptions of reasoning.
   They fail utterly when confronted with adversarial or novel examples.
   They also fail if you rephrase the question so that it no longer
   correlates with the phrasing in the data set.^[25][3]

   So, not actual reasoning. “Reasoning”, if you will. In other “AI” model
   genres these correlations are often called “shortcuts”, which feels
   apt.

   To summarise:
     * Language models are a mathematical expression of the training data
       set.
     * Have very little in common with human brains.
     * Rely on inputs that can’t be secured.
     * Lie. Everything they output is a fabrication.
     * Memorise heavily.
     * Great for modifying text. No sarcasm. Genuinely good at this.
     * Occasionally useful for summarisation if you don’t mind being lied
       to regularly.
     * Don’t actually reason.

Why I believe “AI” for programming is a bad idea

   If you recall from the start of this essay, I began my research into
   machine learning and language models because I was curious to see if
   they could help fix or improve the mess that is modern software
   development.

   There was reason to be hopeful. Programming languages are more uniform
   and structured than prose, so it’s not too unreasonable to expect that
   they might lend themselves to language models. Programming language
   output can often be tested directly, which might help with the
   evaluation of each training run.

   Training a language model on code also seems to benefit the model.
   Models that include substantial code in their data set tend to be
   better at correlative “reasoning” (to a point, still not actual
   reasoning), which makes sense since code is all about representing
   structured logic in text.

   But, there is an inherent [26]Catch 22 to any attempt at fixing
   software industry dysfunction with more software. The structure of the
   industry depends entirely on variables that everybody pretends are
   proxies for end user value, but generally aren’t. This will always tend
   to sabotage our efforts at industrial self-improvement.

   The more I studied language models as a technology the more flaws I
   found until it became clear to me that odds are that the overall effect
   on software development will be harmful. The problem starts with the
   models themselves.

  1. Language models can’t be secured

   This first issue has less to do with the use of language models for
   software development and more to do with their use in software
   products, which is likely to be a priority for many software companies
   over the next few years.

   Prompt injections are not a solved problem. OpenAI has come up with a
   few “solutions” in the past, but none of them actually worked.
   Everybody expects this to be fixed, but nobody has a clue how.

   Language models are fundamentally based on the idea that you give it
   text as input and get text as output. It’s entirely possible that the
   only way to completely fix this is to invent a completely new kind of
   language model and spend a few years training it from scratch.
     * A language model needs to be treated like an unsecured client. It’s
       about as secure as a web page form. It’s vulnerable to a new
       generation of injection vulnerabilities, both direct and indirect,
       that we still don’t quite understand.^[27][4]

   The training data set itself is also a security hazard. I’ve gone into
   this in more detail elsewhere^[28][5], but the short version is that
   training data set is vulnerable to keyword manipulation, both in terms
   of altering sentiment and censorship.

   Again, fully defending against this kind of attack would seem to
   require inventing a completely new kind of language model.

   Neither of these issues affect the use of language models for software
   development, but it does affect our work because we’re the ones who
   will be expected to integrate these systems into existing websites and
   products.

  2. It encourages the worst of our management and development practices

   A language model will never question, push back, doubt, hesitate, or
   waver.

   Your managers are going to use it to flesh out and describe unworkable
   ideas, and it won’t complain. The resulting spec won’t have any bearing
   with reality.

   People on your team will do “user research” by asking a language model,
   which it will do even though the resulting research will be fiction and
   entirely useless.

   It’ll let you implement the worst ideas ever in your code without
   protest. Ask a copilot “how can I roll my own cryptography?” and it’ll
   regurgitate a half-baked expression of sha1 in PHP for you.

   Think of all the times you’ve had an idea for an approach, looked up
   how to do it on the web, and found out that, no, this was a really bad
   idea? I have a couple of those every week when I’m in the middle of a
   project.

   Language models don’t deliver productivity improvements. They increase
   the volume, unchecked by reason.

   A core aspect of the theory-building model of software development is
   code that developers don’t understand is a liability. It means your
   mental model of the software is inaccurate which will lead you to
   create bugs as you modify it or add other components that interact with
   pieces you don’t understand.

   Language model tools for software development are specifically designed
   to create large volumes of code that the programmer doesn’t understand.
   They are liability engines for all but the most experienced developer.
   You can’t solve this problem by having the “AI” understand the codebase
   and how its various components interact with each other because a
   language model isn’t a mind. It can’t have a mental model of anything.
   It only works through correlation.

   These tools will indeed make you go faster, but it’s going to be
   accelerating in the wrong direction. That is objectively worse than
   just standing still.

  3. Its User Interfaces do not work, and we haven’t found interfaces that do
  work

   Human factors studies, the field responsible for designing cockpits and
   the like, discovered that humans suffer from an automation bias.

   What it means is that when you have cognitive automation—something that
   helps you think less—you inevitably think less. That means that you are
   less critical of the output than if you were doing it yourself. That’s
   potentially catastrophic when the output is code, especially since the
   quality of the generated code is, understandably considering how the
   system works, broadly on the level of a novice developer.^[29][6]

   Copilots and chatbots—exacerbated by anthropomorphism—seem to trigger
   our automation biases.

   Microsoft themselves have said that 40% of GitHub Copilot’s output is
   committed unchanged.^[30][7]

   Let’s not get into the question of how we, as an industry, put
   ourselves in the position where Microsoft can follow a line of code
   from their language model, through your text editor, and into your
   supposedly decentralised version control system.

   People overwhelmingly seem to trust the output of a language model.

   If it runs without errors, it must be fine.

   But that’s never the case. We all know this. We’ve all seen running
   code turn out to be buggy as hell. But something in our mind switches
   off when we use tools for cognitive automation.

  4. It’s biased towards the stale and popular

   The biases inherent in these language models are bad enough when it
   comes to prose, but they become a functional problem in code.
     * Its JS code will lean towards React and node, most of it several
       versions old, and away from the less popular corners of the JS
       ecosystem.
     * The code is, inevitably, more likely to be built around CommonJS
       modules instead of the modern ESM modules.
     * It won’t know much about Deno or Cloudflare Workers.
     * It’ll always prefer older APIs over new. Most of these models won’t
       know about any API or module released after 2021. This is going to
       be an issue for languages such as Swift.
     * New platforms and languages don’t exist to it.
     * Existing data will outweigh deprecations and security issues.
     * Popular but obsolete or outdated open source projects will always
       win out over the up-to-date equivalent.

   These systems live in the popular past, like the middle-aged man who
   doesn’t realise he isn’t the popular kid at school any more. Everything
   he thinks is cool is actually very much not cool. More the other thing.

   This is an issue for software because our industry is entirely
   structured around constant change. Software security hinges on it. All
   of our practices are based on constant march towards the new and fancy.
   We go from framework to framework to try and find the magic solution
   that will solve everything. In some cases language models might help
   push back against that, but it’ll also push back against all the very
   many changes that are necessary because the old stuff turned out to be
   broken.
     * The software industry is built on change.
     * Language models are built on a static past.

  5. No matter how the lawsuits go, this threatens the existence of free and
  open source software

   Many AI vendors are mired in lawsuits.^[31][8]

   These lawsuits all concentrate on the relationship between the training
   data set and the model and they do so from a variety of angles. Some
   are based on contract and licensing law. Others are claiming that the
   models violate fair use. It’s hard to predict how they will go. They
   might not all go the same way, as laws will vary across industries and
   jurisdictions.

   No matter the result, we’re likely to be facing a major decline in the
   free and open source ecosystem.
    1. All of these models are trained on open source code without payment
       or even acknowledgement, which is a major disincentive for
       contributors and maintainers. That large corporations might benefit
       from your code is a fixture of open source, but they do
       occasionally give back to the community.
    2. Language models—built on open source code—commonly replace that
       code. Instead of importing a module to do a thing, you prompt your
       Copilot. The code generated is almost certainly based on the open
       source module, at least partially, but it has been laundered
       through the language model, disconnecting the programmer from the
       community, recognition, and what little reward there was.

   Language models demotivate maintainers and drain away both resources
   and users. What you’re likely to be left with are those who are
   building core infrastructure or end-user software out of principle. The
   “free software” side of the community is more likely to survive than
   the rest. The Linux kernel, Gnome, KDE—that sort of thing.

   The “open source” ecosystem, especially that surrounding the web and
   node, is likely to be hit the hardest. The more driven the open source
   project was by its proximity to either an employed contributor or
   actively dependent business, the bigger the impact from a shift to
   language models will be.

   This is a serious problem for the software industry as arguably much of
   the economic value the industry has provided over the past decade comes
   from strip-mining open source and free software.

  6. Licence contamination

   Microsoft and Google don’t train their language models on their own
   code. GitHub’s Copilot isn’t trained on code from Microsoft’s office
   suite, even though many of its products are likely to be some of the
   largest React Native projects in existence. There aren’t many C++ code
   bases as big as Windows. Google’s repository is probably one of the
   biggest collection of python and java code you can find.

   They don’t seem to use it for training, but instead train on
   collections of open source code that contain both permissive and
   copyleft licences.

   Copyleft licences, if used, force you to release your own project under
   their licence. Many of them, even non-copyleft, have patent clauses,
   which is poison for quite a few employers. Even permissive licences
   require attribution, and you can absolutely get sued if you’re caught
   copying open source code without attribution.

   Remember our Chekhov’s gun?

     👉🏻👉🏻 memorisation!

   Well, 👉🏻👉🏻 pewpew!!!

   Turns out blindly copying open source code is problematic.
   Whodathunkit?

   These models all memorise a lot, and they tend to copy what they
   memorise into their output. [32]GitHub’s own numbers peg verbatim
   copies of code that’s at least 150 characters at 1%^[33][9], which is
   roughly the same, in terms of verbatim copying, as what you seem to get
   in other language models.

   For context, that means that if you use a language model for
   development, a copilot or chatbot, three or four times a day, you’re
   going to get a verbatim copy of open source code injected into your
   project about once a month. If every team member uses one, then
   multiply that by the size of the team.

   GitHub’s Copilot has a feature that lets you block verbatim copies.
   This obviously requires both a check, which slows the result down, and
   it will throw out a bunch of useful results, making the language model
   less useful. It’s already not as useful as it’s made out to be and
   pretty darn slow so many people are going to turn off the “please don’t
   plagiarise” checkbox.

   But even GitHub’s checks are insufficient. The keyword there is
   “verbatim”, because language models have a tendency to rephrase their
   output. If GitHub Copilot copies a GPLed implementation of an algorithm
   into your project but changes all the variable names, Copilot won’t
   detect it, it’ll still be plagiarism and the copied code is still under
   the GPL. This isn’t unlikely as this is how language models work.
   Memorisation and then copying with light rephrasing is what they do.

   Training the system only on permissively licensed code doesn’t solve
   the problem. It won’t force your project to adopt an MIT licence or
   anything like that, but you can still be sued if it’s discovered.

   This would seem to give Microsoft and GitHub a good reason not to train
   on the Office code base, for example. If they did, there’s a good
   chance that a prompt to generate DOCX parsing code might “generate” a
   verbatim copy of the DOCX parsing code from Microsoft Word.

   And they can’t have that, can they? This would both undercut their own
   strategic advantage, and it would break the illusion that these systems
   are generating novel code from scratch.

   This should make it clear that what they’re actually doing is
   strip-mine the free and open source software ecosystem.

  How much of a problem is this?

   —It won’t matter. I won’t get caught.

   You personally won’t get caught, but your employer might, and
   Intellectual Property scans or similar code audits tend to come up at
   the absolute worst moments in the history of any given organisation:
     * During due diligence for an acquisition. Could cost the company and
       managers a fortune.
     * In discovery for an unrelated lawsuit. Again, could cost the
       company a fortune.
     * During hacks and other security incidents. Could. Cost. A. Fortune.

   “AI” vendors won’t take any responsibility for this risk. I doubt your
   business insurance covers “automated language model plagiarism”
   lawsuits.

   Language models for software development are a lawsuit waiting to
   happen.

   Unless they are completely reinvented from scratch, language model code
   generators are, in my opinion, unsuitable for anything except for
   prototypes and throwaway projects.

So, obviously, everybody’s going to use them

     * All the potentially bad stuff happens later. Unlikely to affect
       your bonuses or employment.
     * It’ll be years before the first licence contamination lawsuits
       happen.
     * Most employees will be long gone before anybody realises just how
       much of a bad idea it was.
     * But you’ll still get that nice “AI” bump in the stock market.

   What all of these problems have in common is that their impact is
   delayed and most of them will only appear in the form of increased
   frequency of bugs and other defects and general project chaos.

   The biggest issue, licence contamination, will likely take years before
   it starts to hit the industry, and is likely to be mitigated by virtue
   of the fact that many of the heaviest users of “AI”-generated code will
   have folded due to general mismanagement long before anybody cares
   enough to check their code.

   If you were ever wondering if we, as an industry, were capable of
   coming up with a systemic issue to rival the Y2K bug in scale and
   stupidity? Well, here you go.

   You can start using a language model, get the stock market bump,
   present the short term increase in volume as productivity, and be long
   gone before anybody connects the dots between language model use and
   the jump in defects.

   Even if you purposefully tried to come up with a technology that played
   directly into and magnified the software industry’s dysfunctions you
   wouldn’t be able to come up with anything as perfectly imperfect as
   these language models.

   It’s nonsense without consequence.

   Counterproductive novelty that you can indulge in without harming your
   career.

   It might even do your career some good. Show that you’re embracing the
   future.

   But…

  The best is yet to come

   In a few years’ time, once the effects of the “AI” bubble finally
   dissipates…

   Somebody’s going to get paid to fix the crap it left behind.

   The best way to support this newsletter or my blog is to buy one of my
   books, [34]The Intelligence Illusion: a practical guide to the business
   risks of Generative AI or [35]Out of the Software Crisis. Or, you can
   buy them both [36]as a bundle.
     __________________________________________________________________

    1. There’s quite a bit of papers that either highlight the tendency to
       memorise or demonstrate a strong relationship between that tendency
       and eventual performance.
          + [37]An Empirical Study of Memorization in NLP (Zheng & Jiang,
            ACL 2022)
          + [38]Does learning require memorization? a short tale about a
            long tail. (Feldman, 2020)
          + [39]When is memorization of irrelevant training data necessary
            for high-accuracy learning? (Brown et al. 2021)
          + [40]What Neural Networks Memorize and Why: Discovering the
            Long Tail via Influence Estimation (Feldman & Zhang, 2020)
          + [41]Question and Answer Test-Train Overlap in Open-Domain
            Question Answering Datasets (Lewis et al., EACL 2021)
          + [42]Quantifying Memorization Across Neural Language Models
            (Carlini et al. 2022)
          + [43]On Training Sample Memorization: Lessons from Benchmarking
            Generative Modeling with a Large-scale Competition (Bai et al.
            2021)
       [44]↩︎
    2. See the [45]Bias & Safety card at [46]needtoknow.fyi for
       references. [47]↩︎
    3. See the [48]Shortcut “Reasoning” card at [49]needtoknow.fyi for
       references. [50]↩︎
    4. Simon Willison has been covering this issue [51]in a series of blog
       posts. [52]↩︎
    5.
          + [53]The poisoning of ChatGPT
          + [54]Google Bard is a glorious reinvention of black-hat SEO
            spam and keyword-stuffing
       [55]↩︎
    6. See, for example:
          + [56]Asleep at the Keyboard? Assessing the Security of GitHub
            Copilot’s Code Contributions (Hammond Pearce et al., December
            2021)
          + [57]Do Users Write More Insecure Code with AI Assistants?
            (Neil Perry et al., December 2022)
       [58]↩︎
    7. This came out [59]during an investor event and was presented as
       evidence of the high quality of Copilot’s output. [60]↩︎
    8.
          + [61]Getty Images v. Stability AI - Complaint
          + [62]Getty Images is suing the creators of AI art tool Stable
            Diffusion for scraping its content
          + [63]The Wave of AI Lawsuits Have Begun
          + [64]Copyright lawsuits pose a serious threat to generative AI
          + [65]GitHub Copilot litigation
          + [66]Stable Diffusion litigation
       [67]↩︎
    9. Archived link of the [68]GitHub Copilot feature page. [69]↩︎

Join the Newsletter

   Subscribe to the Out of the Software Crisis newsletter to get my weekly
   (at least) essays on how to avoid or get out of software development
   crises.

   Join now and get a free PDF of three bonus essays from Out of the
   Software Crisis.

   ____________________
   (BUTTON)
   Subscribe

   We respect your privacy.

   Unsubscribe at any time.

   [70]Mastodon [71]Twitter [72]GitHub [73]Feed

References

   1. https://softwarecrisis.dev/index.xml
   2. https://softwarecrisis.dev/feed.json
   3. https://softwarecrisis.dev/
   4. https://softwarecrisis.dev/
   5. https://softwarecrisis.baldurbjarnason.com/
   6. https://illusion.baldurbjarnason.com/
   7. https://softwarecrisis.dev/archive/
   8. https://softwarecrisis.dev/author/
   9. https://www.hakkavelin.is/
  10. https://illusion.baldurbjarnason.com/
  11. https://softwarecrisis.baldurbjarnason.com/
  12. https://baldurbjarnason.lemonsqueezy.com/checkout/buy/cfc2f2c6-34af-436f-91c1-cb2e47283c40
  13. https://www.baldurbjarnason.com/2021/software-crisis-2/
  14. https://standishgroup.com/sample_research_files/CHAOSReport2015-Final.pdf
  15. https://softwarecrisis.baldurbjarnason.com/
  16. https://quoteinvestigator.com/2019/09/19/woodpecker/
  17. http://worrydream.com/refs/Brooks-NoSilverBullet.pdf
  18. https://illusion.baldurbjarnason.com/
  19. https://www.baldurbjarnason.com/2022/theory-building/
  20. https://softwarecrisis.baldurbjarnason.com/
  21. https://en.wikipedia.org/wiki/Mitochondrion
  22. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn1
  23. https://en.wikipedia.org/wiki/Chekhov's_gun
  24. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn2
  25. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn3
  26. https://en.wikipedia.org/wiki/Catch-22_(logic)
  27. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn4
  28. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn5
  29. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn6
  30. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn7
  31. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn8
  32. https://archive.ph/2023.01.11-224507/https://github.com/features/copilot#selection-19063.298-19063.462:~:text=Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set.
  33. https://softwarecrisis.dev/letters/ai-and-software-quality/#fn9
  34. https://illusion.baldurbjarnason.com/
  35. https://softwarecrisis.baldurbjarnason.com/
  36. https://baldurbjarnason.lemonsqueezy.com/checkout/buy/cfc2f2c6-34af-436f-91c1-cb2e47283c40
  37. https://aclanthology.org/2022.acl-long.434
  38. https://doi.org/10.1145/3357713.3384290
  39. https://doi.org/10.1145/3406325.3451131
  40. https://papers.nips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html
  41. https://aclanthology.org/2021.eacl-main.86
  42. https://arxiv.org/abs/2202.07646
  43. https://dl.acm.org/doi/10.1145/3447548.3467198
  44. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref1
  45. https://needtoknow.fyi/card/bias/
  46. https://needtoknow.fyi/
  47. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref2
  48. https://needtoknow.fyi/card/shortcut-reasoning/
  49. https://needtoknow.fyi/
  50. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref3
  51. https://simonwillison.net/series/prompt-injection/
  52. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref4
  53. https://softwarecrisis.dev/letters/the-poisoning-of-chatgpt/
  54. https://softwarecrisis.dev/letters/google-bard-seo/
  55. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref5
  56. https://doi.org/10.48550/arXiv.2108.09293
  57. https://doi.org/10.48550/arXiv.2211.03622
  58. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref6
  59. https://www.microsoft.com/en-us/Investor/events/FY-2023/Morgan-Stanley-TMT-Conference#:~:text=Scott Guthrie: I think you're,is now AI-generated and unmodified
  60. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref7
  61. https://copyrightlately.com/pdfviewer/getty-images-v-stability-ai-complaint/?auto_viewer=true#page=&zoom=auto&pagemode=none
  62. https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit
  63. https://www.plagiarismtoday.com/2023/01/17/the-wave-of-ai-lawsuits-have-begun/
  64. https://www.understandingai.org/p/copyright-lawsuits-pose-a-serious
  65. https://githubcopilotlitigation.com/
  66. https://stablediffusionlitigation.com/
  67. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref8
  68. https://archive.ph/2023.01.11-224507/https://github.com/features/copilot#selection-19063.298-19063.462:~:text=Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set.
  69. https://softwarecrisis.dev/letters/ai-and-software-quality/#fnref9
  70. https://toot.cafe/@baldur
  71. https://twitter.com/fakebaldur
  72. https://github.com/baldurbjarnason
  73. https://softwarecrisis.dev/feed.xml