Files
davideisinger.com/static/archive/softwarecrisis-dev-7c7z9g.txt
2023-07-02 22:55:35 -04:00

1070 lines
51 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
#[1]Out of the Software Crisis (Newsletter) [2]Out of the Software
Crisis (Newsletter)
[3]Out of the Software Crisis
Bird flying logo [4]Newsletter [5]Book [6]AI Book [7]Archive [8]Author
Modern software quality, or why I think using language models for programming is
a bad idea
By Baldur Bjarnason
This essay is based on a talk I gave at [9]Hakkavélin, a hackerspace in
Reykjavík. I had a wonderful time presenting to a lovely crowd, full of
inquisitive and critically-minded people. Their questions and the
discussion afterwards led to a number of improvements and
clarifications as I turned my notes into this letter. This resulted in
a substantial expansion of this essay. Many of the expanded points,
such as the ones surrounding language model security, come directly
from these discussions.
Many thanks to all of those who attended. The references for the
presentation are also the references for this essay, which you can find
all the way down in the footnotes section.
The best way to support this newsletter or my blog is to buy one of my
books, [10]The Intelligence Illusion: a practical guide to the business
risks of Generative AI or [11]Out of the Software Crisis. Or, you can
buy them both [12]as a bundle.
The software industry is very bad at software
Heres a true story. Names withheld to protect the innocent.
A chain of stores here in Iceland recently upgraded their point-of-sale
terminals to use new software.
Disaster, obviously, ensued. The barcode scanner stopped working
properly, leading customer to be either overcharged or undercharged.
Everything was extremely slow. The terminals started to lock up
regularly. The new invoice printer sucked. A process that had been
working smoothly was now harder and took more time.
The store, where my “informant” is a manager, deals with a lot of
businesses, many of them stores. When they explain to their customers
why everything is taking so long, their answer is generally the same:
Ah, software upgrade. The same happened to us when we upgraded our
terminals.
This is the norm.
The new software is worse in every way than what its replacing.
Despite having a more cluttered UI, it seems to have omitted a bunch of
important features. Despite being new and “optimised”, its
considerably slower than what its replacing.
This is also the norm.
Switching costs are, more often than not, massive for business
software, and purchases are not decided by anybody who actually uses
it. The quality of the software disconnects from sales performance very
quickly in a growing software company. The company ends up “owning” the
customer and no longer has any incentive to improve the software. In
fact, because adding features is a key marketing and sales tactic, the
software development cycle becomes an act of intentional, controlled
deterioration.
Enormous engineering resources go into finding new ways to minimise the
deterioration—witness Microsofts “ribbon menu”, a widget invented
entirely to manage the feature escalation mandated by marketing.
This is the norm.
This has always been the norm, from the early days of software.
The software industry is bad at software. Great at shipping features
and selling software. Bad at the software itself.
Why I started researching “AI” for programming
In most sectors of the software industry, sales performance and product
quality are disconnected.
By its nature software has enormous margins which further cushion it
from the effect of delivering bad products.
The objective impact of poor software quality on the bottom lines of
companies like Microsoft, Google, Apple, Facebook, or the retail side
of Amazon is a rounding error. The rest only need to deliver usable
early versions, but once you have an established customer base and an
experienced sales team, you can coast for a long, long time without
improving your product in any meaningful way.
You only need to show change. Improvements dont sell, its freshness
that moves product. Its like store tomatoes. Needs to look good and be
fresh. Theyre only going to taste it after theyve paid, so who cares
about the actual quality.
Uptime reliability is the only quality measurement with a real impact
on ad revenue or the success of enterprise contracts, so thats the
only quality measurement that ultimately matters to them.
Bugs, shoddy UX, poor accessibility—even when accessibility is required
by law—are non-factors in modern software management, especially at
larger software companies.
The rest of us in the industry then copy their practices, and we mostly
get away with it. Our margins may not be as enormous as Googles, but
they are still quite good compared to non-software industries.
We have an industry thats largely disconnected from the consequences
of making bad products, which means that we have a lot of successful
but bad products.
The software crisis
Research bears this out. I pointed out in my 2021 essay [13]Software
Crisis 2.0 that very few non-trivial software projects are successful,
even when your benchmarks are fundamentally conservative and short
term.
For example, the following table is from [14]a 2015 report by the
Standish Group on their long term study in software project success:
SUCCESSFUL CHALLENGED FAILED TOTAL
Grand 6% 51% 43% 100%
Large 11% 59% 30% 100%
Medium 12% 62% 26% 100%
Moderate 24% 64% 12% 100%
Small 61% 32% 7% 100%
The Chaos Report 2015 resolution by project size
This is based on data thats collected and anonymised from a number of
organisations in a variety of industries. Youll note that very few
projects outright succeed. Most of them go over budget or dont deliver
the functionality they were supposed to. A frightening number of large
projects outright fail to ship anything usable.
In my book [15]Out of the Software Crisis, I expanded on this by
pointing out that there are many classes and types of bugs and defects
that we dont measure at all, many of them catastrophic, which means
that these estimates are conservative. Software project failure is
substantially higher than commonly estimated, and success if much rarer
than the numbers would indicate.
The true percentage of large software projects that are genuinely
successful in the long term—that dont have any catastrophic bugs,
dont suffer from UX deterioration, dont end up having core issues
that degrade their business value—is probably closer to 13%.
The management crisis
We also have a management crisis.
The methods of top-down-control taught to managers are
counterproductive for software development.
* Managers think design is about decoration when its the key to
making software that generates value.
* Trying to prevent projects that are likely to fail is harmful for
your career, even if the potential failure is wide-ranging and
potentially catastrophic.
* When projects fail, its the critics who tried to prevent disaster
who are blamed, not the people who ran it into the ground.
* Supporting a project that is guaranteed to fail is likely to
benefit your career, establish you as a “team player”, and protects
you from harmful consequences when the project crashes.
* Teams and staff management in the software industry commonly
ignores every innovation and discovery in organisational
psychology, management, and systems-thinking since the early
sixties and operate mostly on management ideas that Henry Ford
considered outdated in the 1920s.
We are a mismanaged industry that habitually fails to deliver usable
software that actually solves the problems its supposed to.
Thus, [16]Weinbergs Law:
If builders built buildings the way programmers wrote programs, then
the first woodpecker that came along would destroy civilization.
Its into this environment that “AI” software development tools appear.
The punditry presented it as a revolutionary improvement in how we make
software. Its supposed to fix everything.
—This time the silver bullet will work!
Because, of course, we have had such a great track record with
[17]silver bullets.
So, I had to dive into it, research it, and figure out how it really
worked. I needed to understand how generative AI works, as a system. I
havent researched any single topic to this degree since I finished my
PhD in 2006.
This research led me to write my book [18]The Intelligence Illusion: a
practical guide to the business risks of Generative AI. In it, I take a
broader view and go over the risks I discovered that come with business
use of generative AI.
But, ultimately, all that work was to answer the one question that I
was ultimately interested in:
Is generative AI good or bad for software development?
To even have a hope of answering this, we first need to define our
terms, because the conclusion is likely to vary a lot depending on how
you define “AI” or even "software development.
A theory of software development as an inclusive system
Software development is the entire system of creating, delivering, and
using a software project, from idea to end-user.
That includes the entire process on the development side—the idea,
planning, management, design, collaboration, programming, testing,
prototyping—as well as the value created by the system when it has been
shipped and is being used.
My model is that of [19]theory-building. From my essay on
theory-building, which itself is an excerpt from [20]Out of the
Software Crisis:
Beyond that, software is a theory. Its a theory about a particular
solution to a problem. Like the proverbial garden, it is composed of
a microscopic ecosystem of artefacts, each of whom has to be treated
like a living thing. The gardener develops a sense of how the parts
connect and affect each other, what makes them thrive, what kills
them off, and how you prompt them to grow. The software project and
its programmers are an indivisible and organic entity that our
industry treats like a toy model made of easily replaceable lego
blocks. They believe a software project and its developers can be
broken apart and reassembled without dying.
What keeps the software alive are the programmers who have an
accurate mental model (theory) of how it is built and works. That
mental model can only be learned by having worked on the project
while it grew or by working alongside somebody who did, who can help
you absorb the theory. Replace enough of the programmers, and their
mental models become disconnected from the reality of the code, and
the code dies. That dead code can only be replaced by new code that
has been grown by the current programmers.
Design and user research is an integral part of the mental model the
programmer needs to build, because none of the software components
ultimately make sense without the end-user.
But, design is also vital because it is, to reuse Donald G.
Reinertsens definition from Managing the Design Factory (p. 11),
design is economically useful information that generally only becomes
useful information through validation of some sort. Otherwise its just
a guess.
The economic part usually comes from the end-user in some way.
This systemic view is inclusive by design as you cant accurately
measure the productivity or quality of a software project unless you
look at it end to end, from idea to end-user.
* If it doesnt work for the end-user, then its a failure.
* If the management is dysfunctional, then the entire system is
dysfunctional.
* If you keep starting projects based on unworkable ideas, then your
programmer productivity doesnt matter.
Lines of code isnt software development. Working software,
productively used, understood by the developers, is software
development.
A high-level crash course in language models
Language models, small or large, are today either used as autocomplete
copilots or as chatbots. Some of these language model tools would be
used by the developer, some by the manager or other staff.
Im treating generative media and image models as a separate topic,
even when theyre used by people in the software industry to generate
icons, graphics, or even UIs. They matter as well, but dont have the
same direct impact on software quality.
To understand the role these systems could play in software
development, we need a little bit more detail on what language models
are, how they are made, and how they work.
Most modern machine learning models are layered networks of parameters,
each representing its connection to its neighbouring parameters. In a
modern transformer-based language model most of these parameters are
floating point numbers—weights—that describe the connection. Positive
numbers are an excitatory connection. Negative numbers are inhibitory.
These models are built by feeding data through a tokeniser that breaks
text into tokens—often one word per token—that are ultimately fed into
an algorithm. That algorithm constructs the network, node by node,
layer by layer, based on the relationships it calculates between the
tokens/words. This is done in several runs and, usually, the developer
of the model will evaluate after each run that the model is progressing
in the right direction, with some doing more thorough evaluation at
specific checkpoints.
The network is, in a very fundamental way, a mathematical derivation of
the language in the data.
A language model is constructed from the data. The transformer code
regulates and guides the process, but the distributions within the data
set are what defines the network.
This process takes time—both collecting and managing the data set and
the build process itself—which inevitably introduces a cut-off point
for the data set. For OpenAI and Anthropic, that cut-off point is in
2021. For Googles PaLM2 its early 2023.
__________________________________________________________________
Aside: not a brain
This is very, very different from how a biological neural network
interacts with data. A biological brain is modified by input and
data—its environment—but its construction is derived from nutrition,
its chemical environment, and genetics.
The data set, conversely, is a deep and fundamental part of the
language model. The algorithms code provides the process while the
weights themselves are derived from the data, and the model itself is
dead and static during input and output.
The construction process of a neural network is called “training”,
which is yet another incredibly inaccurate term used by the industry.
* A pregnant mother isnt “training” the fetus.
* A language model isnt “trained” from the data, but constructed.
This is nonsense.
But this is the term that the AI industry uses, so were stuck with it.
A language model is a mathematical model built as a derivation of its
training data. There is no actual training, only construction.
This is also why its inaccurate to say that these systems are inspired
by their training data. Even though genes and nutrition make an
artists mind they are not in what any reasonable person would call
“their inspiration”. Even when they are sought out for study and
genuine inspiration, its our representations of our understanding of
the genes that are the true source of inspiration. Nobody sticks their
hand in a gelatinous puddle of DNA and spontaneously gets inspired by
the data it encodes.
Training data are construction materials for a language models. A
language model can never be inspired. It is itself a cultural artefact
derived from other cultural artefacts.
The machine learning process is loosely based on decades-old grossly
simplified models of how brains work.
A biological neuron is a complex system in its own right—one of the
more complex cells in an animals body. In a living brain, a biological
neuron will use electricity, multiple different classes of
neurotransmitters, and timing to accomplish its function in ways that
we still dont fully understand. It even has its own [21]built-in
engine for chemical energy.
The brain as a whole is composed of not just a massive neural network,
but also layers of hormonal chemical networks that dynamically modify
its function, both granularly and as a whole.
The digital neuron—a single signed floating point number—is to a
biological neuron what a flat-head screwdriver is to a Tesla.
They both contain metal and thats about the extent of their
similarity.
The human brain contains roughly 100 billion neuron cells, a layered
chemical network, and a cerebrovascular system that all integrate as a
whole to create a functioning, self-aware system capable of general
reasoning and autonomous behaviour. This system is multiple orders of
magnitude more complex than even the largest language model to date,
both in terms of individual neuron structure, and taken as a whole.
Its important to remember this so that we dont fall for marketing
claims that constantly imply that these tools are fully functioning
assistants.
__________________________________________________________________
The prompt
After all of this, we have a data set which can be used to generate
text in response to prompts.
Prompts such as:
Who was the first man on the moon?
The input phrase, or prompt, has no structure beyond the linguistic.
Its just a blob of text. You cant give the model commands or
parameters separately from other input. Because of this, if your model
lets a third party enter text, an attacker will always be able to
bypass whatever restrictions you put on it. Control prompts or prefixes
will be discovered and countermanded. Delimiters dont work.
Fine-tuning the model only limits the harm, but doesnt prevent it.
This is called a prompt injection and what it means is that model input
cant be secured. You have to assume that anybody that can send text to
the model has full access to it.
Language models need to be treated like an unsecured client and only
very carefully integrated into other systems.
The response
What youre likely to get back from that prompt would be something
like:
On July 20, 1969, Neil Armstrong became the first human to step on
the moon.
This is NASAs own phrasing. Most answers on the web are likely to be
variations on this, so the answer from a language model is likely to be
so too.
* The moon landing happens to be a fact, but the language model only
knows it as a text.
The prompt we provided is strongly associated in the training data set
with other sentences that are all variations of NASAs phrasing of the
answer. The model wont answer with just “Neil Armstrong” because it
isnt actually answering the question, its responding with the text
that correlates with the question. It doesnt “know” anything.
* The language model is fabricating a mathematically plausible
response, based on word distributions in the training data.
* There are no facts in a language model or its output. Only
memorised text.
It only fabricates. Its all “hallucinations” all the way down.
Occasionally those fabrications correlate with facts, but that is a
mathematical quirk resulting from the fact that, on average, what
people write roughly correlates with their understanding of a factual
reality, which in turn roughly correlates with a factual reality.
A knowledge system?
To be able to answer that question and pass as a knowledge system, the
model needs to memorise the answer, or at least parts of the phrase.
Because “AI” vendors are performing a sleight-of-hand here and
presenting statistical language synthesis engines as knowledge
retrieval systems, their focus in training and testing is on “facts”
and minimising “falsehoods”. The model has no notion of either, as its
entirely a language model, so the only way to square this circle is for
the model to memorise it all.
* To be able to answer a question factually, not “hallucinate”, and
pass as a knowledge system, the model needs to memorise the answer.
* The model doesnt know facts, only text.
* If you want a fact from it, the model will need to memorise text
that correlates with that fact.
“Dr. AI”?
Vendors then compound this by using human exams as benchmarks for
reasoning performance. The problem is that bar exams, medical exams,
and diagnosis tests are specifically designed to mostly test rote
memorisation. Thats what theyre for.
The human brain is bad at rote memorisation and generally it only
happens with intensive work and practice. If you want to design a test
thats specifically intended to verify that somebody has spent a large
amount of time studying a subject, you test for rote memorisation.
Many other benchmarks they use, such as those related to programming
languages also require memorisation, otherwise the systems would just
constantly make up APIs.
* Vendors use human exams as benchmarks.
* These are specifically designed to test rote memorisation, because
thats hard for humans.
* Programming benchmarks also require memorisation. Otherwise, youd
only get pseudocode.
Between the tailoring of these systems for knowledge retrieval, and the
use of rote memorisation exams and code generation as benchmarks, the
tech industry has created systems where memorisation is a core part of
how they function. In all research to date, memorisation has been key
to language model performance in a range of benchmarks.^[22][1]
If youre familiar with storytelling devices, this here would be a
[23]Chekhovs gun. Observe! The gun is above the mantelpiece:
👉🏻👉🏻 memorisation!
Make a note of it, because those finger guns are going to be fired
later.
Biases
Beyond question and answer, these systems are great at generating the
averagely plausible text for a given prompt. In prose, current system
output smells vaguely of sweaty-but-quiet LinkedIn desperation and
over-enthusiastic social media. The general style will vary, but its
always going to be the most plausible style and response based on the
training data.
One consequence of how these systems are made is that they are
constantly backwards-facing. Where brains are focused on the present,
often to their detriment, “AI” models are built using historical data.
The training data encompasses thousands of diverse voices, styles,
structures, and tones, but some word distributions will be more common
in the set than others and those will end up dominating the output. As
a result, language models tend to lean towards the “racist grandpa who
has learned to speak fluent LinkedIn” end of the spectrum.^[24][2]
This has implications for a whole host of use cases:
* Generated text is going to skew conservative in content and
marketing copy in structure and vocabulary. (Bigoted, prejudiced,
but polite and inoffensively phrased.)
* Even when the cut-off date for the data set is recent, its still
going to skew historical because whats new is also comparatively
smaller than the old.
* Language models will always skew towards the more common, middling,
mediocre, and predictable.
* Because most of these models are trained on the web, much of which
is unhinged, violent, pornographic, and abusive, some of that
language will be represented in the output.
Modify, summarise, and “reason”
The superpower that these systems provide is conversion or
modification. They can, generally, take text and convert it to another
style or structure. Take this note and turn it into a formal prose, and
it will! Thats amazing. I dont think thats a trillion-dollar
industry, but its a neat feature that will definitely be useful.
They can summarise text too, but thats much less reliable than youd
expect. It unsurprisingly works best with text that already provides
its own summary, such as a newspaper article (first paragraphs always
summarise the story), academic paper (the abstract), or corporate
writing (executive summary). Anything thats a mix of styles, voices,
or has an unusual structure wont work as well.
What little reasoning they do is entirely based on finding through
correlation and re-enacting prior textual descriptions of reasoning.
They fail utterly when confronted with adversarial or novel examples.
They also fail if you rephrase the question so that it no longer
correlates with the phrasing in the data set.^[25][3]
So, not actual reasoning. “Reasoning”, if you will. In other “AI” model
genres these correlations are often called “shortcuts”, which feels
apt.
To summarise:
* Language models are a mathematical expression of the training data
set.
* Have very little in common with human brains.
* Rely on inputs that cant be secured.
* Lie. Everything they output is a fabrication.
* Memorise heavily.
* Great for modifying text. No sarcasm. Genuinely good at this.
* Occasionally useful for summarisation if you dont mind being lied
to regularly.
* Dont actually reason.
Why I believe “AI” for programming is a bad idea
If you recall from the start of this essay, I began my research into
machine learning and language models because I was curious to see if
they could help fix or improve the mess that is modern software
development.
There was reason to be hopeful. Programming languages are more uniform
and structured than prose, so its not too unreasonable to expect that
they might lend themselves to language models. Programming language
output can often be tested directly, which might help with the
evaluation of each training run.
Training a language model on code also seems to benefit the model.
Models that include substantial code in their data set tend to be
better at correlative “reasoning” (to a point, still not actual
reasoning), which makes sense since code is all about representing
structured logic in text.
But, there is an inherent [26]Catch 22 to any attempt at fixing
software industry dysfunction with more software. The structure of the
industry depends entirely on variables that everybody pretends are
proxies for end user value, but generally arent. This will always tend
to sabotage our efforts at industrial self-improvement.
The more I studied language models as a technology the more flaws I
found until it became clear to me that odds are that the overall effect
on software development will be harmful. The problem starts with the
models themselves.
1. Language models cant be secured
This first issue has less to do with the use of language models for
software development and more to do with their use in software
products, which is likely to be a priority for many software companies
over the next few years.
Prompt injections are not a solved problem. OpenAI has come up with a
few “solutions” in the past, but none of them actually worked.
Everybody expects this to be fixed, but nobody has a clue how.
Language models are fundamentally based on the idea that you give it
text as input and get text as output. Its entirely possible that the
only way to completely fix this is to invent a completely new kind of
language model and spend a few years training it from scratch.
* A language model needs to be treated like an unsecured client. Its
about as secure as a web page form. Its vulnerable to a new
generation of injection vulnerabilities, both direct and indirect,
that we still dont quite understand.^[27][4]
The training data set itself is also a security hazard. Ive gone into
this in more detail elsewhere^[28][5], but the short version is that
training data set is vulnerable to keyword manipulation, both in terms
of altering sentiment and censorship.
Again, fully defending against this kind of attack would seem to
require inventing a completely new kind of language model.
Neither of these issues affect the use of language models for software
development, but it does affect our work because were the ones who
will be expected to integrate these systems into existing websites and
products.
2. It encourages the worst of our management and development practices
A language model will never question, push back, doubt, hesitate, or
waver.
Your managers are going to use it to flesh out and describe unworkable
ideas, and it wont complain. The resulting spec wont have any bearing
with reality.
People on your team will do “user research” by asking a language model,
which it will do even though the resulting research will be fiction and
entirely useless.
Itll let you implement the worst ideas ever in your code without
protest. Ask a copilot “how can I roll my own cryptography?” and itll
regurgitate a half-baked expression of sha1 in PHP for you.
Think of all the times youve had an idea for an approach, looked up
how to do it on the web, and found out that, no, this was a really bad
idea? I have a couple of those every week when Im in the middle of a
project.
Language models dont deliver productivity improvements. They increase
the volume, unchecked by reason.
A core aspect of the theory-building model of software development is
code that developers dont understand is a liability. It means your
mental model of the software is inaccurate which will lead you to
create bugs as you modify it or add other components that interact with
pieces you dont understand.
Language model tools for software development are specifically designed
to create large volumes of code that the programmer doesnt understand.
They are liability engines for all but the most experienced developer.
You cant solve this problem by having the “AI” understand the codebase
and how its various components interact with each other because a
language model isnt a mind. It cant have a mental model of anything.
It only works through correlation.
These tools will indeed make you go faster, but its going to be
accelerating in the wrong direction. That is objectively worse than
just standing still.
3. Its User Interfaces do not work, and we havent found interfaces that do
work
Human factors studies, the field responsible for designing cockpits and
the like, discovered that humans suffer from an automation bias.
What it means is that when you have cognitive automation—something that
helps you think less—you inevitably think less. That means that you are
less critical of the output than if you were doing it yourself. Thats
potentially catastrophic when the output is code, especially since the
quality of the generated code is, understandably considering how the
system works, broadly on the level of a novice developer.^[29][6]
Copilots and chatbots—exacerbated by anthropomorphism—seem to trigger
our automation biases.
Microsoft themselves have said that 40% of GitHub Copilots output is
committed unchanged.^[30][7]
Lets not get into the question of how we, as an industry, put
ourselves in the position where Microsoft can follow a line of code
from their language model, through your text editor, and into your
supposedly decentralised version control system.
People overwhelmingly seem to trust the output of a language model.
If it runs without errors, it must be fine.
But thats never the case. We all know this. Weve all seen running
code turn out to be buggy as hell. But something in our mind switches
off when we use tools for cognitive automation.
4. Its biased towards the stale and popular
The biases inherent in these language models are bad enough when it
comes to prose, but they become a functional problem in code.
* Its JS code will lean towards React and node, most of it several
versions old, and away from the less popular corners of the JS
ecosystem.
* The code is, inevitably, more likely to be built around CommonJS
modules instead of the modern ESM modules.
* It wont know much about Deno or Cloudflare Workers.
* Itll always prefer older APIs over new. Most of these models wont
know about any API or module released after 2021. This is going to
be an issue for languages such as Swift.
* New platforms and languages dont exist to it.
* Existing data will outweigh deprecations and security issues.
* Popular but obsolete or outdated open source projects will always
win out over the up-to-date equivalent.
These systems live in the popular past, like the middle-aged man who
doesnt realise he isnt the popular kid at school any more. Everything
he thinks is cool is actually very much not cool. More the other thing.
This is an issue for software because our industry is entirely
structured around constant change. Software security hinges on it. All
of our practices are based on constant march towards the new and fancy.
We go from framework to framework to try and find the magic solution
that will solve everything. In some cases language models might help
push back against that, but itll also push back against all the very
many changes that are necessary because the old stuff turned out to be
broken.
* The software industry is built on change.
* Language models are built on a static past.
5. No matter how the lawsuits go, this threatens the existence of free and
open source software
Many AI vendors are mired in lawsuits.^[31][8]
These lawsuits all concentrate on the relationship between the training
data set and the model and they do so from a variety of angles. Some
are based on contract and licensing law. Others are claiming that the
models violate fair use. Its hard to predict how they will go. They
might not all go the same way, as laws will vary across industries and
jurisdictions.
No matter the result, were likely to be facing a major decline in the
free and open source ecosystem.
1. All of these models are trained on open source code without payment
or even acknowledgement, which is a major disincentive for
contributors and maintainers. That large corporations might benefit
from your code is a fixture of open source, but they do
occasionally give back to the community.
2. Language models—built on open source code—commonly replace that
code. Instead of importing a module to do a thing, you prompt your
Copilot. The code generated is almost certainly based on the open
source module, at least partially, but it has been laundered
through the language model, disconnecting the programmer from the
community, recognition, and what little reward there was.
Language models demotivate maintainers and drain away both resources
and users. What youre likely to be left with are those who are
building core infrastructure or end-user software out of principle. The
“free software” side of the community is more likely to survive than
the rest. The Linux kernel, Gnome, KDE—that sort of thing.
The “open source” ecosystem, especially that surrounding the web and
node, is likely to be hit the hardest. The more driven the open source
project was by its proximity to either an employed contributor or
actively dependent business, the bigger the impact from a shift to
language models will be.
This is a serious problem for the software industry as arguably much of
the economic value the industry has provided over the past decade comes
from strip-mining open source and free software.
6. Licence contamination
Microsoft and Google dont train their language models on their own
code. GitHubs Copilot isnt trained on code from Microsofts office
suite, even though many of its products are likely to be some of the
largest React Native projects in existence. There arent many C++ code
bases as big as Windows. Googles repository is probably one of the
biggest collection of python and java code you can find.
They dont seem to use it for training, but instead train on
collections of open source code that contain both permissive and
copyleft licences.
Copyleft licences, if used, force you to release your own project under
their licence. Many of them, even non-copyleft, have patent clauses,
which is poison for quite a few employers. Even permissive licences
require attribution, and you can absolutely get sued if youre caught
copying open source code without attribution.
Remember our Chekhovs gun?
👉🏻👉🏻 memorisation!
Well, 👉🏻👉🏻 pewpew!!!
Turns out blindly copying open source code is problematic.
Whodathunkit?
These models all memorise a lot, and they tend to copy what they
memorise into their output. [32]GitHubs own numbers peg verbatim
copies of code thats at least 150 characters at 1%^[33][9], which is
roughly the same, in terms of verbatim copying, as what you seem to get
in other language models.
For context, that means that if you use a language model for
development, a copilot or chatbot, three or four times a day, youre
going to get a verbatim copy of open source code injected into your
project about once a month. If every team member uses one, then
multiply that by the size of the team.
GitHubs Copilot has a feature that lets you block verbatim copies.
This obviously requires both a check, which slows the result down, and
it will throw out a bunch of useful results, making the language model
less useful. Its already not as useful as its made out to be and
pretty darn slow so many people are going to turn off the “please dont
plagiarise” checkbox.
But even GitHubs checks are insufficient. The keyword there is
“verbatim”, because language models have a tendency to rephrase their
output. If GitHub Copilot copies a GPLed implementation of an algorithm
into your project but changes all the variable names, Copilot wont
detect it, itll still be plagiarism and the copied code is still under
the GPL. This isnt unlikely as this is how language models work.
Memorisation and then copying with light rephrasing is what they do.
Training the system only on permissively licensed code doesnt solve
the problem. It wont force your project to adopt an MIT licence or
anything like that, but you can still be sued if its discovered.
This would seem to give Microsoft and GitHub a good reason not to train
on the Office code base, for example. If they did, theres a good
chance that a prompt to generate DOCX parsing code might “generate” a
verbatim copy of the DOCX parsing code from Microsoft Word.
And they cant have that, can they? This would both undercut their own
strategic advantage, and it would break the illusion that these systems
are generating novel code from scratch.
This should make it clear that what theyre actually doing is
strip-mine the free and open source software ecosystem.
How much of a problem is this?
—It wont matter. I wont get caught.
You personally wont get caught, but your employer might, and
Intellectual Property scans or similar code audits tend to come up at
the absolute worst moments in the history of any given organisation:
* During due diligence for an acquisition. Could cost the company and
managers a fortune.
* In discovery for an unrelated lawsuit. Again, could cost the
company a fortune.
* During hacks and other security incidents. Could. Cost. A. Fortune.
“AI” vendors wont take any responsibility for this risk. I doubt your
business insurance covers “automated language model plagiarism”
lawsuits.
Language models for software development are a lawsuit waiting to
happen.
Unless they are completely reinvented from scratch, language model code
generators are, in my opinion, unsuitable for anything except for
prototypes and throwaway projects.
So, obviously, everybodys going to use them
* All the potentially bad stuff happens later. Unlikely to affect
your bonuses or employment.
* Itll be years before the first licence contamination lawsuits
happen.
* Most employees will be long gone before anybody realises just how
much of a bad idea it was.
* But youll still get that nice “AI” bump in the stock market.
What all of these problems have in common is that their impact is
delayed and most of them will only appear in the form of increased
frequency of bugs and other defects and general project chaos.
The biggest issue, licence contamination, will likely take years before
it starts to hit the industry, and is likely to be mitigated by virtue
of the fact that many of the heaviest users of “AI”-generated code will
have folded due to general mismanagement long before anybody cares
enough to check their code.
If you were ever wondering if we, as an industry, were capable of
coming up with a systemic issue to rival the Y2K bug in scale and
stupidity? Well, here you go.
You can start using a language model, get the stock market bump,
present the short term increase in volume as productivity, and be long
gone before anybody connects the dots between language model use and
the jump in defects.
Even if you purposefully tried to come up with a technology that played
directly into and magnified the software industrys dysfunctions you
wouldnt be able to come up with anything as perfectly imperfect as
these language models.
Its nonsense without consequence.
Counterproductive novelty that you can indulge in without harming your
career.
It might even do your career some good. Show that youre embracing the
future.
But…
The best is yet to come
In a few years time, once the effects of the “AI” bubble finally
dissipates…
Somebodys going to get paid to fix the crap it left behind.
The best way to support this newsletter or my blog is to buy one of my
books, [34]The Intelligence Illusion: a practical guide to the business
risks of Generative AI or [35]Out of the Software Crisis. Or, you can
buy them both [36]as a bundle.
__________________________________________________________________
1. Theres quite a bit of papers that either highlight the tendency to
memorise or demonstrate a strong relationship between that tendency
and eventual performance.
+ [37]An Empirical Study of Memorization in NLP (Zheng & Jiang,
ACL 2022)
+ [38]Does learning require memorization? a short tale about a
long tail. (Feldman, 2020)
+ [39]When is memorization of irrelevant training data necessary
for high-accuracy learning? (Brown et al. 2021)
+ [40]What Neural Networks Memorize and Why: Discovering the
Long Tail via Influence Estimation (Feldman & Zhang, 2020)
+ [41]Question and Answer Test-Train Overlap in Open-Domain
Question Answering Datasets (Lewis et al., EACL 2021)
+ [42]Quantifying Memorization Across Neural Language Models
(Carlini et al. 2022)
+ [43]On Training Sample Memorization: Lessons from Benchmarking
Generative Modeling with a Large-scale Competition (Bai et al.
2021)
[44]↩︎
2. See the [45]Bias & Safety card at [46]needtoknow.fyi for
references. [47]↩︎
3. See the [48]Shortcut “Reasoning” card at [49]needtoknow.fyi for
references. [50]↩︎
4. Simon Willison has been covering this issue [51]in a series of blog
posts. [52]↩︎
5.
+ [53]The poisoning of ChatGPT
+ [54]Google Bard is a glorious reinvention of black-hat SEO
spam and keyword-stuffing
[55]↩︎
6. See, for example:
+ [56]Asleep at the Keyboard? Assessing the Security of GitHub
Copilots Code Contributions (Hammond Pearce et al., December
2021)
+ [57]Do Users Write More Insecure Code with AI Assistants?
(Neil Perry et al., December 2022)
[58]↩︎
7. This came out [59]during an investor event and was presented as
evidence of the high quality of Copilots output. [60]↩︎
8.
+ [61]Getty Images v. Stability AI - Complaint
+ [62]Getty Images is suing the creators of AI art tool Stable
Diffusion for scraping its content
+ [63]The Wave of AI Lawsuits Have Begun
+ [64]Copyright lawsuits pose a serious threat to generative AI
+ [65]GitHub Copilot litigation
+ [66]Stable Diffusion litigation
[67]↩︎
9. Archived link of the [68]GitHub Copilot feature page. [69]↩︎
Baldur Bjarnason May 30th, 2023
Join the Newsletter
Subscribe to the Out of the Software Crisis newsletter to get my weekly
(at least) essays on how to avoid or get out of software development
crises.
Join now and get a free PDF of three bonus essays from Out of the
Software Crisis.
____________________
(BUTTON)
Subscribe
We respect your privacy.
Unsubscribe at any time.
[70]Mastodon [71]Twitter [72]GitHub [73]Feed
References
1. file:///index.xml
2. file:///feed.json
3. file:///
4. file:///
5. https://softwarecrisis.baldurbjarnason.com/
6. https://illusion.baldurbjarnason.com/
7. file:///archive/
8. file:///author/
9. https://www.hakkavelin.is/
10. https://illusion.baldurbjarnason.com/
11. https://softwarecrisis.baldurbjarnason.com/
12. https://baldurbjarnason.lemonsqueezy.com/checkout/buy/cfc2f2c6-34af-436f-91c1-cb2e47283c40
13. https://www.baldurbjarnason.com/2021/software-crisis-2/
14. https://standishgroup.com/sample_research_files/CHAOSReport2015-Final.pdf
15. https://softwarecrisis.baldurbjarnason.com/
16. https://quoteinvestigator.com/2019/09/19/woodpecker/
17. http://worrydream.com/refs/Brooks-NoSilverBullet.pdf
18. https://illusion.baldurbjarnason.com/
19. https://www.baldurbjarnason.com/2022/theory-building/
20. https://softwarecrisis.baldurbjarnason.com/
21. https://en.wikipedia.org/wiki/Mitochondrion
22. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn1
23. https://en.wikipedia.org/wiki/Chekhov's_gun
24. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn2
25. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn3
26. https://en.wikipedia.org/wiki/Catch-22_(logic)
27. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn4
28. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn5
29. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn6
30. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn7
31. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn8
32. https://archive.ph/2023.01.11-224507/https://github.com/features/copilot#selection-19063.298-19063.462:~:text=Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set.
33. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fn9
34. https://illusion.baldurbjarnason.com/
35. https://softwarecrisis.baldurbjarnason.com/
36. https://baldurbjarnason.lemonsqueezy.com/checkout/buy/cfc2f2c6-34af-436f-91c1-cb2e47283c40
37. https://aclanthology.org/2022.acl-long.434
38. https://doi.org/10.1145/3357713.3384290
39. https://doi.org/10.1145/3406325.3451131
40. https://papers.nips.cc/paper/2020/hash/1e14bfe2714193e7af5abc64ecbd6b46-Abstract.html
41. https://aclanthology.org/2021.eacl-main.86
42. https://arxiv.org/abs/2202.07646
43. https://dl.acm.org/doi/10.1145/3447548.3467198
44. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref1
45. https://needtoknow.fyi/card/bias/
46. https://needtoknow.fyi/
47. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref2
48. https://needtoknow.fyi/card/shortcut-reasoning/
49. https://needtoknow.fyi/
50. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref3
51. https://simonwillison.net/series/prompt-injection/
52. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref4
53. https://softwarecrisis.dev/letters/the-poisoning-of-chatgpt/
54. https://softwarecrisis.dev/letters/google-bard-seo/
55. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref5
56. https://doi.org/10.48550/arXiv.2108.09293
57. https://doi.org/10.48550/arXiv.2211.03622
58. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref6
59. https://www.microsoft.com/en-us/Investor/events/FY-2023/Morgan-Stanley-TMT-Conference#:~:text=Scott Guthrie: I think you're,is now AI-generated and unmodified
60. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref7
61. https://copyrightlately.com/pdfviewer/getty-images-v-stability-ai-complaint/?auto_viewer=true#page=&zoom=auto&pagemode=none
62. https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit
63. https://www.plagiarismtoday.com/2023/01/17/the-wave-of-ai-lawsuits-have-begun/
64. https://www.understandingai.org/p/copyright-lawsuits-pose-a-serious
65. https://githubcopilotlitigation.com/
66. https://stablediffusionlitigation.com/
67. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref8
68. https://archive.ph/2023.01.11-224507/https://github.com/features/copilot#selection-19063.298-19063.462:~:text=Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set.
69. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L83622-2637TMP.html#fnref9
70. https://toot.cafe/@baldur
71. https://twitter.com/fakebaldur
72. https://github.com/baldurbjarnason
73. file:///feed.xml