A Manifesto for Preserving Content on the Web

This Page is Designed to Last

By [1]Jeff Huang, published 2019-12-19, updated 2021-08-24

The end of the year is an opportunity to clean up and reset for the
upcoming semester. I found myself clearing out old bookmarks. Yes,
bookmarks: that formerly beloved browser feature that seems to have
lost the battle to 'address bar autocomplete'. But this nostalgic act
of tidying led me to despair.

Bookmark after bookmark led to dead link after dead link. What's
vanished: unique pieces of writing on kuro5hin about tech culture; a
collection of mathematical puzzles and their associated discussion by
academics that my father introduced me to; Woodman's Reverse
Engineering tutorials from my high school years, where I first tasted
the feeling of control over software; even my most recent bookmark, a
series of posts on Google+ exposing USB-C chargers' non-compliance with
the specification. All have disappeared.

This is more than just link rot; it's the increasing complexity of
keeping indie content alive on the web, leading to a reliance on
platforms and time-sorted publication formats (blogs, feeds, tweets).

Of course, I have also contributed to the problem. A paper I published
7 years ago has an abstract that includes a demo link, which has since
been taken over by a spammy page with a pumpkin picture on it. Part of
that lapse was laziness: I wanted to avoid having to renew and keep a
functioning web application up year after year.

I've recommended that my students push websites to Heroku and publish
portfolios on Wix. Yet every platform with irreplaceable content dies
off some day: Geocities, LiveJournal, what.cd, now Yahoo Groups. One
day, Medium, Twitter, and even hosting services like GitHub Pages will
be plundered and then discarded when they can no longer grow or cannot
find a working business model.

The problem is multi-faceted. First, content takes effort to maintain.
The content may need updating to remain relevant, and will eventually
have to be rehosted. A lot of content, what used to be the vast
majority of content, was put up by individuals. But individuals (maybe
you?) lose interest, so one day maybe you just don't want to deal with
migrating a website to a new hosting provider.

Second, a growing set of libraries and frameworks is making the web
more sophisticated but also more complex. First came jQuery, then
Bootstrap, npm, Angular, Grunt, webpack, and more. If you are a web
developer who is keeping up with the latest, then that's not a problem.

But if not (maybe you are an embedded systems programmer or startup CTO
or enterprise Java developer or chemistry PhD student), sure, you could
probably figure out how to set up some web server and toolchain, but
will you keep this up year after year, decade after decade? Probably
not, and next year, when you encounter a package dependency problem or
have to figure out how to regenerate your HTML files, you might just
throw your hands up and zip up the files to deal with "later". Even
simple technology stacks like static site generators (e.g., Jekyll)
require a workflow and will stop working at some point. You fall into
npm dependency hell, and forget the command to package a release. And
having a website with multiple HTML pages is complex; how would you
know how the pages link to one another? index.html.old, Copy of
about.html, index.html (1), nav.html?

Third, and this has been touted by others already (and even
[2]rebutted), the disappearance of the public web in favor of mobile
and web apps, walled gardens (Facebook pages), just-in-time WebSockets
loading, and AMP decreases the proportion of the web on the world wide
web, which now seems more like a continental web than a "world wide
web".

So what can we do about these problems? They are not simple enough to
be solved in this one article. The Wayback Machine and archive.org help
keep some content around for longer. And sometimes an altruistic
individual rehosts the content elsewhere.

But the solution needs to be multi-pronged. How do we make web content
that can last and be maintained for at least 10 years? As someone
studying human-computer interaction, I naturally think of the
stakeholders we aren't supporting. Right now putting up web content is
optimized for either the professional web developer (who uses the
latest frameworks and workflows) or the non-tech-savvy user (who uses a
platform).

But I think we should consider both 1) the casual web content
"maintainer", someone who doesn't constantly stay up to date with the
latest web technologies, which means the website needs to have low
maintenance needs; and 2) the "archiver", meaning the crawlers and
[3]personal archivers that preserve the content, which means the
website should be easy to save and interpret.

So my proposal is seven unconventional guidelines for how we handle
websites designed to be informative, to make them easy to maintain and
preserve. The guiding intention is that the maintainer will try to keep
the website up for at least 10 years, maybe even 20 or 30 years. These
are not necessarily controversial views, but they are aspirations that
are not mainstream: a manifesto for a long-lasting website.
1. Return to vanilla HTML/CSS – I think we've reached the point where
   HTML/CSS is more powerful and nicer to use than ever before.
   Instead of starting with a giant template filled with .js includes,
   it's now okay to just write plain HTML from scratch again. CSS
   Flexbox and Grid, canvas, Selectors, box-shadow, the video element,
   filter, etc. eliminate a lot of the need for JavaScript libraries.
   We can avoid jQuery and Bootstrap when they're not needed. The more
   libraries incorporated into the website, the more fragile it
   becomes. Skip the polyfills and CSS prefixes, and stick with the
   CSS properties that work across all browsers. And frequently
   validate your HTML; it could save you a headache in the future when
   you encounter a bug.
2. Don't minimize that HTML – minimizing (minifying) your HTML and
   associated CSS/JS seems like it saves precious bandwidth, and all
   the big companies are doing it. So why not? Well, you don't save
   much because your web pages should be gzipped before being sent
   over the network, so preemptively shrinking your content probably
   doesn't do much to save bandwidth, if anything at all. But even if
   it did save a few bytes (it's just text in the end), you now need
   to have a build process and to add this to your workflow, so
   updating a website just became more complex. If there's a bug or
   future incompatibility in the HTML, the minimized form is harder to
   debug. And it's unfriendly to your users; so many people got their
   start with HTML by smashing that View Source button, and minimizing
   your HTML prevents this ideal of learning by seeing what others
   did. Minimizing HTML does not preserve its educational quality, and
   what gets archived is only the resulting codejunk.
3. Prefer one page over several – several pages are hard to maintain.
   You can lose track of which pages link to what, and it also leads
   to some system of page templates to reduce redundancy. How many
   pages can one person really maintain? Having one file, probably
   just an index.html, is simple and unforgettable. Make use of that
   infinite vertical scroll. You never have to dig around your files
   or grep to see where some content lies. And how should you version
   control that file? Should you use git? Shove old copies in an
   'old/' folder? Well, I like the simple approach of naming old files
   with the date they are retired, like index.20191213.html. Using the
   ISO format of the date means the files sort easily, and there's no
   confusion between American and European date formats. If I have
   multiple versions in one day, I would use a style similar to that
   which is customary in log files: index.20191213.1.html. A nice
   side effect is that you can then access an older version of the
   file if you remember the date, without logging into the web host.
4. End all forms of hotlinking – this cautionary word seems to have
   disappeared from internet vocabulary, but it's one of the reasons
   I've seen a perfectly good website fall apart for no reason. Stop
   directly including images from other websites, stop "borrowing"
   stylesheets by just linking to them, and especially stop linking to
   JavaScript files, even the ones hosted by the original developers.
   Hotlinking is [4]usually considered rude since your visitors use
   someone else's bandwidth, it makes the user experience slower, you
   let another website track your users, and worst of all, if the
   location you're linking to changes its folder structure or just
   goes offline, then the failure cascades to your website as well.
   Google Analytics is unnecessary; store your own server logs and set
   up [5]GoAccess or cut them up however you like, giving you more
   detailed statistics. Don't give away your logs to Google for free.
5. Stick with native fonts – we're focusing on content first, so
   decorative and unusual typefaces are completely unnecessary. Stick
   with either the 13 web-safe fonts or a [6]system font stack that
   matches the default font to the operating system of your visitor.
   Using the system font stack might look a bit different between
   operating systems, but your layout shouldn't be so brittle that an
   extra word wrap will ruin it. Then you don't have to worry about
   the flashing font problem either. Your focus should be on
   delivering the content to the user effectively and making the
   choice of font invisible, rather than getting noticed to stroke
   your design ego.
6. Obsessively compress your images – faster for your users, less
   space to archive, and easier to maintain when you don't have to
   back up a humongous folder. Your images can have the same high
   quality, but be smaller. [7]Minify your SVGs, losslessly compress
   your PNGs, and generate JPEGs that exactly fit the width at which
   the image is displayed. It's worth spending some time figuring out
   the best way to compress and [8]reduce the size of your images
   without losing quality. And once [9]WebP is supported by every
   browser you care about (Safari added it in version 14), switch over
   to that format. Ruthlessly minimize the total size of your website
   and keep it as small as possible. Every MB can cost someone real
   money, and in fact, my mobile carrier (Google Fi) charges a cent
   per MB, so a 25 MB website, which is fairly common nowadays, costs
   a quarter by itself, about as much as a newspaper when I was a
   child.
7. Eliminate the broken URL risk – there are [10]monitoring services
   that will tell you when your URL is down, preventing you from
   realizing one day that your homepage hasn't been loading for a
   month and the search engines have deindexed it. This matters
   because 10 years is longer than most hard drives or operating
   systems are meant to last. But to eliminate the risk of a URL
   breaking completely, set up a second monitoring service. If the
   first one stops for any reason (they move to a pay model, they
   shut down, you forget to renew something, etc.), you will still
   get one notification when your URL is down, then realize the other
   monitoring service is down because you didn't get the second
   notification. Remember that we're trying to keep something up for
   over 10 years (ideally way longer, even 30 years), and a lot of
   services will shut down during this period, so two monitoring
   services are safer.

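The gzip argument in guideline 2 can be checked with a minimal Python
sketch. The sample page and the naive_minify helper below are
hypothetical stand-ins (a real minifier does more than strip whitespace
between tags), so the numbers it prints are illustrative, not a
benchmark.

```python
import gzip
import re


def build_page(items=40):
    """Build a readable, indented HTML page (hypothetical sample content)."""
    rows = "\n".join(
        f'      <li>\n        <a href="#note-{i}">Note number {i}</a>\n      </li>'
        for i in range(items)
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <head>\n    <title>Notes</title>\n"
        "  </head>\n  <body>\n    <ul>\n" + rows + "\n    </ul>\n  </body>\n</html>\n"
    )


def naive_minify(html):
    """Crude stand-in for a minifier: drop the whitespace between tags."""
    return re.sub(r">\s+<", "><", html).strip()


def sizes(html):
    """Return (raw byte count, gzipped byte count) for the given text."""
    raw = html.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


readable = build_page()
minified = naive_minify(readable)

raw_readable, gz_readable = sizes(readable)
raw_minified, gz_minified = sizes(minified)

# Bytes saved by minifying, before and after gzip is applied.
raw_saving = raw_readable - raw_minified
gzip_saving = gz_readable - gz_minified

print(f"readable: {raw_readable} bytes raw, {gz_readable} bytes gzipped")
print(f"minified: {raw_minified} bytes raw, {gz_minified} bytes gzipped")
print(f"saved by minifying: {raw_saving} bytes raw, {gzip_saving} bytes gzipped")
```

Most of what this kind of minification removes is repeated indentation,
which gzip already encodes cheaply, so the saving after gzip should come
out far smaller than the saving before it.
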
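The naming scheme from guideline 3 can be sketched in a few lines of
Python. The function names retired_name and retire are hypothetical,
chosen for this sketch, and copying the file is only one assumed way to
"retire" a version; renaming by hand works just as well.

```python
import tempfile
from datetime import date
from pathlib import Path


def retired_name(live, day):
    """Pick the archive name for `live` when it is retired on `day`.

    index.html becomes index.20191213.html; if that name is taken, fall
    back to index.20191213.1.html, index.20191213.2.html, and so on.
    """
    stamp = day.strftime("%Y%m%d")  # ISO 8601 basic date, sorts as text
    candidate = live.with_name(f"{live.stem}.{stamp}{live.suffix}")
    counter = 0
    while candidate.exists():
        counter += 1
        candidate = live.with_name(f"{live.stem}.{stamp}.{counter}{live.suffix}")
    return candidate


def retire(live, day):
    """Copy the live file to its dated name and return the new path."""
    target = retired_name(live, day)
    target.write_bytes(live.read_bytes())
    return target


# Demo in a throwaway directory: retire the same page twice in one day.
demo_dir = Path(tempfile.mkdtemp())
page = demo_dir / "index.html"
page.write_text("<h1>hello</h1>")
print(retire(page, date(2019, 12, 13)).name)  # index.20191213.html
print(retire(page, date(2019, 12, 13)).name)  # index.20191213.1.html
```

Because the date is written year first, a plain alphabetical directory
listing shows the retired versions in chronological order.
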
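Guideline 4 can be audited with a small script. This is a minimal
sketch using only Python's standard library; the class name
HotlinkFinder, the tag table, and the example.com hosts are assumptions
made for illustration. It flags resources the browser fetches from
another host (images, scripts, stylesheets, embedded media) and leaves
ordinary links alone, since linking out is not hotlinking.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag -> attribute that makes the browser fetch a resource for the page.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
    "source": "src",
    "video": "src",
    "audio": "src",
}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources loaded from other hosts."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted is None:
            return
        attr_map = dict(attrs)
        # For <link>, only stylesheets are checked in this sketch.
        if tag == "link" and "stylesheet" not in (attr_map.get("rel") or "").lower():
            return
        url = attr_map.get(wanted)
        if not url:
            return
        host = urlparse(url).netloc  # empty for relative URLs
        if host and host != self.own_host:
            self.hotlinks.append((tag, url))


page = """
<link rel="stylesheet" href="https://cdn.example.net/bootstrap.css">
<script src="//code.example.org/jquery.js"></script>
<img src="images/photo.jpg">
<img src="https://example.com/images/me.jpg">
<a href="https://example.org/article">an ordinary link, not a hotlink</a>
"""

finder = HotlinkFinder(own_host="example.com")
finder.feed(page)
for tag, url in finder.hotlinks:
    print(f"hotlinked <{tag}>: {url}")
```

Each flagged file is a candidate to download and serve from your own
folder instead.
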
After doing these things, go ahead and place a bit of text in the
footer, "This page is designed to last", linking to this page
explaining what that means. The words promise that the maintainer will
do their best to follow the ideas in this manifesto.

Before you protest, this is obviously not for web applications. If you
are making an application, then make your web or mobile app with the
workflow you need. I don't even know of any web applications that have
kept functioning in a similar way for over 10 years, so it seems like a
lost cause anyway (except Philip Guo's Python Tutor, due to his
[11]minimalist strategy for maintaining it). It's also not for websites
maintained by an organization like Wikipedia or Twitter. The salaries
for an IT team are probably enough to keep a website alive for a while.

In fact, it's not even that important that you strictly follow the 7
"rules", as they're more of a provocation than strict rules.

But let's say some small part of the web starts designing websites to
last for content that is meant to last. What happens then? Well, people
may prefer to link to them since they have a promise of working in the
future. People more generally may be more mindful of making their pages
more permanent. And users and archivers both save bandwidth when
visiting and storing these pages.

The effects are long term, but the achievements are incremental and can
be implemented by website owners without being dependent on anyone else
or waiting for a network effect. You can do this now for your website,
and that already would be a positive outcome. Like using a recycled
shopping bag instead of taking a plastic one, it's a small individual
action.

This article is meant to provoke and lead to individual action, not to
propose a complete solution to the decaying web. It's a small, simple
step for a complex sociotechnical system. So I'd love to see this
happen. I intend to keep this page up for at least 10 years.

If you are interested in receiving updates to [12]irchiver, our project
for a personal archive of the web pages you visit, please [13]subscribe
here.

Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie
Massachi, Alexandra Papoutsaki, my colleagues James Tompkin, Stephen
Bach, my teaching assistant Kathleen Chai, and my research assistant
Yusuf Karim for feedback on earlier drafts.

See discussions on [14]Hacker News and [15]reddit /r/programming

Also in this series

[16]Behind the scenes: the struggle for each paper to get published
[17]Illustrative notes for obsessing over publishing aesthetics

Other articles I've written

[18]My productivity app is a never-ending .txt file
[19]The Coronavirus pandemic has changed our sleep behavior
[20]Extracting data from tracking devices by going to the cloud
[21]CS Faculty Composition and Hiring Trends
[22]Bias in Computer Science Rankings
[23]Who Wins CS Best Paper Awards?
[24]Verified Computer Science Ph.D. Stipends

This page is [25]designed to last.

References

   1. https://jeffhuang.com/
   2. https://gomakethings.com/the-web-is-not-dying/
   3. https://archivebox.io/
   4. https://webmasters.stackexchange.com/questions/25315/hotlinking-what-is-it-and-why-shouldnt-people-do-it
   5. https://goaccess.io/
   6. https://systemfontstack.com/
   7. https://victorzhou.com/blog/minify-svgs/
   8. https://evilmartians.com/chronicles/images-done-right-web-graphics-good-to-the-last-byte-optimization-techniques
   9. https://caniuse.com/#feat=webp
  10. https://uptimerobot.com/
  11. https://pg.ucsd.edu/publications/Python-Tutor-scalable-sustainable-research-software_UIST-2021.pdf
  12. https://irchiver.com/
  13. https://docs.google.com/forms/d/e/1FAIpQLSeTCgnwF1gjrc1O8mfJ_5TmT_TLowFQ2DUhsollmqPG84pAFQ/viewform?usp=pp_url&entry.1299571007=irchiver:+your+full-resolution+personal+web+archive+and+search&entry.1760653896=designed_to_last
  14. https://news.ycombinator.com/item?id=21840140
  15. https://www.reddit.com/r/programming/comments/ed88ra/this_page_is_designed_to_last_a_manifesto_for/
  16. https://jeffhuang.com/struggle_for_each_paper/
  17. https://jeffhuang.com/illustrative-notes-for-publishing-aesthetics/
  18. https://jeffhuang.com/productivity_text_file/
  19. https://jeffhuang.com/covid_sleep/
  20. https://jeffhuang.com/extracting_data_from_tracking_devices/
  21. https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends
  22. https://jeffhuang.com/computer-science-open-data/#bias-in-computer-science-rankings
  23. https://jeffhuang.com/computer-science-open-data/#who-wins-cs-best-paper-awards
  24. https://jeffhuang.com/computer-science-open-data/#verified-computer-science-phd-stipends
  25. http://jeffhuang.com/designed_to_last/