A Manifesto for Preserving Content on the Web

This Page is Designed to Last

By [1]Jeff Huang, published 2019-12-19, updated 2021-08-24

The end of the year is an opportunity to clean up and reset for the upcoming
new semester. I found myself clearing out old bookmarks (yes, bookmarks: that
formerly beloved browser feature that seems to have lost the battle to 'address
bar autocomplete'). But this nostalgic act of tidying led me to despair.

Bookmark after bookmark led to dead link after dead link. What's vanished:
unique pieces of writing on kuro5hin about tech culture; a collection of
mathematical puzzles and their associated discussion by academics that my
father introduced me to; Woodman's Reverse Engineering tutorials from my high
school years, where I first tasted the feeling of control over software; even
my most recent bookmark, a series of posts on Google+ exposing USB-C chargers'
non-compliance with the specification. All disappeared.

This is more than just link rot; it's the increasing complexity of keeping
indie content alive on the web, leading to a reliance on platforms and
time-sorted publication formats (blogs, feeds, tweets).

Of course, I have also contributed to the problem. A paper I published 7 years
ago has an abstract that includes a demo link, which has been taken over by a
spammy page with a pumpkin picture on it. Part of that lapse was laziness: I
wanted to avoid renewing and keeping a functioning web application up year
after year.

I've recommended that my students push websites to Heroku and publish
portfolios on Wix. Yet every platform with irreplaceable content dies off some
day. GeoCities, LiveJournal, what.cd, now Yahoo Groups. One day, Medium,
Twitter, and even hosting services like GitHub Pages will be plundered and then
discarded when they can no longer grow or cannot find a working business model.

The problem is multi-faceted. First, content takes effort to maintain. The
content may need updating to remain relevant, and will eventually have to be
rehosted. A lot of content, and what used to be the vast majority of it, was
put up by individuals. But individuals (maybe you?) lose interest, so one day
maybe you just don't want to deal with migrating a website to a new hosting
provider.

Second, a growing set of libraries and frameworks is making the web more
sophisticated but also more complex. First came jQuery, then Bootstrap, npm,
Angular, Grunt, webpack, and more. If you are a web developer who is keeping up
with the latest, then that's not a problem.

But if not (maybe you are an embedded systems programmer or startup CTO or
enterprise Java developer or chemistry PhD student), sure, you could probably
figure out how to set up some web server and toolchain, but will you keep this
up year after year, decade after decade? Probably not, and next year, when you
encounter a package dependency problem or have to figure out how to regenerate
your HTML files, you might just throw your hands up and zip up the files to
deal with "later". Even simple technology stacks like static site generators
(e.g., Jekyll) require a workflow and will stop working at some point. You fall
into npm dependency hell, and forget the command to package a release. And
having a website with multiple HTML pages is complex; how would you know how
the pages link to each other? index.html.old, Copy of about.html, index.html
(1), nav.html?

Third, and this has been touted by others already (and even [2]rebutted), the
disappearance of the public web in favor of mobile and web apps, walled gardens
(Facebook pages), just-in-time WebSockets loading, and AMP decreases the
proportion of the web on the world wide web, which now seems more like a
continental web than a "world wide web".

So what can we do about these problems? They are not so simple that they can be
solved in this one article. The Wayback Machine and archive.org help keep some
content around for longer. And sometimes an altruistic individual rehosts the
content elsewhere.

But the solution needs to be multi-pronged. How do we make web content that can
last and be maintained for at least 10 years? As someone studying
human-computer interaction, I naturally think of the stakeholders we aren't
supporting. Right now putting up web content is optimized for either the
professional web developer (who uses the latest frameworks and workflows) or
the non-tech savvy user (who uses a platform).

But I think we should consider both 1) the casual web content "maintainer",
someone who doesn't constantly stay up to date with the latest web
technologies, which means the website needs to have low maintenance needs; and
2) the crawlers that preserve the content and [3]personal archivers, the
"archiver", which means the website should be easy to save and interpret.

So my proposal is seven unconventional guidelines for how we handle websites
designed to be informative, to make them easy to maintain and preserve. The
guiding intention is that the maintainer will try to keep the website up for at
least 10 years, maybe even 20 or 30 years. These are not controversial views
necessarily, but are aspirations that are not mainstream: a manifesto for a
long-lasting website.

1. Return to vanilla HTML/CSS – I think we've reached the point where HTML/CSS
   is more powerful and nicer to use than ever before. Instead of starting
   with a giant template filled with .js includes, it's now okay to just write
   plain HTML from scratch again. CSS Flexbox and Grid, canvas, Selectors,
   box-shadow, the video element, filter, etc. eliminate a lot of the need for
   JavaScript libraries. We can avoid jQuery and Bootstrap when they're not
   needed. The more libraries incorporated into the website, the more fragile
   it becomes. Skip the polyfills and CSS prefixes, and stick with the CSS
   properties that work across all browsers. And frequently validate your
   HTML; it could save you a headache in the future when you encounter a bug.
2. Don't minimize that HTML – minimizing (minifying) your HTML and associated
   CSS/JS seems like it saves precious bandwidth, and all the big companies
   are doing it. So why not? Well, you don't save much because your web pages
   should be gzipped before being sent over the network, so preemptively
   shrinking your content probably doesn't do much to save bandwidth, if
   anything at all. But even if it did save a few bytes (it's just text in the
   end), you now need to have a build process and to add this to your
   workflow, so updating a website just became more complex. If there's a bug
   or future incompatibility in the HTML, the minimized form is harder to
   debug. And it's unfriendly to your users; so many people got their start
   with HTML by smashing that View Source button, and minimizing your HTML
   undermines this ideal of learning by seeing what others did. Minimizing
   HTML does not preserve its educational quality, and what gets archived is
   only the resulting codejunk.
3. Prefer one page over several – several pages are hard to maintain. You can
   lose track of which pages link to what, and it also leads to some system of
   page templates to reduce redundancy. How many pages can one person really
   maintain? Having one file, probably just an index.html, is simple and
   unforgettable. Make use of that infinite vertical scroll. You never have to
   dig around your files or grep to see where some content lies. And how
   should you version control that file? Should you use git? Shove old copies
   in an 'old/' folder? Well, I like the simple approach of naming old files
   with the date they are retired, like index.20191213.html. Using the ISO
   format of the date makes it so that it sorts easily, and there's no
   confusion between American and European date formats. If I have multiple
   versions in one day, I would use a style similar to that which is customary
   in log files, of index.20191213.1.html. A nice side effect is then you can
   access an older version of the file if you remember the date, without
   logging into the web host.
4. End all forms of hotlinking – this cautionary word seems to have
   disappeared from internet vocabulary, but it's one of the reasons I've seen
   a perfectly good website fall apart for no reason. Stop directly including
   images from other websites, stop "borrowing" stylesheets by just linking to
   them, and especially stop linking to JavaScript files, even the ones hosted
   by the original developers. Hotlinking is [4]usually considered rude since
   your visitors use someone else's bandwidth, it makes the user experience
   slower, you let another website track your users, and worst of all, if the
   location you're linking to changes its folder structure or just goes
   offline, then the failure cascades to your website as well. Google
   Analytics is unnecessary; store your own server logs and set up [5]GoAccess
   or cut them up however you like, giving you more detailed statistics. Don't
   give away your logs to Google for free.
5. Stick with native fonts – we're focusing on content first, so decorative
   and unusual typefaces are completely unnecessary. Stick with either the 13
   web-safe fonts or a [6]system font stack that matches the default font to
   the operating system of your visitor. Using the system font stack might
   look a bit different between operating systems, but your layout shouldn't
   be so brittle that an extra word wrap will ruin it. Then you don't have to
   worry about the flashing font problem either. Your focus should be on
   delivering the content to the user effectively and making the choice of
   font invisible, rather than getting noticed to stroke your design ego.
6. Obsessively compress your images – faster for your users, less space to
   archive, and easier to maintain when you don't have to back up a humongous
   folder. Your images can have the same high quality, but be smaller.
   [7]Minify your SVGs, losslessly compress your PNGs, generate JPEGs to
   exactly fit the width of the image. It's worth spending some time figuring
   out the best way to compress and [8]reduce the size of your images without
   losing quality. And once [9]WebP gains support on Safari, switch over to
   that format. Ruthlessly minimize the total size of your website and keep it
   as small as possible. Every MB can cost someone real money, and in fact, my
   mobile carrier (Google Fi) charges a cent per MB, so a 25 MB website, which
   is fairly common nowadays, costs a quarter by itself, about as much as a
   newspaper when I was a child.
7. Eliminate the broken URL risk – there are [10]monitoring services that will
   tell you when your URL is down, preventing you from realizing one day that
   your homepage hasn't been loading for a month and the search engines have
   deindexed it. This matters because 10 years is longer than most hard drives
   or operating systems are meant to last. But to eliminate the risk of a URL
   breaking completely, set up a second monitoring service. If the first one
   stops for any reason (they move to a pay model, they shut down, you forget
   to renew something, etc.), you will still get one notification when your
   URL is down, then realize the other monitoring service is down because you
   didn't get the second notification. Remember that we're trying to keep
   something up for over 10 years (ideally way longer, even 30 years), and a
   lot of services will shut down during this period, so two monitoring
   services are safer.

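Guideline 2 rests on the claim that minimizing saves little once a page is
gzipped. One way to check this against your own page is a small measurement
like the sketch below. It is only an illustration: naive_minify is a
hypothetical, crude stand-in for a real minifier (it just collapses whitespace,
which would also damage <pre> blocks), size_report is a made-up helper name,
and the actual numbers depend on the page and on how the server compresses it.

```python
import gzip
import re


def naive_minify(html: str) -> str:
    """Crude stand-in for a minifier: collapse each whitespace run to one space.

    Assumption for illustration only; real minifiers do more, and this one
    would also flatten whitespace inside <pre> blocks.
    """
    return re.sub(r"\s+", " ", html).strip()


def size_report(html: str) -> dict:
    """Return byte counts for the page as written, minified, and gzipped."""
    raw = html.encode("utf-8")
    mini = naive_minify(html).encode("utf-8")
    return {
        "raw": len(raw),
        "minified": len(mini),
        "raw_gzipped": len(gzip.compress(raw)),
        "minified_gzipped": len(gzip.compress(mini)),
    }


if __name__ == "__main__":
    # A made-up, heavily indented page fragment used only as demo input.
    sample = (
        "<ul>\n"
        + "".join(f"    <li>\n        Item {i}\n    </li>\n" for i in range(200))
        + "</ul>\n"
    )
    for name, size in size_report(sample).items():
        print(f"{name:>17}: {size} bytes")
```

To try it on a real page, read your index.html as text and pass it to
size_report; comparing "raw_gzipped" with "minified_gzipped" shows how much of
the minifier's saving survives compression.
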
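The dated naming scheme from guideline 3 (index.20191213.html, then
index.20191213.1.html for a second version on the same day) can be sketched in
a few lines of Python. The function names retired_name and retire are
hypothetical, and copying the file rather than moving it is an assumption made
here so that the live index.html stays in place.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Iterable, Optional


def retired_name(path: Path, day: date, existing: Iterable[str]) -> str:
    """Build the dated name for a retired file.

    index.html retired on 2019-12-13 becomes index.20191213.html; if that name
    is already taken, a counter is appended: index.20191213.1.html, and so on.
    """
    taken = set(existing)
    stamp = day.strftime("%Y%m%d")  # ISO 8601 basic date, sorts chronologically
    candidate = f"{path.stem}.{stamp}{path.suffix}"
    n = 0
    while candidate in taken:
        n += 1
        candidate = f"{path.stem}.{stamp}.{n}{path.suffix}"
    return candidate


def retire(path: Path, day: Optional[date] = None) -> Path:
    """Copy the current file to its dated name in the same folder."""
    day = day or date.today()
    siblings = [p.name for p in path.parent.iterdir()]
    target = path.with_name(retired_name(path, day, siblings))
    shutil.copy2(path, target)  # copy2 also keeps the modification time
    return target
```

Calling retire(Path("index.html")) just before editing the page leaves a dated
copy beside it, which the web server can then serve at the dated URL.
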
After doing these things, go ahead and place a bit of text in the footer, "The
page was designed to last", linking to this page explaining what that means.
The words promise that the maintainer will do their best to follow the ideas in
this manifesto.

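Before adding that footer line, a maintainer might want a quick self-check
against guideline 4. The following Python sketch lists resources a page loads
from other hosts, using only the standard library. The names HotlinkFinder and
find_hotlinks, the choice of tags to inspect, and the host mysite.example are
all assumptions for illustration; it does not look inside CSS for url(...)
references or @import rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags that make the browser fetch a resource, and the attribute with its URL.
# Ordinary <a href> links are deliberately not listed: linking is not hotlinking.
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "iframe": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
    "link": "href",  # only counted when rel contains "stylesheet"
}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources served from another host."""

    def __init__(self, own_host: str):
        super().__init__()
        self.own_host = own_host
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        attr = RESOURCE_ATTRS.get(tag)
        if attr is None:
            return
        values = dict(attrs)
        url = values.get(attr)
        if not url:
            return
        if tag == "link" and "stylesheet" not in (values.get("rel") or "").lower():
            return
        host = urlparse(url).netloc  # empty for relative URLs
        if host and host != self.own_host:
            self.hotlinks.append((tag, url))


def find_hotlinks(html: str, own_host: str) -> list:
    """Return every externally hosted resource referenced by the page."""
    finder = HotlinkFinder(own_host)
    finder.feed(html)
    return finder.hotlinks
```

An empty result from find_hotlinks(page_text, "mysite.example") suggests that
the page's images, scripts, and stylesheets are all served from your own host
or from relative paths.
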
Before you protest, this is obviously not for web applications. If you are
making an application, then make your web or mobile app with the workflow you
need. I don't even know any web applications that have remained similarly
functioning over 10 years so it seems like a lost cause anyway (except Philip
Guo's Python Tutor, due to his [11]minimalist strategy for maintaining it).
It's also not for websites maintained by an organization like Wikipedia or
Twitter. The salaries for an IT team are probably enough to keep a website
alive for a while.

In fact, it's not even that important you strictly follow the 7 "rules", as
they're more of a provocation than strict rules.

But let's say some small part of the web starts designing websites to last for
content that is meant to last. What happens then? Well, people may prefer to
link to them since they have a promise of working in the future. People more
generally may be more mindful of making their pages more permanent. And users
and archivers both save bandwidth when visiting and storing these pages.

The effects are long term, but the achievements are incremental and can be
implemented by website owners without being dependent on anyone else or waiting
for a network effect. You can do this now for your website, and that already
would be a positive outcome. Like using a recycled shopping bag instead of
taking a plastic one, it's a small individual action.

This article is meant to provoke and lead to individual action, not propose a
complete solution to the decaying web. It's a small simple step for a complex
sociotechnical system. So I'd love to see this happen. I intend to keep this
page up for at least 10 years.

If you are interested in receiving updates to [12]irchiver, our project for a
personal archive of the web pages you visit, please [13]subscribe here.

Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie Massachi,
Alexandra Papoutsaki, my colleagues James Tompkin, Stephen Bach, my teaching
assistant Kathleen Chai, and my research assistant Yusuf Karim for feedback on
earlier drafts.

See discussions on [14]Hacker News and [15]reddit /r/programming

Also in this series

[16]Behind the scenes: the struggle for each paper to get published
[17]Illustrative notes for obsessing over publishing aesthetics

Other articles I've written

[18]My productivity app is a never-ending .txt file
[19]The Coronavirus pandemic has changed our sleep behavior
[20]Extracting data from tracking devices by going to the cloud
[21]CS Faculty Composition and Hiring Trends
[22]Bias in Computer Science Rankings
[23]Who Wins CS Best Paper Awards?
[24]Verified Computer Science Ph.D. Stipends

This page is [25]designed to last.

References:

[1] https://jeffhuang.com/
[2] https://gomakethings.com/the-web-is-not-dying/
[3] https://archivebox.io/
[4] https://webmasters.stackexchange.com/questions/25315/hotlinking-what-is-it-and-why-shouldnt-people-do-it
[5] https://goaccess.io/
[6] https://systemfontstack.com/
[7] https://victorzhou.com/blog/minify-svgs/
[8] https://evilmartians.com/chronicles/images-done-right-web-graphics-good-to-the-last-byte-optimization-techniques
[9] https://caniuse.com/#feat=webp
[10] https://uptimerobot.com/
[11] https://pg.ucsd.edu/publications/Python-Tutor-scalable-sustainable-research-software_UIST-2021.pdf
[12] https://irchiver.com/
[13] https://docs.google.com/forms/d/e/1FAIpQLSeTCgnwF1gjrc1O8mfJ_5TmT_TLowFQ2DUhsollmqPG84pAFQ/viewform?usp=pp_url&entry.1299571007=irchiver:+your+full-resolution+personal+web+archive+and+search&entry.1760653896=designed_to_last
[14] https://news.ycombinator.com/item?id=21840140
[15] https://www.reddit.com/r/programming/comments/ed88ra/this_page_is_designed_to_last_a_manifesto_for/
[16] https://jeffhuang.com/struggle_for_each_paper/
[17] https://jeffhuang.com/illustrative-notes-for-publishing-aesthetics/
[18] https://jeffhuang.com/productivity_text_file/
[19] https://jeffhuang.com/covid_sleep/
[20] https://jeffhuang.com/extracting_data_from_tracking_devices/
[21] https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends
[22] https://jeffhuang.com/computer-science-open-data/#bias-in-computer-science-rankings
[23] https://jeffhuang.com/computer-science-open-data/#who-wins-cs-best-paper-awards
[24] https://jeffhuang.com/computer-science-open-data/#verified-computer-science-phd-stipends
[25] http://jeffhuang.com/designed_to_last/