A Manifesto for Preserving Content on the Web
This Page is Designed to Last
By [1]Jeff Huang, published 2019-12-19, updated 2021-08-24
The end of the year is an opportunity to clean up and reset for the upcoming
new semester. I found myself clearing out old bookmarks—yes, bookmarks: that
formerly beloved browser feature that seems to have lost the battle to 'address
bar autocomplete'. But this nostalgic act of tidying led me to despair.
Bookmark after bookmark led to dead link after dead link. What's vanished:
unique pieces of writing on kuro5hin about tech culture; a collection of
mathematical puzzles and the associated discussion by academics that my
father introduced me to; Woodman's Reverse Engineering tutorials from my high
school years, where I first tasted the feeling of control over software; even
my most recent bookmark, a series of posts on Google+ exposing USB-C
chargers' non-compliance with the specification. All gone.
This is more than just link rot; it's the increasing complexity of keeping
indie content alive on the web, leading to a reliance on platforms and
time-sorted publication formats (blogs, feeds, tweets).
Of course, I have also contributed to the problem. A paper I published 7
years ago has an abstract that includes a demo link, which has since been
taken over by a spammy page with a pumpkin picture on it. Part of that lapse
was laziness: I wanted to avoid having to renew and keep a functioning web
application up year after year.
I've recommended that my students push websites to Heroku and publish
portfolios on Wix. Yet every platform with irreplaceable content dies off
some day: Geocities, LiveJournal, what.cd, now Yahoo Groups. One day, Medium,
Twitter, and even hosting services like GitHub Pages will be plundered and
then discarded when they can no longer grow or cannot find a working business
model.
The problem is multi-faceted. First, content takes effort to maintain. The
content may need updating to remain relevant, and will eventually have to be
rehosted. A lot of content, what used to be the vast majority of content, was
put up by individuals. But individuals (maybe you?) lose interest, so one day
maybe you just don't want to deal with migrating a website to a new hosting
provider.
Second, a growing set of libraries and frameworks is making the web more
sophisticated but also more complex. First came jQuery, then Bootstrap, npm,
Angular, Grunt, webpack, and more. If you are a web developer who keeps up
with the latest, that's not a problem.
But if not, maybe you are an embedded systems programmer, a startup CTO, an
enterprise Java developer, or a chemistry PhD student. Sure, you could
probably figure out how to set up some web server and toolchain, but will you
keep it up year after year, decade after decade? Probably not, and the next
year, when you encounter a package dependency problem or have to figure out
how to regenerate your HTML files, you might just throw your hands up and zip
up the files to deal with "later". Even simple technology stacks like static
site generators (e.g., Jekyll) require a workflow and will stop working at
some point. You fall into npm dependency hell and forget the command to
package a release. And having a website with multiple HTML pages is complex;
how do you keep track of how the pages link to each other? index.html.old,
Copy of about.html, index.html (1), nav.html?
Third, and this has been noted by others already (and even [2]rebutted), the
disappearance of the public web in favor of mobile and web apps, walled
gardens (Facebook pages), just-in-time WebSockets loading, and AMP decreases
the proportion of the web on the world wide web, which now seems more like a
continental web than a "world wide web".
So what can we do about these problems? It's not a problem so simple that it
can be solved in this one article. The Wayback Machine and archive.org help
keep some content around for longer. And sometimes an altruistic individual
rehosts the content elsewhere.
But the solution needs to be multi-pronged. How do we make web content that
can last and be maintained for at least 10 years? As someone studying
human-computer interaction, I naturally think of the stakeholders we aren't
supporting. Right now, putting up web content is optimized for either the
professional web developer (who uses the latest frameworks and workflows) or
the non-tech-savvy user (who uses a platform).
But I think we should consider both 1) the casual web content "maintainer",
someone who doesn't constantly stay up to date with the latest web
technologies, which means the website needs to have low maintenance needs;
and 2) the "archiver", meaning the crawlers that preserve content and the
[3]personal archivers, which means the website should be easy to save and
interpret.
So my proposal is seven unconventional guidelines for how we handle websites
designed to be informative, making them easy to maintain and preserve. The
guiding intention is that the maintainer will try to keep the website up for
at least 10 years, maybe even 20 or 30 years. These are not necessarily
controversial views, but they are aspirations that are not mainstream: a
manifesto for a long-lasting website.
1. Return to vanilla HTML/CSS
I think we've reached the point where HTML/CSS is more powerful and nicer to
use than ever before. Instead of starting with a giant template filled with
.js includes, it's now okay to just write plain HTML from scratch again. CSS
Flexbox and Grid, canvas, selectors, box-shadow, the video element, filter,
and more eliminate much of the need for JavaScript libraries. We can avoid
jQuery and Bootstrap when they're not needed. The more libraries incorporated
into the website, the more fragile it becomes. Skip the polyfills and CSS
prefixes, and stick with the CSS attributes that work across all browsers.
And validate your HTML frequently; it could save you a headache in the future
when you encounter a bug.
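As a minimal sketch of what this looks like in practice (the page content
here is hypothetical), a complete framework-free page is just hand-written
HTML with a few lines of CSS:

    <!DOCTYPE html>
    <html lang="en">
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
      <title>My Page</title>
      <style>
        /* Flexbox handles the layout that once required a framework */
        body { max-width: 40rem; margin: 0 auto; }
        nav { display: flex; gap: 1rem; }
      </style>
    </head>
    <body>
      <nav><a href="#writing">Writing</a><a href="#projects">Projects</a></nav>
      <main id="writing">All the content lives here, in one plain file.</main>
    </body>
    </html>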
2. Don't minimize that HTML
Minimizing (compressing) your HTML and associated CSS/JS seems like it saves
precious bandwidth, and all the big companies are doing it. So why not do it
too? Well, you don't save much, because your web pages should be gzipped
before being sent over the network anyway, so preemptively shrinking your
content probably saves little bandwidth, if any at all. But even if it did
save a few bytes (it's just text in the end), you now need a build process
and have to add it to your workflow, so updating the website just became more
complex. If there's a bug or future incompatibility in the HTML, the
minimized form is harder to debug. And it's unfriendly to your users: so many
people got their start with HTML by smashing that View Source button, and
minimizing your HTML prevents this ideal of learning by seeing how a page was
made. Minimizing HTML does not preserve its educational quality, and what
gets archived is only the resulting codejunk.
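To illustrate with a hypothetical snippet: the readable form below is what
View Source should show a curious reader, while the minified form on the last
line saves almost nothing once the server gzips its responses.

    <!-- Readable: easy to debug, and a future reader can learn from it -->
    <nav>
      <a href="#writing">Writing</a>
      <a href="#projects">Projects</a>
    </nav>
    <!-- Minified: same markup; gzip reclaims most of the byte savings -->
    <nav><a href="#writing">Writing</a><a href="#projects">Projects</a></nav>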
3. Prefer one page over several
Several pages are hard to maintain. You can lose track of which pages link to
what, and it also leads to some system of page templates to reduce
redundancy. How many pages can one person really maintain? Having one file,
probably just an index.html, is simple and unforgettable. Make use of that
infinite vertical scroll; you never have to dig around your files or grep to
see where some content lies. And how should you version control that file?
Should you use git? Shove old copies in an 'old/' folder? Well, I like the
simple approach of naming old files with the date they were retired, like
index.20191213.html. Using the ISO date format means the files sort easily,
and there's no confusion between American and European date formats. If I
have multiple versions in one day, I use a style similar to what's customary
in log files: index.20191213.1.html. A nice side effect is that you can then
access an older version of the file if you remember the date, without logging
into the web host.
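For instance, a site's root directory following this naming scheme might look
like the listing below (the dates are hypothetical), with retired versions
sorting chronologically on their own:

    index.html              (the live page)
    index.20180104.html     (version retired on 2018-01-04)
    index.20191213.html     (version retired on 2019-12-13)
    index.20191213.1.html   (a second version retired that same day)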
4. End all forms of hotlinking
This cautionary word seems to have disappeared from internet vocabulary, but
it's one of the reasons I've seen a perfectly good website fall apart for no
reason. Stop directly including images from other websites, stop "borrowing"
stylesheets by just linking to them, and especially stop linking to
JavaScript files, even the ones hosted by the original developers. Hotlinking
is [4]usually considered rude, since your visitors use someone else's
bandwidth; it makes the user experience slower; you let another website track
your users; and worst of all, if the location you're linking to changes its
folder structure or just goes offline, the failure cascades to your website
as well. Google Analytics is unnecessary; store your own server logs and set
up [5]GoAccess, or cut them up however you like, giving you more detailed
statistics. Don't give away your logs to Google for free.
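Concretely, the difference looks like this (the URLs and file names are
hypothetical):

    <!-- Hotlinked: breaks when the other site reorganizes or goes offline -->
    <script src="https://cdn.example.com/framework/1.2/framework.min.js"></script>
    <img src="https://other-site.example/images/photo.jpg">

    <!-- Self-hosted: lives and dies with your own site only -->
    <img src="images/photo.jpg">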
5. Stick with native fonts
We're focusing on content first, so decorative and unusual typefaces are
completely unnecessary. Stick with either the 13 web-safe fonts or a
[6]system font stack that matches the default font of your visitor's
operating system. Using the system font stack might look a bit different
between operating systems, but your layout shouldn't be so brittle that an
extra word wrap will ruin it. Then you don't have to worry about the flashing
font problem either. Your focus should be on delivering the content to the
user effectively and making the choice of font invisible, rather than getting
noticed to stroke your design ego.
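As a sketch, one widely used system font stack (the exact list is a judgment
call; see [6] for variants) is a single CSS rule, with no font files to host
or hotlink:

    body {
      /* Each operating system falls back to its own default UI font */
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
                   "Helvetica Neue", Arial, sans-serif;
    }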
6. Obsessively compress your images
Faster for your users, less space to archive, and easier to maintain when you
don't have to back up a humongous folder. Your images can keep the same high
quality but be smaller. [7]Minify your SVGs, losslessly compress your PNGs,
and generate JPEGs to exactly fit the width at which they are displayed. It's
worth spending some time figuring out the best way to compress and [8]reduce
the size of your images without losing quality. And once [9]WebP gains
support in Safari, switch over to that format. Ruthlessly minimize the total
size of your website and keep it as small as possible. Every MB can cost
someone real money; in fact, my mobile carrier (Google Fi) charges a cent per
MB, so a 25 MB website, which is fairly common nowadays, costs a quarter by
itself, about as much as a newspaper when I was a child.
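In the meantime, the standard picture element can already serve WebP to
browsers that understand the format while falling back to a JPEG everywhere
else (the file names here are hypothetical):

    <picture>
      <source srcset="photo.webp" type="image/webp">
      <!-- Browsers without WebP support (e.g., older Safari) use the JPEG -->
      <img src="photo.jpg" width="640" height="480" alt="description of the photo">
    </picture>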
7. Eliminate the broken URL risk
There are [10]monitoring services that will tell you when your URL is down,
preventing you from one day realizing that your homepage hasn't been loading
for a month and the search engines have deindexed it. After all, 10 years is
longer than most hard drives or operating systems are meant to last. But to
eliminate the risk of a URL breaking silently, set up a second monitoring
service. If the first one stops working for any reason (it moves to a pay
model, it shuts down, you forget to renew something, etc.), you will still
get a notification from the other service when your URL is down; and if one
monitoring service dies, the missing second notification tells you so.
Remember that we're trying to keep something up for over 10 years (ideally
way longer, even 30 years), and a lot of services will shut down during that
period, so two monitoring services is safer.
After doing these things, go ahead and place a bit of text in the footer, "The
page was designed to last", linking to this page explaining what that means.
The words promise that the maintainer will do their best to follow the ideas in
this manifesto.
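In HTML, that footer is a single line, linking to the URL in reference [25]:

    <footer>
      This page is <a href="http://jeffhuang.com/designed_to_last/">designed
      to last</a>.
    </footer>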
Before you protest, this is obviously not for web applications. If you are
making an application, then make your web or mobile app with the workflow you
need. I don't even know of any web applications that have remained
functioning for over 10 years, so it seems like a lost cause anyway (except
Philip Guo's Python Tutor, due to his [11]minimalist strategy for maintaining
it). It's also not for websites maintained by an organization like Wikipedia
or Twitter. The salaries for an IT team are probably enough to keep a website
alive for a while.
In fact, it's not even that important that you strictly follow the 7 "rules",
as they're more of a provocation than strict rules.
But let's say some small part of the web starts designing websites to last,
for content that is meant to last. What happens then? Well, people may prefer
to link to them since they carry a promise of working in the future. People
more generally may become more mindful of making their pages permanent. And
users and archivers both save bandwidth when visiting and storing these
pages. The effects are long-term, but the achievements are incremental and
can be implemented by website owners without depending on anyone else or
waiting for a network effect. You can do this now for your website, and that
alone would be a positive outcome. Like using a recycled shopping bag instead
of taking a plastic one, it's a small individual action.
This article is meant to provoke and lead to individual action, not propose a
complete solution to the decaying web. It's a small simple step for a complex
sociotechnical system. So I'd love to see this happen. I intend to keep this
page up for at least 10 years.
If you are interested in receiving updates to [12]irchiver, our project for a
personal archive of the web pages you visit, please [13]subscribe here.
Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie Massachi,
Alexandra Papoutsaki, my colleagues James Tompkin, Stephen Bach, my teaching
assistant Kathleen Chai, and my research assistant Yusuf Karim for feedback on
earlier drafts.
See discussions on [14]Hacker News and [15]reddit /r/programming
Also in this series
[16]Behind the scenes: the struggle for each paper to get published
[17]Illustrative notes for obsessing over publishing aesthetics
Other articles I've written
[18]My productivity app is a never-ending .txt file
[19]The Coronavirus pandemic has changed our sleep behavior
[20]Extracting data from tracking devices by going to the cloud
[21]CS Faculty Composition and Hiring Trends
[22]Bias in Computer Science Rankings
[23]Who Wins CS Best Paper Awards?
[24]Verified Computer Science Ph.D. Stipends
This page is [25]designed to last.
References:
[1] https://jeffhuang.com/
[2] https://gomakethings.com/the-web-is-not-dying/
[3] https://archivebox.io/
[4] https://webmasters.stackexchange.com/questions/25315/hotlinking-what-is-it-and-why-shouldnt-people-do-it
[5] https://goaccess.io/
[6] https://systemfontstack.com/
[7] https://victorzhou.com/blog/minify-svgs/
[8] https://evilmartians.com/chronicles/images-done-right-web-graphics-good-to-the-last-byte-optimization-techniques
[9] https://caniuse.com/#feat=webp
[10] https://uptimerobot.com/
[11] https://pg.ucsd.edu/publications/Python-Tutor-scalable-sustainable-research-software_UIST-2021.pdf
[12] https://irchiver.com/
[13] https://docs.google.com/forms/d/e/1FAIpQLSeTCgnwF1gjrc1O8mfJ_5TmT_TLowFQ2DUhsollmqPG84pAFQ/viewform?usp=pp_url&entry.1299571007=irchiver:+your+full-resolution+personal+web+archive+and+search&entry.1760653896=designed_to_last
[14] https://news.ycombinator.com/item?id=21840140
[15] https://www.reddit.com/r/programming/comments/ed88ra/this_page_is_designed_to_last_a_manifesto_for/
[16] https://jeffhuang.com/struggle_for_each_paper/
[17] https://jeffhuang.com/illustrative-notes-for-publishing-aesthetics/
[18] https://jeffhuang.com/productivity_text_file/
[19] https://jeffhuang.com/covid_sleep/
[20] https://jeffhuang.com/extracting_data_from_tracking_devices/
[21] https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends
[22] https://jeffhuang.com/computer-science-open-data/#bias-in-computer-science-rankings
[23] https://jeffhuang.com/computer-science-open-data/#who-wins-cs-best-paper-awards
[24] https://jeffhuang.com/computer-science-open-data/#verified-computer-science-phd-stipends
[25] http://jeffhuang.com/designed_to_last/