A Manifesto for Preserving Content on the Web
This Page is Designed to Last
By [1]Jeff Huang, published 2019-12-19, updated 2021-08-24
The end of the year is an opportunity to clean up and reset for the
upcoming new semester. I found myself clearing out old bookmarks—yes,
bookmarks: that formerly beloved browser feature that seems to have
lost the battle to 'address bar autocomplete'. But this nostalgic act
of tidying led me to despair.
Bookmark after bookmark led to dead link after dead link. What's
vanished: unique pieces of writing on kuro5hin about tech culture; a
collection of mathematical puzzles and their associated discussion by
academics that my father introduced me to; Woodman's Reverse
Engineering tutorials from my high school years, where I first tasted
the feeling of control over software; even my most recent bookmark, a
series of posts on Google+ exposing usb-c chargers' non-compliance with
the specification, all disappeared.
This is more than just link rot; it's the increasing complexity of
keeping alive indie content on the web, leading to a reliance on
platforms and time-sorted publication formats (blogs, feeds, tweets).
Of course, I have also contributed to the problem. A paper I published
7 years ago has an abstract that includes a demo link, which has been
taken over by a spammy page with a pumpkin picture on it. Part of that
lapse was laziness to avoid having to renew and keep a functioning web
application up year after year.
I've recommended that my students push websites to Heroku, and publish
portfolios on Wix. Yet every platform with irreplaceable content dies
off some day. Geocities, LiveJournal, what.cd, now Yahoo Groups. One
day, Medium, Twitter, and even hosting services like GitHub Pages will
be plundered then discarded when they can no longer grow or cannot find
a working business model.
The problem is multi-faceted. First, content takes effort to maintain.
The content may need updating to remain relevant, and will eventually
have to be rehosted. A lot of content, what used to be the vast
majority of content, was put up by individuals. But individuals (maybe
you?) lose interest, so one day maybe you just don't want to deal with
migrating a website to a new hosting provider.
Second, a growing set of libraries and frameworks are making the web
more sophisticated but also more complex. First came jquery, then
bootstrap, npm, angular, grunt, webpack, and more. If you are a web
developer who is keeping up with the latest, then that's not a problem.
But if not, maybe you are an embedded systems programmer or startup CTO
or enterprise Java developer or chemistry PhD student, sure you could
probably figure out how to set up some web server and toolchain, but
will you keep this up year after year, decade after decade? Probably
not, and the next year, when you encounter a package dependency
problem or have to figure out how to regenerate your html files, you might just
throw your hands up and zip up the files to deal with "later". Even
simple technology stacks like static site generators (e.g., Jekyll)
require a workflow and will stop working at some point. You fall into
npm dependency hell, and forget the command to package a release. And
having a website with multiple html pages is complex; how would you
know how each page links to each other? index.html.old, Copy of
about.html, index.html (1), nav.html?
Third, and this has been touted by others already (and even
[2]rebutted), the disappearance of the public web in favor of mobile
and web apps, walled gardens (Facebook pages), just-in-time WebSockets
loading, and AMP decreases the proportion of the web on the world wide
web, which now seems more like a continental web than a "world wide
web".
So for these problems, what can we do about it? It's not such a simple
problem that can be solved in this one article. The Wayback Machine and
archive.org help keep some content around for longer. And sometimes an
altruistic individual rehosts the content elsewhere.
But the solution needs to be multi-pronged. How do we make web content
that can last and be maintained for at least 10 years? As someone
studying human-computer interaction, I naturally think of the
stakeholders we aren't supporting. Right now putting up web content is
optimized for either the professional web developer (who uses the latest
frameworks and workflows) or the non-tech savvy user (who uses a
platform).
But I think we should consider both 1) the casual web content
"maintainer", someone who doesn't constantly stay up to date with the
latest web technologies, which means the website needs to have low
maintenance needs; and 2) the crawlers who preserve the content and
[3]personal archivers, the "archiver", which means the website should
be easy to save and interpret.
So my proposal is seven unconventional guidelines in how we handle
websites designed to be informative, to make them easy to maintain and
preserve. The guiding intention is that the maintainer will try to keep
the website up for at least 10 years, maybe even 20 or 30 years. These
are not controversial views necessarily, but are aspirations that are
not mainstream—a manifesto for a long-lasting website.
1. Return to vanilla HTML/CSS. I think we've reached the point where
html/css is more powerful, and nicer to use than ever before.
Instead of starting with a giant template filled with .js includes,
it's now okay to just write plain HTML from scratch again. CSS
Flexbox and Grid, canvas, Selectors, box-shadow, the video element,
filter, etc. eliminate a lot of the need for JavaScript libraries.
We can avoid jquery and bootstrap when they're not needed. The more
libraries incorporated into the website, the more fragile it
becomes. Skip the polyfills and CSS prefixes, and stick with the
CSS properties that work across all browsers. And frequently
validate your HTML; it could save you a headache in the future when
you encounter a bug.
2. Don't minimize that HTML. Minimizing (compressing) your HTML and
associated CSS/JS seems like it saves precious bandwidth and all
the big companies are doing it. But why not? Well, you don't save
much because your web pages should be gzipped before being sent
over the network, so preemptively shrinking your content probably
doesn't do much to save bandwidth if anything at all. But even if
it did save a few bytes (it's just text in the end), you now need
to have a build process and to add this to your workflow, so
updating a website just became more complex. If there's a bug or
future incompatibility in the html, the minimized form is harder to
debug. And it's unfriendly to your users; so many people got their
start with HTML by smashing that View Source button, and minimizing
your HTML prevents this ideal of learning by seeing what they did.
Minimizing HTML does not preserve its educational quality, and what
gets archived is only the resulting codejunk.
3. Prefer one page over several. Several pages are hard to maintain.
You can lose track of which pages link to what, and it also leads
to some system of page templates to reduce redundancy. How many
pages can one person really maintain? Having one file, probably
just an index.html, is simple and unforgettable. Make use of that
infinite vertical scroll. You never have to dig around your files
or grep to see where some content lies. And how should you version
control that file? Should you use git? Shove it in an 'old/'
folder? Well I like the simple approach of naming old files with
the date they are retired, like index.20191213.html. Using the ISO
format of the date makes it so that it sorts easily, and there's no
confusion between American and European date formats. If I have
multiple versions in one day, I would use a style similar to that
which is customary in log files, of index.20191213.1.html. A nice
side effect is then you can access an older version of the file if
you remember the date, without logging into the web host.
4. End all forms of hotlinking. This cautionary word seems to have
disappeared from internet vocabulary, but it's one of the reasons
I've seen a perfectly good website fall apart for no reason. Stop
directly including images from other websites, stop "borrowing"
stylesheets by just linking to them, and especially stop linking to
JavaScript files, even the ones hosted by the original developers.
Hotlinking is [4]usually considered rude since your visitors use
someone else's bandwidth, it makes the user experience slower, you
let another website track your users, and worst of all if the
location you're linking to changes their folder structure or just
goes offline, then the failure cascades to your website as well.
Google Analytics is unnecessary; store your own server logs and set
up [5]GoAccess or cut them up however you like, giving you more
detailed statistics. Don't give away your logs to Google for free.
5. Stick with native fonts. We're focusing on content first, so
decorative and unusual typefaces are completely unnecessary. Stick
with either the 13 web-safe fonts or a [6]system font stack that
matches the default font to the operating system of your visitor.
Using the system font stack might look a bit different between
operating systems, but your layout shouldn't be so brittle that an
extra word wrap will ruin it. Then you don't have to worry about
the flashing font problem either. Your focus should be about
delivering the content to the user effectively and making the
choice of font be invisible, rather than getting noticed to stroke
your design ego.
6. Obsessively compress your images. Faster for your users, less
space to archive, and easier to maintain when you don't have to
back up a humongous folder. Your images can have the same high
quality, but be smaller. [7]Minify your SVGs, losslessly compress
your PNGs, generate JPEGs to exactly fit the width of the image.
It's worth spending some time figuring out the most optimal way to
compress and [8]reduce the size of your images without losing
quality. And once [9]WebP gains support on Safari, switch over to
that format. Ruthlessly minimize the total size of your website and
keep it as small as possible. Every MB can cost someone real money,
and in fact, my mobile carrier (Google Fi) charges a cent per MB,
so a 25 MB website which is fairly common nowadays, costs a quarter
itself, about as much as a newspaper when I was a child.
7. Eliminate the broken URL risk. There are [10]monitoring services
that will tell you when your URL is down, preventing you from
realizing one day that your homepage hasn't been loading for a
month and the search engines have deindexed it. Because 10 years is
longer than most hard drives or operating systems are meant to
last. But to eliminate the risk of a URL breaking completely, set
up a second monitoring service. Because if the first one stops for
any reason (they move to a pay model, they shut down, you forget to
renew something, etc.) you will still get one notification when
your URL is down, then realize the other monitoring service is down
because you didn't get the second notification. Remember that we're
trying to keep something up for over 10 years (ideally way longer,
even 30 years), and a lot of services will shut down during this
period, so two monitoring services is safer.
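The dated-file naming from guideline 3 can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name retired_name is hypothetical, and it assumes the retired copy lives next to the live file. It builds a name like index.20191213.html and falls back to index.20191213.1.html, index.20191213.2.html, and so on when a copy from the same day already exists.

```python
from datetime import date
from pathlib import Path


def retired_name(current: Path, retired_on: date) -> Path:
    """Return an unused dated name for a file that is being retired.

    index.html retired on 2019-12-13 becomes index.20191213.html; if that
    name is already taken, a counter is appended: index.20191213.1.html.
    (Hypothetical helper, not something the article itself provides.)
    """
    # YYYYMMDD sorts chronologically and avoids US/European date ambiguity.
    stamp = retired_on.strftime("%Y%m%d")
    candidate = current.with_name(f"{current.stem}.{stamp}{current.suffix}")
    counter = 1
    while candidate.exists():
        candidate = current.with_name(
            f"{current.stem}.{stamp}.{counter}{current.suffix}"
        )
        counter += 1
    return candidate
```

Copying the live file to the returned path before editing it, for example with shutil.copy2(page, retired_name(page, date.today())), keeps every old version reachable by date without any version control tooling.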
After doing these things, go ahead and place a bit of text in the
footer, "The page was designed to last", linking to this page
explaining what that means. The words promise that the maintainer will
do their best to follow the ideas in this manifesto.
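Before adding that footer, one way to check a page against guideline 4 is to list every resource it loads from another host. The following is a rough sketch using only Python's standard library. The names find_hotlinks, HotlinkFinder and EMBED_ATTRS are hypothetical, the list of tags is an assumption and not exhaustive, and hosts are compared literally, so www.example.com and example.com count as different sites.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tags whose URL attribute is fetched as part of rendering the page.
# Plain <a href> links are navigation, not hotlinks, so they are left out.
# This mapping is an illustrative assumption, not a complete list.
EMBED_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources served by another host."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host.lower()
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        wanted = EMBED_ATTRS.get(tag)
        if wanted is None:
            return
        attr_map = dict(attrs)
        # Only stylesheet links are counted; canonical/alternate are ignored.
        if tag == "link":
            rel = (attr_map.get("rel") or "").lower().split()
            if "stylesheet" not in rel:
                return
        value = attr_map.get(wanted)
        if not value:
            return
        # Relative URLs have no hostname and are therefore local.
        host = urlparse(value).hostname
        if host and host.lower() != self.own_host:
            self.hotlinks.append((tag, value))


def find_hotlinks(html, own_host):
    """Return every embedded resource in html that is not on own_host."""
    finder = HotlinkFinder(own_host)
    finder.feed(html)
    finder.close()
    return finder.hotlinks
```

Calling find_hotlinks(open("index.html").read(), "example.com") on a single-page site should return an empty list once every image, stylesheet and script has been copied to your own server.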
Before you protest, this is obviously not for web applications. If you
are making an application, then make your web or mobile app with the
workflow you need. I don't even know any web applications that have
remained similarly functioning over 10 years so it seems like a lost
cause anyway (except Philip Guo's python tutor, due to his
[11]minimalist strategy for maintaining it). It's also not for websites
maintained by an organization like Wikipedia or Twitter. The salaries
for an IT team are probably enough to keep a website alive for a while.
In fact, it's not even that important you strictly follow the 7
"rules", as they're more of a provocation than strict rules.
But let's say some small part of the web starts designing websites to
last for content that is meant to last. What happens then? Well, people
may prefer to link to them since they have a promise of working in the
future. People more generally may be more mindful of making their pages
more permanent. And users and archivers both save bandwidth when
visiting and storing these pages.
The effects are long term, but the achievements are incremental and can
be implemented by website owners without being dependent on anyone else
or waiting for a network effect. You can do this now for your website,
and that already would be a positive outcome. Like using a recycled
shopping bag instead of taking a plastic one, it's a small individual
action.
This article is meant to provoke and lead to individual action, not
propose a complete solution to the decaying web. It's a small simple
step for a complex sociotechnical system. So I'd love to see this
happen. I intend to keep this page up for at least 10 years.
If you are interested in receiving updates to [12]irchiver, our project
for a personal archive of the web pages you visit, please [13]subscribe
here.
Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie
Massachi, Alexandra Papoutsaki, my colleagues James Tompkin, Stephen
Bach, my teaching assistant Kathleen Chai, and my research assistant
Yusuf Karim for feedback on earlier drafts.
See discussions on [14]Hacker News and [15]reddit /r/programming
Also in this series
[16]Behind the scenes: the struggle for each paper to get published
[17]Illustrative notes for obsessing over publishing aesthetics
Other articles I've written
[18]My productivity app is a never-ending .txt file
[19]The Coronavirus pandemic has changed our sleep behavior
[20]Extracting data from tracking devices by going to the cloud
[21]CS Faculty Composition and Hiring Trends
[22]Bias in Computer Science Rankings
[23]Who Wins CS Best Paper Awards?
[24]Verified Computer Science Ph.D. Stipends
This page is [25]designed to last.
References
1. https://jeffhuang.com/
2. https://gomakethings.com/the-web-is-not-dying/
3. https://archivebox.io/
4. https://webmasters.stackexchange.com/questions/25315/hotlinking-what-is-it-and-why-shouldnt-people-do-it
5. https://goaccess.io/
6. https://systemfontstack.com/
7. https://victorzhou.com/blog/minify-svgs/
8. https://evilmartians.com/chronicles/images-done-right-web-graphics-good-to-the-last-byte-optimization-techniques
9. https://caniuse.com/#feat=webp
10. https://uptimerobot.com/
11. https://pg.ucsd.edu/publications/Python-Tutor-scalable-sustainable-research-software_UIST-2021.pdf
12. https://irchiver.com/
13. https://docs.google.com/forms/d/e/1FAIpQLSeTCgnwF1gjrc1O8mfJ_5TmT_TLowFQ2DUhsollmqPG84pAFQ/viewform?usp=pp_url&entry.1299571007=Irchiver:+browser+history+search&entry.1760653896=designed_to_last
14. https://news.ycombinator.com/item?id=21840140
15. https://www.reddit.com/r/programming/comments/ed88ra/this_page_is_designed_to_last_a_manifesto_for/
16. https://jeffhuang.com/struggle_for_each_paper/
17. https://jeffhuang.com/illustrative-notes-for-publishing-aesthetics/
18. https://jeffhuang.com/productivity_text_file/
19. https://jeffhuang.com/covid_sleep/
20. https://jeffhuang.com/extracting_data_from_tracking_devices/
21. https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends
22. https://jeffhuang.com/computer-science-open-data/#bias-in-computer-science-rankings
23. https://jeffhuang.com/computer-science-open-data/#who-wins-cs-best-paper-awards
24. https://jeffhuang.com/computer-science-open-data/#verified-computer-science-phd-stipends
25. http://jeffhuang.com/designed_to_last/