Move web archives into dedicated directory
static/archive/jeffhuang-com-njdbjn.txt
@@ -0,0 +1,291 @@
A Manifesto for Preserving Content on the Web

This Page is Designed to Last

By [1]Jeff Huang, published 2019-12-19, updated 2021-08-24

The end of the year is an opportunity to clean up and reset for the
upcoming new semester. I found myself clearing out old bookmarks. Yes,
bookmarks: that formerly beloved browser feature that seems to have
lost the battle to 'address bar autocomplete'. But this nostalgic act
of tidying led me to despair.

Bookmark after bookmark led to dead link after dead link. What has
vanished: unique pieces of writing on kuro5hin about tech culture; a
collection of mathematical puzzles and their associated discussion by
academics that my father introduced me to; Woodman's Reverse
Engineering tutorials from my high school years, where I first tasted
the feeling of control over software; even my most recent bookmark, a
series of posts on Google+ exposing USB-C chargers' non-compliance with
the specification. All of it has disappeared.

This is more than just link rot; it's the increasing complexity of
keeping indie content alive on the web, leading to a reliance on
platforms and time-sorted publication formats (blogs, feeds, tweets).

Of course, I have also contributed to the problem. A paper I published
7 years ago has an abstract that includes a demo link, which has been
taken over by a spammy page with a pumpkin picture on it. Part of that
lapse was laziness: I wanted to avoid having to renew and keep a
functioning web application up year after year.

I've recommended that my students push websites to Heroku and publish
portfolios on Wix. Yet every platform with irreplaceable content dies
off some day. Geocities, LiveJournal, what.cd, now Yahoo Groups. One
day, Medium, Twitter, and even hosting services like GitHub Pages will
be plundered and then discarded when they can no longer grow or cannot
find a working business model.

The problem is multi-faceted. First, content takes effort to maintain.
The content may need updating to remain relevant, and will eventually
have to be rehosted. A lot of content, what used to be the vast
majority of it, was put up by individuals. But individuals (maybe
you?) lose interest, so one day maybe you just don't want to deal with
migrating a website to a new hosting provider.

Second, a growing set of libraries and frameworks is making the web
more sophisticated but also more complex. First came jQuery, then
Bootstrap, npm, Angular, Grunt, webpack, and more. If you are a web
developer who is keeping up with the latest, then that's not a problem.

But maybe you are not. Maybe you are an embedded systems programmer or
startup CTO or enterprise Java developer or chemistry PhD student.
Sure, you could probably figure out how to set up some web server and
toolchain, but will you keep this up year after year, decade after
decade? Probably not. The next year, when you encounter a package
dependency problem or have to figure out how to regenerate your HTML
files, you might just throw your hands up and zip up the files to deal
with "later". Even simple technology stacks like static site
generators (e.g., Jekyll) require a workflow and will stop working at
some point. You fall into npm dependency hell, and forget the command
to package a release. And having a website with multiple HTML pages is
complex; how would you know how the pages link to each other?
index.html.old, Copy of about.html, index.html (1), nav.html?

Third, and this has been touted by others already (and even
[2]rebutted), the disappearance of the public web in favor of mobile
and web apps, walled gardens (Facebook pages), just-in-time WebSockets
loading, and AMP decreases the proportion of the web on the world wide
web, which now seems more like a continental web than a "world wide
web".

So what can we do about these problems? This is not a problem simple
enough to be solved in one article. The Wayback Machine and
archive.org help keep some content around for longer. And sometimes an
altruistic individual rehosts the content elsewhere.

But the solution needs to be multi-pronged. How do we make web content
that can last and be maintained for at least 10 years? As someone
studying human-computer interaction, I naturally think of the
stakeholders we aren't supporting. Right now putting up web content is
optimized for either the professional web developer (who uses the
latest frameworks and workflows) or the non-tech-savvy user (who uses
a platform).

But I think we should consider both 1) the casual web content
"maintainer", someone who doesn't constantly stay up to date with the
latest web technologies, which means the website needs to have low
maintenance needs; and 2) the crawlers that preserve the content and
[3]personal archivers, the "archiver", which means the website should
be easy to save and interpret.

So my proposal is seven unconventional guidelines for how we handle
websites designed to be informative, to make them easy to maintain and
preserve. The guiding intention is that the maintainer will try to keep
the website up for at least 10 years, maybe even 20 or 30 years. These
are not necessarily controversial views, but they are aspirations that
are not mainstream: a manifesto for a long-lasting website.
1. Return to vanilla HTML/CSS – I think we've reached the point where
   HTML/CSS is more powerful and nicer to use than ever before.
   Instead of starting with a giant template filled with .js includes,
   it's now okay to just write plain HTML from scratch again. CSS
   Flexbox and Grid, canvas, Selectors, box-shadow, the video element,
   filter, etc. eliminate a lot of the need for JavaScript libraries.
   We can avoid jQuery and Bootstrap when they're not needed. The more
   libraries incorporated into the website, the more fragile it
   becomes. Skip the polyfills and CSS prefixes, and stick with the
   CSS properties that work across all browsers. And frequently
   validate your HTML; it could save you a headache in the future when
   you encounter a bug.
2. Don't minimize that HTML – minimizing (compressing) your HTML and
   associated CSS/JS seems like it saves precious bandwidth, and all
   the big companies are doing it. So why not? Well, you don't save
   much because your web pages should be gzipped before being sent
   over the network, so preemptively shrinking your content probably
   does little, if anything, to save bandwidth. But even if it did
   save a few bytes (it's just text in the end), you now need to have
   a build process and to add this to your workflow, so updating a
   website just became more complex. If there's a bug or future
   incompatibility in the HTML, the minimized form is harder to
   debug. And it's unfriendly to your users; so many people got their
   start with HTML by smashing that View Source button, and minimizing
   your HTML prevents this ideal of learning by seeing what others
   did. Minimizing HTML does not preserve its educational quality, and
   what gets archived is only the resulting codejunk.
3. Prefer one page over several – several pages are hard to maintain.
   You can lose track of which pages link to what, and it also leads
   to some system of page templates to reduce redundancy. How many
   pages can one person really maintain? Having one file, probably
   just an index.html, is simple and unforgettable. Make use of that
   infinite vertical scroll. You never have to dig around your files
   or grep to see where some content lies. And how should you version
   control that file? Should you use git? Shove old versions in an
   'old/' folder? Well, I like the simple approach of naming old files
   with the date they are retired, like index.20191213.html. Using the
   ISO format of the date means the files sort easily, and there's no
   confusion between American and European date formats. If I have
   multiple versions in one day, I would use a style similar to that
   which is customary in log files, of index.20191213.1.html. A nice
   side effect is that you can then access an older version of the
   file if you remember the date, without logging into the web host.
4. End all forms of hotlinking – this cautionary word seems to have
   disappeared from internet vocabulary, but it's one of the reasons
   I've seen a perfectly good website fall apart for no apparent
   reason. Stop directly including images from other websites, stop
   "borrowing" stylesheets by just linking to them, and especially
   stop linking to JavaScript files, even the ones hosted by the
   original developers. Hotlinking is [4]usually considered rude since
   your visitors use someone else's bandwidth, it makes the user
   experience slower, you let another website track your users, and
   worst of all, if the location you're linking to changes its folder
   structure or just goes offline, then the failure cascades to your
   website as well. Google Analytics is unnecessary; store your own
   server logs and set up [5]GoAccess or cut them up however you like,
   giving you more detailed statistics. Don't give away your logs to
   Google for free.
5. Stick with native fonts – we're focusing on content first, so
   decorative and unusual typefaces are completely unnecessary. Stick
   with either the 13 web-safe fonts or a [6]system font stack that
   matches the default font to the operating system of your visitor.
   Using the system font stack might look a bit different between
   operating systems, but your layout shouldn't be so brittle that an
   extra word wrap will ruin it. Then you don't have to worry about
   the flashing font problem either. Your focus should be on
   delivering the content to the user effectively and making the
   choice of font invisible, rather than getting noticed to stroke
   your design ego.
6. Obsessively compress your images – faster for your users, less
   space to archive, and easier to maintain when you don't have to
   back up a humongous folder. Your images can have the same high
   quality, but be smaller. [7]Minify your SVGs, losslessly compress
   your PNGs, generate JPEGs to exactly fit the width of the image.
   It's worth spending some time figuring out the best way to
   compress and [8]reduce the size of your images without losing
   quality. And once [9]WebP gains support on Safari, switch over to
   that format. Ruthlessly minimize the total size of your website and
   keep it as small as possible. Every MB can cost someone real money,
   and in fact, my mobile carrier (Google Fi) charges a cent per MB,
   so a 25 MB website, which is fairly common nowadays, costs a
   quarter by itself, about as much as a newspaper when I was a child.
7. Eliminate the broken URL risk – there are [10]monitoring services
   that will tell you when your URL is down, preventing you from
   realizing one day that your homepage hasn't been loading for a
   month and the search engines have deindexed it. After all, 10 years
   is longer than most hard drives or operating systems are meant to
   last. But to eliminate the risk of a URL breaking completely, set
   up a second monitoring service. If the first one stops for any
   reason (they move to a pay model, they shut down, you forget to
   renew something, etc.), you will still get one notification when
   your URL is down, and then realize the other monitoring service is
   down because you didn't get the second notification. Remember that
   we're trying to keep something up for over 10 years (ideally way
   longer, even 30 years), and a lot of services will shut down during
   this period, so two monitoring services are safer.

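The file naming scheme from guideline 3 can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
the author's own tooling: the function names retired_name and retire
are made up here, and the sketch assumes old versions are kept as
copies in the same folder as the live file.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional, Set


def retired_name(path: Path, day: date, existing: Set[str]) -> str:
    """Build a name like index.20191213.html for a retired file.

    If that name is already taken, append a counter in the style of
    rotated log files: index.20191213.1.html, index.20191213.2.html, ...
    """
    stamp = day.strftime("%Y%m%d")  # year-month-day, so names sort by date
    candidate = f"{path.stem}.{stamp}{path.suffix}"
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = f"{path.stem}.{stamp}.{counter}{path.suffix}"
    return candidate


def retire(path: Path, day: Optional[date] = None) -> Path:
    """Copy the current file to its dated name next to the original."""
    day = day or date.today()
    existing = {p.name for p in path.parent.iterdir()}
    target = path.with_name(retired_name(path, day, existing))
    shutil.copy2(path, target)  # copy2 also keeps the modification time
    return target
```

Calling retire(Path("index.html")) before editing would leave a dated
copy beside the live page, and a plain directory listing then shows the
old versions in chronological order.
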
After doing these things, go ahead and place a bit of text in the
footer, "The page was designed to last", linking to this page, which
explains what that means. The words promise that the maintainer will
do their best to follow the ideas in this manifesto.

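Before adding that footer, guideline 4 is one of the few that can be
checked mechanically. The following is a rough sketch using only the
Python standard library, assuming the page is a single HTML file and
that "hotlinking" means an image, script, or stylesheet loaded from a
different host. The names HotlinkFinder and find_hotlinks and the
example hosts are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Tag -> attribute that makes the browser fetch a resource.
LOADING_ATTRS = {"img": "src", "script": "src", "link": "href"}


class HotlinkFinder(HTMLParser):
    """Collect (tag, url) pairs for resources loaded from another host."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host.lower()
        self.hotlinks = []

    def handle_starttag(self, tag, attrs):
        attr_name = LOADING_ATTRS.get(tag)
        if attr_name is None:
            return  # ordinary <a href> links are not hotlinks
        values = dict(attrs)
        # Only <link rel="stylesheet"> loads a stylesheet; skip other rels.
        if tag == "link" and "stylesheet" not in (values.get("rel") or "").lower():
            return
        url = values.get(attr_name) or ""
        host = urlparse(url).netloc.lower()
        # Relative URLs have no host, so they are served by this site.
        if host and host != self.own_host:
            self.hotlinks.append((tag, url))


def find_hotlinks(html, own_host):
    """Return every externally hosted image, script, or stylesheet."""
    finder = HotlinkFinder(own_host)
    finder.feed(html)
    finder.close()
    return finder.hotlinks
```

An empty result from find_hotlinks(page_source, "example.com") would
suggest the page loads these three kinds of resources only from its own
host. It does not cover fonts, iframes, or URLs inside CSS.
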
Before you protest, this is obviously not for web applications. If you
are making an application, then make your web or mobile app with the
workflow you need. I don't even know of any web applications that have
kept functioning similarly for over 10 years, so it seems like a lost
cause anyway (except Philip Guo's Python Tutor, due to his
[11]minimalist strategy for maintaining it). It's also not for websites
maintained by an organization like Wikipedia or Twitter. The salaries
for an IT team are probably enough to keep a website alive for a while.

In fact, it's not even that important that you strictly follow the 7
"rules", as they're more of a provocation than strict rules.

But let's say some small part of the web starts designing websites to
last for content that is meant to last. What happens then? Well, people
may prefer to link to them since they have a promise of working in the
future. People more generally may be more mindful of making their pages
more permanent. And users and archivers both save bandwidth when
visiting and storing these pages.

The effects are long term, but the achievements are incremental and can
be implemented by website owners without being dependent on anyone else
or waiting for a network effect. You can do this now for your website,
and that already would be a positive outcome. Like using a recycled
shopping bag instead of taking a plastic one, it's a small individual
action.

This article is meant to provoke and lead to individual action, not
propose a complete solution to the decaying web. It's a small simple
step for a complex sociotechnical system. So I'd love to see this
happen. I intend to keep this page up for at least 10 years.

If you are interested in receiving updates to [12]irchiver, our project
for a personal archive of the web pages you visit, please [13]subscribe
here.

Thanks to my Ph.D. students Shaun Wallace, Nediyana Daskalova, Talie
Massachi, Alexandra Papoutsaki, my colleagues James Tompkin, Stephen
Bach, my teaching assistant Kathleen Chai, and my research assistant
Yusuf Karim for feedback on earlier drafts.

See discussions on [14]Hacker News and [15]reddit /r/programming

Also in this series

[16]Behind the scenes: the struggle for each paper to get published

[17]Illustrative notes for obsessing over publishing aesthetics

Other articles I've written

[18]My productivity app is a never-ending .txt file

[19]The Coronavirus pandemic has changed our sleep behavior

[20]Extracting data from tracking devices by going to the cloud

[21]CS Faculty Composition and Hiring Trends

[22]Bias in Computer Science Rankings

[23]Who Wins CS Best Paper Awards?

[24]Verified Computer Science Ph.D. Stipends

This page is [25]designed to last.

References

1. https://jeffhuang.com/
2. https://gomakethings.com/the-web-is-not-dying/
3. https://archivebox.io/
4. https://webmasters.stackexchange.com/questions/25315/hotlinking-what-is-it-and-why-shouldnt-people-do-it
5. https://goaccess.io/
6. https://systemfontstack.com/
7. https://victorzhou.com/blog/minify-svgs/
8. https://evilmartians.com/chronicles/images-done-right-web-graphics-good-to-the-last-byte-optimization-techniques
9. https://caniuse.com/#feat=webp
10. https://uptimerobot.com/
11. https://pg.ucsd.edu/publications/Python-Tutor-scalable-sustainable-research-software_UIST-2021.pdf
12. https://irchiver.com/
13. https://docs.google.com/forms/d/e/1FAIpQLSeTCgnwF1gjrc1O8mfJ_5TmT_TLowFQ2DUhsollmqPG84pAFQ/viewform?usp=pp_url&entry.1299571007=Irchiver:+browser+history+search&entry.1760653896=designed_to_last
14. https://news.ycombinator.com/item?id=21840140
15. https://www.reddit.com/r/programming/comments/ed88ra/this_page_is_designed_to_last_a_manifesto_for/
16. https://jeffhuang.com/struggle_for_each_paper/
17. https://jeffhuang.com/illustrative-notes-for-publishing-aesthetics/
18. https://jeffhuang.com/productivity_text_file/
19. https://jeffhuang.com/covid_sleep/
20. https://jeffhuang.com/extracting_data_from_tracking_devices/
21. https://jeffhuang.com/computer-science-open-data/#cs-faculty-composition-and-hiring-trends
22. https://jeffhuang.com/computer-science-open-data/#bias-in-computer-science-rankings
23. https://jeffhuang.com/computer-science-open-data/#who-wins-cs-best-paper-awards
24. https://jeffhuang.com/computer-science-open-data/#verified-computer-science-phd-stipends
25. http://jeffhuang.com/designed_to_last/