mdrenum post

2023-11-14 22:33:33 -05:00
parent 0b7eb06a76
commit 73c1187069
2 changed files with 377 additions and 0 deletions
--- a/content/journal/keep-markdown-links-in-order-with-mdrenum/index.md
+++ b/content/journal/keep-markdown-links-in-order-with-mdrenum/index.md
@@ -0,0 +1,91 @@
 ---
 title: "Keep Markdown Links in Order With mdrenum"
 date: 2023-11-14T22:06:48-05:00
 draft: false
 references:
 - title: "Tidying Markdown reference links - All this"
  url: https://leancrew.com/all-this/2012/09/tidying-markdown-reference-links/
  date: 2023-11-15T03:08:29Z
  file: leancrew-com-7l5uqs.txt
 ---
 I write all these posts in Markdown, and I tend to include a lot of links. I use numbered [reference-style links][1] and I like the numbers to be in sequential order. ([Here's the source of this post][2] to see what I mean.) I wrote a [Ruby script][3] to automate the process of renumbering links when I add a new one, and as mentioned in [last month's dispatch][4], I spent some time iterating on it to work with some new posts containing code blocks that I'd imported into my [Elsewhere][5] section.
 [1]: https://www.markdownguide.org/basic-syntax/#reference-style-links
 [2]: https://github.com/dce/davideisinger.com/blob/main/content/journal/keep-markdown-links-in-order-with-mdrenum/index.md?plain=1
 [3]: https://github.com/dce/davideisinger.com/blob/a2da87fee76fed027b389fcdeb449ad7aa4b6c6d/bin/renumber
 [4]: /journal/dispatch-9-november-2023/
 [5]: /elsewhere
 <!--more-->
 As I was working on the script, it was pretty easy to think of cases in which it would fail -- it can handle fenced code blocks, for example, but not ones set off by indentation. I thought it'd be cool to build something in Go that uses a proper Markdown parser instead of regular expressions. This might strike you as an esoteric undertaking, but as [Dr. Drang put it][6] when he embarked on a similar journey:
 > But there is an attraction to putting everything in apple pie order, even when no one but me will ever see it.
 [6]: https://leancrew.com/all-this/2012/09/tidying-markdown-reference-links/
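To make that failure mode concrete, here's a minimal Python sketch. It is not the Ruby script linked above; the `find_links` helper and both patterns are assumptions made up for illustration. It strips fenced code blocks before looking for links, yet still reports a link-like string that sits inside an indented code block:

```python
import re

TICKS = "`" * 3  # built at runtime so this example contains no literal fence

# Reference-style link such as [text][1].
LINK = re.compile(r"\[([^\]]+)\]\[(\d+)\]")

# Fenced code block: from an opening fence line through the closing fence line.
FENCE = re.compile(rf"^{TICKS}.*?^{TICKS}[ \t]*$", re.MULTILINE | re.DOTALL)


def find_links(markdown: str) -> list[tuple[str, str]]:
    """Return (text, number) pairs found outside fenced code blocks."""
    without_fences = FENCE.sub("", markdown)
    return LINK.findall(without_fences)


doc = "\n".join([
    "See [the docs][1].",
    "",
    TICKS,
    "fenced = '[not a link][9]'",
    TICKS,
    "",
    "    indented = '[also not a link][8]'",
])

# The fenced example is skipped, but the indented one is still picked up.
print(find_links(doc))
```

Telling the indented block apart from ordinary text takes knowledge of Markdown's block structure, which is exactly what a real parser provides.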
 ## First Attempts with Go
 My very first attempt involved the [`gomarkdown`][7] package. It was super straightforward to turn a Markdown document into an <abbr title="abstract syntax tree">AST</abbr>, but after an hour or so of investigation, it was pretty clear that I wasn't going to be able to get the original text and position of the links. I switched over to [`goldmark`][8], which is what this website uses to turn Markdown into HTML. This seemed a lot more promising -- it has functions for retrieving the content of nodes, as well as `start` and `stop` attributes that indicate position in the original text. I thought I had it nailed, but as I started writing tests, I realized there were certain cases where I couldn't perfectly locate the links -- two links smashed right up against one another, as an example. I spent a long time trying to come up with something that covered all the weird edge cases, but eventually gave up in frustration.
 [7]: https://pkg.go.dev/github.com/gomarkdown/markdown/ast
 [8]: https://github.com/yuin/goldmark
 Both of these libraries are built to take Markdown, parse it, and turn it into HTML. That's fine, that's what Markdown is for, but for my use case, they came up short. I briefly considered forking `goldmark` to add the functionality I needed, but instead decided to look elsewhere.
 ## A Promising JavaScript Library
 I searched for generic Markdown/AST libraries just to see what else was out there, and a [helpful Stack Overflow comment][9] led me to [`mdast-util-from-markdown`][10], a JavaScript library for working with Markdown without a specific output format. I pulled it down and ran the example code, and it was immediately obvious that it would provide the data I needed.
 [9]: https://stackoverflow.com/a/74062924
 [10]: https://github.com/syntax-tree/mdast-util-from-markdown
 But now I had a new problem: I like JavaScript (and especially TypeScript) just fine, but I find the ecosystem around it bewildering, and furthermore, most of it is tailored for delivering complex functionality to browsers, not distributing simple command-line programs. I even went so far as to investigate using AI to convert the JS code to Go; the [solution][11] I found has some pretty severe character limitations, but I wonder if seamlessly converting code written in one language to another will be a thing in five years.
 [11]: https://www.codeconvert.ai/typescript-to-golang-converter
 ## New JS Runtimes to the Rescue
 On a whim, I decided to check out [Deno][12], a newer alternative to Node.js for server-side JS. Turns out it has the ability to [compile JS into standalone executables][13]. I downloaded it and ran it against the example code, and it worked! I got a (rather large) executable with the same output as running my script with Node. A coworker recommended I check out [Bun][14], which has [a similar compilation feature][15] -- it worked just as well, and the resulting executable was about a third the size of Deno's, so I opted to go with that.
 [12]: https://deno.com/
 [13]: https://docs.deno.com/runtime/manual/tools/compiler
 [14]: https://bun.sh/
 [15]: https://bun.sh/docs/bundler/executables
 Once I had a working proof-of-concept and a toolchain I was happy with, the rest was all fun; writing recursive functions that work with tree structures to do useful work is extremely my shit ([here's an old post I wrote about _The Little Schemer_][16] along these same lines). I added [Jest][17] and pulled in all my Go tests, as well as [Prettier][18] to stand in for `gofmt`. I wrapped things up earlier this week and published the result, which I've imaginatively called `mdrenum`, to [GitHub][19].
 [16]: /elsewhere/the-little-schemer-will-expand-blow-your-mind/
 [17]: https://jestjs.io/
 [18]: https://prettier.io/
 [19]: https://github.com/dce/mdrenum
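For a rough idea of what that kind of tree recursion looks like, here's a sketch. `mdrenum` itself is TypeScript, so the Python below is only an illustration and not the project's code: it uses hand-built dictionaries shaped like mdast nodes (`type`, `children`, and an `identifier` on `linkReference` nodes), and the `reference_order` and `renumbering` names are hypothetical.

```python
from typing import Optional


def reference_order(node: dict, seen: Optional[list] = None) -> list:
    """Walk the tree depth-first, collecting link identifiers by first use."""
    if seen is None:
        seen = []
    if node.get("type") == "linkReference" and node["identifier"] not in seen:
        seen.append(node["identifier"])
    for child in node.get("children", []):
        reference_order(child, seen)
    return seen


def renumbering(tree: dict) -> dict:
    """Map each old identifier to its new sequential number."""
    return {old: str(i) for i, old in enumerate(reference_order(tree), start=1)}


# A tiny hand-written tree: link 3 is used first, then link 1, then 3 again.
tree = {
    "type": "root",
    "children": [
        {"type": "paragraph", "children": [
            {"type": "text", "value": "Read "},
            {"type": "linkReference", "identifier": "3",
             "children": [{"type": "text", "value": "this"}]},
            {"type": "emphasis", "children": [
                {"type": "linkReference", "identifier": "1",
                 "children": [{"type": "text", "value": "that"}]},
            ]},
            {"type": "linkReference", "identifier": "3",
             "children": [{"type": "text", "value": "this again"}]},
        ]},
    ],
}

print(renumbering(tree))
```

Because the walk visits children in document order, a link nested inside emphasis is numbered by where it appears in the text, and a repeated link keeps the number from its first use.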
 Bun (compiler) + TypeScript (type checking) + Prettier (code formatting) is a pretty acceptable Go substitute. The resulting executable is big (~45MB, as compared with ~2MB for my Go solution), but, hey, disk space is cheap and this actually works.
 ## Integrating with Helix
 I've been a [happy Helix user][20] for the last several months, and I thought it'd be cool to configure it to automatically renumber links every time I save a Markdown file. [The docs][21] do a pretty good job explaining how to add a language-specific formatter:
 [20]: /journal/a-month-with-helix/
 [21]: https://docs.helix-editor.com/languages.html#language-configuration
 > The formatter for the language, it will take precedence over the lsp when defined. The formatter must be able to take the original file as input from stdin and write the formatted file to stdout
 [This was pretty simple to add to the program][22], and then I added the following to `~/.config/helix/languages.toml`:
 ```toml
 [[language]]
 name = "markdown"
 auto-format = true
 formatter = { command = "mdrenum", args = ["--stdin"] }
 ```
 [22]: https://github.com/dce/mdrenum/blob/main/src/cli.ts#L7-L18
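The contract in that quote is small: read the whole file from stdin, write the whole formatted file to stdout. A minimal Python sketch of a filter with that shape follows; it is not the linked TypeScript, and `renumber` is a hypothetical placeholder that returns its input unchanged.

```python
import io
import sys


def renumber(markdown: str) -> str:
    """Placeholder for the real renumbering logic; returns the input unchanged."""
    return markdown


def run_filter(source=sys.stdin, sink=sys.stdout) -> None:
    """Read the whole document from `source` and write the result to `sink`."""
    sink.write(renumber(source.read()))


# Simulated round trip using in-memory streams instead of a real pipe.
fake_in = io.StringIO("See [a][2] and [b][1].\n")
fake_out = io.StringIO()
run_filter(fake_in, fake_out)
print(fake_out.getvalue(), end="")
```

The output has to be the complete document rather than a diff or a report, since the editor replaces the buffer with whatever the formatter prints.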
 This totally works, and I'll say that it's uniquely satisfying to save a document and see the link numbers get instantly reordered properly.
 ---
 Thanks for coming on this journey with me, and if this seems like a tool that might be useful to you, grab it from [GitHub][19] and open an issue if you have any questions.
--- a/static/archive/leancrew-com-7l5uqs.txt
+++ b/static/archive/leancrew-com-7l5uqs.txt
@@ -0,0 +1,286 @@
 [3]And now it’s all this
 I just said what I said and it was wrong
 Or was taken wrong
 [6]Tidying Markdown reference links
   September 17, 2012 at 9:15 PM by Dr. Drang
   Oscar Wilde—who would have been great on Twitter—[7]said “I couldn’t
   help it. I can resist everything except temptation.” That’s my excuse
   for this post.
   Several days ago I got an email from a reader, asking if I knew of a
   script that would tidy up [8]Markdown reference links in a document.
   She wanted them reordered and renumbered at the end of the document to
   match the order in which they appear in the body of the text. I didn’t
   know of one^[9]1 and suggested she write it herself and let me know
   when it’s done. I’ve been getting progress reports, but her script
   isn’t finished yet.
   There’s certainly no need to tidy the links up that way. Markdown
   doesn’t care what order the reference links appear in or the labels
   that are assigned to them. I’ve written dozens of posts in which the
   order of the references at the end of the Markdown source were way off
   from the order of the links in body. But…
   But there is an attraction to putting everything in apple pie order,
   even when no one but me will ever see it. Last night I succumbed and
   wrote a script to tidy up the links. Sorry, Phaedra.
   Here’s an example of a short Markdown document with out-of-order
   reference links:
 Species and their hybrids, How simply are these facts! How
 strange that the pollen of each But we may thus have
 [succeeded][2] in selecting so many exceptions to this rule.
 but the species would not all the same species living on the
 White Mountains, in the arctic regions of that large island.
 The exceptions which are now large, and triumphant, and
 which are known to every naturalist: scarcely a single
 [character][4] in the descendants of the Glacial period,
 would have been of use to the plants, have been accumulated
 and if, in both regions.
 Supposed to be extinct and unknown, form. We have seen that
 it yields readily, when subjected as [under confinement][3],
 to new and improved varieties will have been much
 compressed, we may assume that the species, which are
 already present in the ordinary spines serve as a prehensile
 or snapping apparatus. Thus every gradation, from animals
 with true lungs are descended from a marsupial form), "and
 if so, there can be followed by which viscid matter, such as
 that of making [slaves][1]. Let it be remembered that
 selection may be extended--to the stigma of.
 [1]: http://daringfireball.net/markdown/
 [2]: http://www.google.com/
 [3]: http://docs.python.org/library/index.html
 [4]: http://www.kungfugrippe.com/
   Note that the references are numbered 1, 2, 3, 4 at the bottom of the
   document, but that they appear in the body in the order 2, 4, 3, 1. The
   purpose of the script is to change the document to
 Species and their hybrids, How simply are these facts! How
 strange that the pollen of each But we may thus have
 [succeeded][1] in selecting so many exceptions to this rule.
 but the species would not all the same species living on the
 White Mountains, in the arctic regions of that large island.
 The exceptions which are now large, and triumphant, and
 which are known to every naturalist: scarcely a single
 [character][2] in the descendants of the Glacial period,
 would have been of use to the plants, have been accumulated
 and if, in both regions.
 Supposed to be extinct and unknown, form. We have seen that
 it yields readily, when subjected as [under confinement][3],
 to new and improved varieties will have been much
 compressed, we may assume that the species, which are
 already present in the ordinary spines serve as a prehensile
 or snapping apparatus. Thus every gradation, from animals
 with true lungs are descended from a marsupial form), "and
 if so, there can be followed by which viscid matter, such as
 that of making [slaves][4]. Let it be remembered that
 selection may be extended--to the stigma of.
 [1]: http://www.google.com/
 [2]: http://www.kungfugrippe.com/
 [3]: http://docs.python.org/library/index.html
 [4]: http://daringfireball.net/markdown/
   Now the links are numbered 1, 2, 3, 4 in both the text and the end
   references. The HTML produced when this document is run through a
   Markdown processor will be the same as the previous one—the links will
   still go to the right places—but the Markdown source looks better.
   Here’s the script that does it:
 python:
 1:  #!/usr/bin/python
 2:
 3:  import sys
 4:  import re
 5:
 6:  '''Read a Markdown file via standard input and tidy its
 7:  reference links. The reference links will be numbered in
 8:  the order they appear in the text and placed at the bottom
 9:  of the file.'''
 10:
 11:  # The regex for finding reference links in the text. Don't find
 12:  # footnotes by mistake.
 13:  link = re.compile(r'\[([^\]]+)\]\[([^^\]]+)\]')
 14:
 15:  # The regex for finding the label. Again, don't find footnotes
 16:  # by mistake.
 17:  label = re.compile(r'^\[([^^\]]+)\]:\s+(.+)$', re.MULTILINE)
 18:
 19:  def refrepl(m):
 20:    'Rewrite reference links with the reordered link numbers.'
 21:    return '[%s][%d]' % (m.group(1), order.index(m.group(2)) + 1)
 22:
 23:  # Read in the file and find all the links and references.
 24:  text = sys.stdin.read()
 25:  links = link.findall(text)
 26:  labels = dict(label.findall(text))
 27:
 28:  # Determine the order of the links in the text. If a link is used
 29:  # more than once, its order is its first position.
 30:  order = []
 31:  for i in links:
 32:    if order.count(i[1]) == 0:
 33:      order.append(i[1])
 34:
 35:  # Make a list of the references in order of appearance.
 36:  newlabels = [ '[%d]: %s' % (i + 1, labels[j]) for (i, j) in enumerate(order) ]
 37:
 38:  # Remove the old references and put the new ones at the end of the text.
 39:  text = label.sub('', text).rstrip() + '\n'*3 + '\n'.join(newlabels)
 40:
 41:  # Rewrite the links with the new reference numbers.
 42:  text = link.sub(refrepl, text)
 43:
 44:  print text
   The regular expressions in Lines 13 and 17 are fairly easy to
   understand. The first one looks for the links in the body of the text
   and the second looks for the labels.
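As a quick check of what the two patterns from Lines 13 and 17 accept and reject, here is a small sketch run against a made-up sample string (the sample is an assumption for illustration, not taken from the post):

```python
import re

# Same patterns as Lines 13 and 17 of the script above.
link = re.compile(r'\[([^\]]+)\]\[([^^\]]+)\]')
label = re.compile(r'^\[([^^\]]+)\]:\s+(.+)$', re.MULTILINE)

sample = (
    "A [real link][2] and a footnote marker[^1].\n"
    "\n"
    "[2]: http://www.google.com/\n"
    "[^1]: A footnote, not a link label.\n"
)

# Only the reference link is found; the footnote marker is skipped.
print(link.findall(sample))

# Only the numbered label is found; the footnote definition is skipped.
print(label.findall(sample))
```

The `[^^\]]` character class is what excludes footnotes: it refuses a caret as well as a closing bracket, so `[^1]` never matches as a label.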
   The key to the script are the four data structures: links, labels,
   order, and newlabels. For our example document, links is the list of
   tuples
 [('succeeded', '2'),
 ('single character', '4'),
 ('under confinement', '3'),
 ('slaves', '1')]
   labels is the dictionary
 {'1': 'http://daringfireball.net/markdown/',
 '3': 'http://docs.python.org/library/index.html',
 '2': 'http://www.google.com/',
 '4': 'http://www.kungfugrippe.com/'}
   order is the list
 ['2', '4', '3', '1']
   and newlabels is the list of strings
 ['[1]: http://www.google.com/',
 '[2]: http://www.kungfugrippe.com/',
 '[3]: http://docs.python.org/library/index.html',
 '[4]: http://daringfireball.net/markdown/']
   links and labels are built via the regex findall method in Lines 25-26.
   links is the direct output of the method and maintains the order in
   which the links appear in the text. labels is that same output, but
   converted to a dictionary. Its order, which we don’t care about, is
   lost in the conversion, but it can be used to easily access the URL
   from the link label.
   order is the order in which the link labels first appear in the text.
   The if statement in Line 32 ensures that repeated links don’t overwrite
   each other.
   newlabels is built from labels and order in Line 36. It’s the list of
   labels after the renumbering. Line 39 deletes the original label lines
   and puts the new ones at the end of the document.
   Finally, Line 42 replaces all the link labels in the body of the text
   with the new values. Rather than a replacement string, it uses a simple
   replacement function defined in Lines 19-21 to do so.
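The steps above can be sketched as a single function for Python 3. This is a port under stated assumptions, not the script as published: the `tidy` name is hypothetical, the `print` statement is replaced by a return value, and the input is a shortened stand-in for the example document.

```python
import re

link = re.compile(r'\[([^\]]+)\]\[([^^\]]+)\]')
label = re.compile(r'^\[([^^\]]+)\]:\s+(.+)$', re.MULTILINE)


def tidy(text: str) -> str:
    """Renumber reference links by first appearance and move labels to the end."""
    labels = dict(label.findall(text))

    # Link labels in order of first appearance in the body.
    order = []
    for _, ref in link.findall(text):
        if ref not in order:
            order.append(ref)

    newlabels = ['[%d]: %s' % (i + 1, labels[ref]) for i, ref in enumerate(order)]

    # Drop the old label lines, append the new ones, then rewrite the body links.
    text = label.sub('', text).rstrip() + '\n' * 3 + '\n'.join(newlabels)
    return link.sub(
        lambda m: '[%s][%d]' % (m.group(1), order.index(m.group(2)) + 1), text)


source = (
    "We [succeeded][2] with a [character][4],\n"
    "[under confinement][3], making [slaves][1].\n"
    "\n"
    "[1]: http://daringfireball.net/markdown/\n"
    "[2]: http://www.google.com/\n"
    "[3]: http://docs.python.org/library/index.html\n"
    "[4]: http://www.kungfugrippe.com/\n"
)

print(tidy(source))
```

Each link keeps its destination: the URL is looked up by the old label before the new number is assigned, so only the numbering changes.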
   Barring any bugs I haven’t found yet, this script (or filter) will work
   on any Markdown document and can be used either directly from the
   command line or through whatever system your text editor uses to call
   external scripts. I have it stored in BBEdit’s Text Filters folder
   under the name “Tidy Markdown Reference Links.py,” so I can call it
   from the Text ‣ Apply Text Filter submenu.
   I should mention that although this script is fairly compact and
   simple, it didn’t spring from my head fully formed. There were starts
   and stops as I figured out which data structures were needed and how
   they could be built. Each little subsection of the script was tested as
   I went along. The order list was originally a list of tuples; it wasn’t
   until I had a working version of the entire script that I realized that
   it could be simplified down to a list of link labels. That change
   shortened the script by five lines or so and, more importantly,
   clarified its logic.
   Despite these improvements, the script is hardly foolproof. The
   Markdown source of this very post confuses the hell out of it. Not only
   does it think there are links in the sample document (which you’d
   probably guess), it also thinks the [%s][%d] in Line 21 of the script
   is a link (and the one in this sentence, too). And why wouldn’t it? To
   distinguish between real links and things that look like links in
   embedded source code, the script would have to be able to parse
   Markdown, not just match a couple of short regular expressions. This is
   a variant on what Hamish Sanderson said in the comments on [10]an
   earlier post.
   At the moment, I’m not willing to sacrifice the simplicity of the Tidy
   script to get it to handle weird posts like this one. But if I find
   that it fails often with the kind of input I commonly give it, I’ll
   have to revisit that decision.
   As Wilde also said, “Experience is the name everyone gives to their
   mistakes.”
     __________________________________________________________________
    1. I didn’t think [11]Seth Brown’s formd did that, but [12]this tweet
       from Brett Terpstra says I was wrong about that. [13]↩
 Credits
   [21]Powered by MathJax
   This work is licensed under a [22]Creative Commons Attribution-Share
   Alike 3.0 Unported License.
   © 2005–2023, Dr. Drang
 References
   1. https://leancrew.com/all-this/feed/
   2. https://leancrew.com/all-this/feed.json
   3. https://leancrew.com/all-this/
   4. https://leancrew.com/all-this/2012/09/some-kind-of-druid-dudes-lifting-the-veil/
   5. https://leancrew.com/all-this/2012/09/implementing-pubsubhubbub/
   6. https://leancrew.com/all-this/2012/09/tidying-markdown-reference-links/
   7. http://www.gutenberg.org/dirs/etext97/lwfan10h.htm
   8. http://daringfireball.net/projects/markdown/syntax#link
   9. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L97479-6275TMP.html#fn:formd
  10. http://www.leancrew.com/all-this/2012/09/applescript-syntax-highlighting-finally/
  11. http://www.drbunsen.org/formd-a-markdown-formatting-tool.html
  12. https://twitter.com/ttscoff/status/247398632377184256
  13. file:///var/folders/q9/qlz2w5251kzdfgn0np7z2s4c0000gn/T/L97479-6275TMP.html#fnref:formd
  14. https://leancrew.com/all-this/2012/09/some-kind-of-druid-dudes-lifting-the-veil/
  15. https://leancrew.com/all-this/2012/09/implementing-pubsubhubbub/
  16. https://leancrew.com/all-this/archive/
  17. https://leancrew.com/all-this/feed/
  18. https://leancrew.com/all-this/feed.json
  19. https://fosstodon.org/@drdrang
  20. http://github.com/drdrang
  21. http://www.mathjax.org/
  22. http://creativecommons.org/licenses/by-sa/3.0/