copy-edit viget posts

2023-10-24 20:48:09 -04:00
parent 0438a6d828
commit f86f391e82
77 changed files with 1663 additions and 1380 deletions
--- a/content/elsewhere/extract-embedded-text-from-pdfs-with-poppler-in-ruby/index.md
+++ b/content/elsewhere/extract-embedded-text-from-pdfs-with-poppler-in-ruby/index.md
@@ -2,7 +2,6 @@
 title: "Extract Embedded Text from PDFs with Poppler in Ruby"
 date: 2022-02-10T00:00:00+00:00
 draft: false
-needs_review: true
 canonical_url: https://www.viget.com/articles/extract-embedded-text-from-pdfs-with-poppler-in-ruby/
 ---

@@ -10,15 +9,14 @@ A recent client request had us adding an archive of magazine issues
 dating back to the 1980s. Pretty straightforward stuff, with the hiccup
 that they wanted the magazine content to be searchable. Fortunately, the
 example PDFs they provided us had embedded text
-content[^1^](#fn1){#fnref1 .footnote-ref role="doc-noteref"}, i.e. the
+content[^1], i.e. the
 text was selectable. The trick was to figure out how to programmatically
 extract that content.

 Our first attempt involved the [`pdf-reader`
 gem](https://rubygems.org/gems/pdf-reader/versions/2.2.1), which worked
 admirably with the caveat that it had a little bit of trouble with
-multi-column / art-directed layouts[^2^](#fn2){#fnref2 .footnote-ref
-role="doc-noteref"}, which was a lot of the content we were dealing
+multi-column / art-directed layouts[^2], which was a lot of the content we were dealing
 with.

 A bit of research uncovered [Poppler](https://poppler.freedesktop.org/),
@@ -32,28 +30,38 @@ great and here's how to do it.

 Poppler installs as a standalone library. On Mac:

-    brew install poppler
+```
+brew install poppler
+```

 On (Debian-based) Linux:

-    apt-get install libgirepository1.0-dev libpoppler-glib-dev
+```
+apt-get install libgirepository1.0-dev libpoppler-glib-dev
+```

 In a (Debian-based) Dockerfile:

-    RUN apt-get update && 
-      apt-get install -y libgirepository1.0-dev libpoppler-glib-dev && 
-      rm -rf /var/lib/apt/lists/*
+```dockerfile
+RUN apt-get update && \
+  apt-get install -y libgirepository1.0-dev libpoppler-glib-dev && \
+  rm -rf /var/lib/apt/lists/*
+```

 Then, in your `Gemfile`:

-    gem "poppler"
+```ruby
+gem "poppler"
+```

 ## Use it in your application

 Extracting text from a PDF document is super straightforward:

-    document = Poppler::Document.new(path_to_pdf)
-    document.map { |page| page.get_text }.join
+```ruby
+document = Poppler::Document.new(path_to_pdf)
+document.map { |page| page.get_text }.join
+```
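
For anything beyond a one-off script, it may help to wrap those two lines in a small helper. The following is only a sketch: `pages_to_text` and `extract_pdf_text` are hypothetical names, and the form-feed page separator and the blank-page filtering are assumptions of mine rather than anything Poppler requires. The only Poppler calls used are the two shown above (`Poppler::Document.new` and `page.get_text`).

```ruby
# Hypothetical helper: accepts anything enumerable whose items respond to
# `get_text` (a Poppler::Document yields such pages, per the snippet above).
def pages_to_text(pages, separator: "\f")
  pages
    .map { |page| page.get_text.to_s.strip } # to_s is a guard in case text is nil
    .reject(&:empty?)                        # skip pages that produced no text
    .join(separator)                         # assumed separator: form feed
end

# Hypothetical wrapper: loads the gem lazily and reads a single file.
def extract_pdf_text(path_to_pdf)
  require "poppler"
  pages_to_text(Poppler::Document.new(path_to_pdf))
end
```

Calling `extract_pdf_text` with the path to a PDF would then return one string with a form feed between pages, which keeps page boundaries recoverable if the search index needs them. Keeping `pages_to_text` free of any Poppler reference also means it can be exercised with simple stand-in objects.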

 The results are really good, and Poppler understands complex page
 layouts to an impressive degree. Additionally, the library seems to
@@ -65,15 +73,12 @@ need to extract text from a PDF, Poppler is a good choice.
 3.0*](https://commons.wikimedia.org/w/index.php?curid=39946499)


------------------------------------------------------------------------
+[^1]: Note that we're not talking about extracting text from images/OCR;
+if you need to take an image-based PDF and add a selectable text
+layer to it, I recommend
+[OCRmyPDF](https://pypi.org/project/ocrmypdf/).

-1.  [Note that we're not talking about extracting text from images/OCR;
-    if you need to take an image-based PDF and add a selectable text
-    layer to it, I recommend
-    [OCRmyPDF](https://pypi.org/project/ocrmypdf/).
-    [↩︎](#fnref1){.footnote-back role="doc-backlink"}]{#fn1}
-
-2.  [So for a page like this:]{#fn2}
+[^2]: So for a page like this:

        +-----------------+---------------------+
        | This is a story | my life got flipped |
@@ -82,5 +87,4 @@ need to extract text from a PDF, Poppler is a good choice.

    `pdf-reader` would parse this into "This is a story my life got
    flipped all about how turned upside-down," which led to issues when
-    searching for multi-word phrases. [↩︎](#fnref2){.footnote-back
-    role="doc-backlink"}
+    searching for multi-word phrases.
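
To make the phrase-search problem concrete, here is a small plain-Ruby illustration; it does not call `pdf-reader` or Poppler. The cell contents are reconstructed from the quoted output above and are an assumption about how the example page is split into columns. `ROWS`, `read_across`, and `read_by_column` are hypothetical names used only for this sketch.

```ruby
# Each inner array is one visual row of the page: [left column, right column].
# (Cell contents are an assumed reconstruction of the example page.)
ROWS = [
  ["This is a story", "my life got flipped"],
  ["all about how",   "turned upside-down"]
].freeze

# Row-major: read straight across the page, ignoring column boundaries.
def read_across(rows)
  rows.flatten.join(" ")
end

# Column-major: finish the left column before starting the right one.
def read_by_column(rows)
  rows.transpose.flatten.join(" ")
end

phrase = "story all about how my life"
read_across(ROWS).include?(phrase)    # => false
read_by_column(ROWS).include?(phrase) # => true
```

Reading across the rows reproduces the scrambled sentence quoted above, so a phrase that spans the end of the left column and the start of the right one is only found when the text is read column by column.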