Pull in Viget posts

2023-10-22 23:52:56 -04:00
parent 625d374135
commit 0438a6d828
77 changed files with 8219 additions and 5 deletions
--- a/content/elsewhere/extract-embedded-text-from-pdfs-with-poppler-in-ruby/index.md
+++ b/content/elsewhere/extract-embedded-text-from-pdfs-with-poppler-in-ruby/index.md
@@ -0,0 +1,86 @@
+---
+title: "Extract Embedded Text from PDFs with Poppler in Ruby"
+date: 2022-02-10T00:00:00+00:00
+draft: false
+needs_review: true
+canonical_url: https://www.viget.com/articles/extract-embedded-text-from-pdfs-with-poppler-in-ruby/
+---
+
+A recent client request had us adding an archive of magazine issues
+dating back to the 1980s. Pretty straightforward stuff, with the hiccup
+that they wanted the magazine content to be searchable. Fortunately, the
+example PDFs they provided us had embedded text
+content[^1^](#fn1){#fnref1 .footnote-ref role="doc-noteref"}, i.e. the
+text was selectable. The trick was to figure out how to programmatically
+extract that content.
+
+Our first attempt involved the [`pdf-reader`
+gem](https://rubygems.org/gems/pdf-reader/versions/2.2.1), which worked
+admirably with the caveat that it had a little bit of trouble with
+multi-column / art-directed layouts[^2^](#fn2){#fnref2 .footnote-ref
+role="doc-noteref"}, which was a lot of the content we were dealing
+with.
+
+A bit of research uncovered [Poppler](https://poppler.freedesktop.org/),
+"a free software utility library for rendering Portable Document Format
+(PDF) documents," which includes text extraction functionality and has a
+corresponding [Ruby
+library](https://rubygems.org/gems/poppler/versions/3.4.9). This worked
+great and here's how to do it.
+
+## Install Poppler
+
+Poppler installs as a standalone library. On Mac:
+
+    brew install poppler
+
+On (Debian-based) Linux:
+
+    apt-get install libgirepository1.0-dev libpoppler-glib-dev
+
+In a (Debian-based) Dockerfile:
+
+    RUN apt-get update && 
+      apt-get install -y libgirepository1.0-dev libpoppler-glib-dev && 
+      rm -rf /var/lib/apt/lists/*
+
+Then, in your `Gemfile`:
+
+    gem "poppler"
+
+## Use it in your application
+
+Extracting text from a PDF document is super straightforward:
+
+    document = Poppler::Document.new(path_to_pdf)
+    document.map { |page| page.get_text }.join
+
+The results are really good, and Poppler understands complex page
+layouts to an impressive degree. Additionally, the library seems to
+support a lot more [advanced
+functionality](https://www.rubydoc.info/gems/poppler/3.4.9). If you ever
+need to extract text from a PDF, Poppler is a good choice.
+
+[*John Popper photo by Gage Skidmore, CC BY-SA
+3.0*](https://commons.wikimedia.org/w/index.php?curid=39946499)
+
+
+------------------------------------------------------------------------
+
+1.  [Note that we're not talking about extracting text from images/OCR;
+    if you need to take an image-based PDF and add a selectable text
+    layer to it, I recommend
+    [OCRmyPDF](https://pypi.org/project/ocrmypdf/).
+    [↩︎](#fnref1){.footnote-back role="doc-backlink"}]{#fn1}
+
+2.  [So for a page like this:]{#fn2}
+
+        +-----------------+---------------------+
+        | This is a story | my life got flipped |
+        | all about how   | turned upside-down  |
+        +-----------------+---------------------+
+
+    `pdf-reader` would parse this into "This is a story my life got
+    flipped all about how turned upside-down," which led to issues when
+    searching for multi-word phrases. [↩︎](#fnref2){.footnote-back
+    role="doc-backlink"}