copy-edit viget posts
This commit is contained in:
@@ -2,7 +2,6 @@
|
||||
title: "Extract Embedded Text from PDFs with Poppler in Ruby"
|
||||
date: 2022-02-10T00:00:00+00:00
|
||||
draft: false
|
||||
needs_review: true
|
||||
canonical_url: https://www.viget.com/articles/extract-embedded-text-from-pdfs-with-poppler-in-ruby/
|
||||
---
|
||||
|
||||
@@ -10,15 +9,14 @@ A recent client request had us adding an archive of magazine issues
|
||||
dating back to the 1980s. Pretty straightforward stuff, with the hiccup
|
||||
that they wanted the magazine content to be searchable. Fortunately, the
|
||||
example PDFs they provided us had embedded text
|
||||
content[^1^](#fn1){#fnref1 .footnote-ref role="doc-noteref"}, i.e. the
|
||||
content[^1], i.e. the
|
||||
text was selectable. The trick was to figure out how to programmatically
|
||||
extract that content.
|
||||
|
||||
Our first attempt involved the [`pdf-reader`
|
||||
gem](https://rubygems.org/gems/pdf-reader/versions/2.2.1), which worked
|
||||
admirably with the caveat that it had a little bit of trouble with
|
||||
multi-column / art-directed layouts[^2^](#fn2){#fnref2 .footnote-ref
|
||||
role="doc-noteref"}, which was a lot of the content we were dealing
|
||||
multi-column / art-directed layouts[^2], which was a lot of the content we were dealing
|
||||
with.
|
||||
|
||||
A bit of research uncovered [Poppler](https://poppler.freedesktop.org/),
|
||||
@@ -32,28 +30,38 @@ great and here's how to do it.
|
||||
|
||||
Poppler installs as a standalone library. On Mac:
|
||||
|
||||
brew install poppler
|
||||
```
|
||||
brew install poppler
|
||||
```
|
||||
|
||||
On (Debian-based) Linux:
|
||||
|
||||
apt-get install libgirepository1.0-dev libpoppler-glib-dev
|
||||
```
|
||||
apt-get install libgirepository1.0-dev libpoppler-glib-dev
|
||||
```
|
||||
|
||||
In a (Debian-based) Dockerfile:
|
||||
|
||||
RUN apt-get update &&
|
||||
apt-get install -y libgirepository1.0-dev libpoppler-glib-dev &&
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
```dockerfile
|
||||
RUN apt-get update &&
|
||||
apt-get install -y libgirepository1.0-dev libpoppler-glib-dev &&
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
````
|
||||
|
||||
Then, in your `Gemfile`:
|
||||
|
||||
gem "poppler"
|
||||
```ruby
|
||||
gem "poppler"
|
||||
````
|
||||
|
||||
## Use it in your application
|
||||
|
||||
Extracting text from a PDF document is super straightforward:
|
||||
|
||||
document = Poppler::Document.new(path_to_pdf)
|
||||
document.map { |page| page.get_text }.join
|
||||
```ruby
|
||||
document = Poppler::Document.new(path_to_pdf)
|
||||
document.map { |page| page.get_text }.join
|
||||
```
|
||||
|
||||
The results are really good, and Poppler understands complex page
|
||||
layouts to an impressive degree. Additionally, the library seems to
|
||||
@@ -65,15 +73,12 @@ need to extract text from a PDF, Poppler is a good choice.
|
||||
3.0*](https://commons.wikimedia.org/w/index.php?curid=39946499)
|
||||
|
||||
|
||||
------------------------------------------------------------------------
|
||||
[^1]: Note that we're not talking about extracting text from images/OCR;
|
||||
if you need to take an image-based PDF and add a selectable text
|
||||
layer to it, I recommend
|
||||
[OCRmyPDF](https://pypi.org/project/ocrmypdf/).
|
||||
|
||||
1. [Note that we're not talking about extracting text from images/OCR;
|
||||
if you need to take an image-based PDF and add a selectable text
|
||||
layer to it, I recommend
|
||||
[OCRmyPDF](https://pypi.org/project/ocrmypdf/).
|
||||
[↩︎](#fnref1){.footnote-back role="doc-backlink"}]{#fn1}
|
||||
|
||||
2. [So for a page like this:]{#fn2}
|
||||
[^2]: So for a page like this:
|
||||
|
||||
+-----------------+---------------------+
|
||||
| This is a story | my life got flipped |
|
||||
@@ -82,5 +87,4 @@ need to extract text from a PDF, Poppler is a good choice.
|
||||
|
||||
`pdf-reader` would parse this into "This is a story my life got
|
||||
flipped all about how turned upside-down," which led to issues when
|
||||
searching for multi-word phrases. [↩︎](#fnref2){.footnote-back
|
||||
role="doc-backlink"}
|
||||
searching for multi-word phrases.
|
||||
Reference in New Issue
Block a user