Pull in Viget posts

2023-10-22 23:52:56 -04:00
parent 625d374135
commit 0438a6d828
77 changed files with 8219 additions and 5 deletions
--- a/content/elsewhere/lets-write-a-dang-elasticsearch-plugin/index.md
+++ b/content/elsewhere/lets-write-a-dang-elasticsearch-plugin/index.md
@@ -0,0 +1,427 @@
+---
+title: "Let’s Write a Dang ElasticSearch Plugin"
+date: 2021-03-15T00:00:00+00:00
+draft: false
+needs_review: true
+canonical_url: https://www.viget.com/articles/lets-write-a-dang-elasticsearch-plugin/
+---
+
+One of our current projects involves a complex interactive query builder
+to search a large collection of news items. Some of the conditionals
+fall outside of the sweet spot of Postgres (e.g. word X must appear
+within Y words of word Z), and so we opted to pull in
+[ElasticSearch](https://www.elastic.co/elasticsearch/) alongside it.
+It\'s worked perfectly, hitting all of our condition and grouping needs
+with one exception: we need to be able to filter for articles that
+contain a term a minimum number of times (so \"Apple\" must appear in
+the article 3 times, for example). Frustratingly, Elastic *totally* has
+this information via its
+[`term_vector`](https://www.elastic.co/guide/en/elasticsearch/reference/current/term-vector.html)
+feature, but you can\'t use that data inside a query, as least as far as
+I can tell.
+
+The solution, it seems, is to write a custom plugin. I figured it out,
+eventually, but it was a lot of trial-and-error as the documentation I
+was able to find is largely outdated or incomplete. So I figured I\'d
+take what I learned while it\'s still fresh in my mind in the hopes that
+someone else might have an easier time of it. That\'s what internet
+friends are for, after all.
+
+Quick note before we start: all the version numbers you see are current
+and working as of February 25, 2021. Hopefully this post ages well, but
+if you try this out and hit issues, bumping the versions of Elastic,
+Gradle, and maybe even Java is probably a good place to start. Also, I
+use `projectname` a lot in the code examples --- that\'s not a special
+word and you should change it to something that makes sense for you.
+
+[]{#1-set-up-a-java-development-environment}
+
+## 1. Set up a Java development environment [\#](#1-set-up-a-java-development-environment "Direct link to 1. Set up a Java development environment"){.anchor aria-label="Direct link to 1. Set up a Java development environment"}
+
+First off, you\'re gonna be writing some Java. That\'s not my usual
+thing, so the first step was to get a working environment to compile my
+code. To do that, we\'ll use [Docker](https://www.docker.com/). Here\'s
+a `Dockerfile`:
+
+``` {.code-block .line-numbers}
+FROM adoptopenjdk/openjdk12:jdk-12.0.2_10-ubuntu
+
+RUN apt-get update && 
+  apt-get install -y zip unzip && 
+  rm -rf /var/lib/apt/lists/*
+
+SHELL ["/bin/bash", "-c"]
+
+RUN curl -s "https://get.sdkman.io" | bash && 
+  source "/root/.sdkman/bin/sdkman-init.sh" && 
+  sdk install gradle 6.8.2
+
+WORKDIR /plugin
+```
+
+We use a base image with all the Java stuff but also a working Ubuntu
+install so that we can do normal Linux-y things inside our container.
+From your terminal, build the image:
+
+`> docker build . -t projectname-java`
+
+Then, spin up the container and start an interactive shell, mounting
+your local working directory into `/plugin`:
+
+`> docker run --rm -it -v ${PWD}:/plugin projectname-java bash`
+
+[]{#2-configure-gradle}
+
+## 2. Configure Gradle [\#](#2-configure-gradle "Direct link to 2. Configure Gradle"){.anchor aria-label="Direct link to 2. Configure Gradle"}
+
+[Gradle](https://gradle.org/) is a \"build automation tool for
+multi-language software development,\" and what Elastic recommends for
+plugin development. Configuring Gradle to build the plugin properly was
+the hardest part of this whole endeavor. Throw this into `build.gradle`
+in your project root:
+
+``` {.code-block .line-numbers}
+buildscript {
+  repositories {
+    mavenLocal()
+    mavenCentral()
+    jcenter()
+  }
+
+  dependencies {
+    classpath "org.elasticsearch.gradle:build-tools:7.11.1"
+  }
+}
+
+apply plugin: 'java'
+
+compileJava {
+  sourceCompatibility = JavaVersion.VERSION_12
+  targetCompatibility = JavaVersion.VERSION_12
+}
+
+apply plugin: 'elasticsearch.esplugin'
+
+group = "com.projectname"
+version = "0.0.1"
+
+esplugin {
+  name 'contains-multiple'
+  description 'A script for finding documents that match a term a certain number of times'
+  classname 'com.projectname.containsmultiple.ContainsMultiplePlugin'
+  licenseFile rootProject.file('LICENSE.txt')
+  noticeFile rootProject.file('NOTICE.txt')
+}
+
+validateNebulaPom.enabled = false
+```
+
+You\'ll also need files named `LICENSE.txt` and `NOTICE.txt` --- mine
+are empty, since the plugin is for internal use only. If you\'re going
+to be releasing your plugin in some public way, maybe talk to a lawyer
+about what to put in those files.
+
+[]{#3-write-the-dang-plugin}
+
+## 3. Write the dang plugin [\#](#3-write-the-dang-plugin "Direct link to 3. Write the dang plugin"){.anchor aria-label="Direct link to 3. Write the dang plugin"}
+
+To write the actual plugin, I started with [this example
+plugin](https://github.com/elastic/elasticsearch/blob/master/plugins/examples/script-expert-scoring/src/main/java/org/elasticsearch/example/expertscript/ExpertScriptPlugin.java)
+which scores a document based on the frequency of a given term. My use
+case was fortunately quite similar, though I\'m using a `filter` query,
+meaning I just want a boolean, i.e. does this document contain this term
+the requisite number of times? As such, I implemented a
+[`FilterScript`](https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/latest/org/elasticsearch/script/FilterScript.html)
+rather than the `ScoreScript` implemented in the example code.
+
+This file lives in (deep breath)
+`src/main/java/com/projectname/containsmultiple/ContainsMultiplePlugin.java`:
+
+``` {.code-block .line-numbers}
+package com.projectname.containsmultiple;
+
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.Term;
+import org.elasticsearch.common.settings.Settings;
+import org.elasticsearch.plugins.Plugin;
+import org.elasticsearch.plugins.ScriptPlugin;
+import org.elasticsearch.script.FilterScript;
+import org.elasticsearch.script.FilterScript.LeafFactory;
+import org.elasticsearch.script.ScriptContext;
+import org.elasticsearch.script.ScriptEngine;
+import org.elasticsearch.script.ScriptFactory;
+import org.elasticsearch.search.lookup.SearchLookup;
+
+import java.io.IOException;
+import java.io.UncheckedIOException;
+import java.util.Collection;
+import java.util.Map;
+import java.util.Set;
+
+/**
+ * A script for finding documents that match a term a certain number of times
+ */
+public class ContainsMultiplePlugin extends Plugin implements ScriptPlugin {
+
+    @Override
+    public ScriptEngine getScriptEngine(
+        Settings settings,
+        Collection<ScriptContext<?>> contexts
+    ) {
+        return new ContainsMultipleEngine();
+    }
+
+    // tag::contains_multiple
+    private static class ContainsMultipleEngine implements ScriptEngine {
+        @Override
+        public String getType() {
+            return "expert_scripts";
+        }
+
+        @Override
+        public <T> T compile(
+            String scriptName,
+            String scriptSource,
+            ScriptContext<T> context,
+            Map<String, String> params
+        ) {
+            if (context.equals(FilterScript.CONTEXT) == false) {
+                throw new IllegalArgumentException(getType()
+                        + " scripts cannot be used for context ["
+                        + context.name + "]");
+            }
+            // we use the script "source" as the script identifier
+            if ("contains_multiple".equals(scriptSource)) {
+                FilterScript.Factory factory = new ContainsMultipleFactory();
+                return context.factoryClazz.cast(factory);
+            }
+            throw new IllegalArgumentException("Unknown script name "
+                    + scriptSource);
+        }
+
+        @Override
+        public void close() {
+            // optionally close resources
+        }
+
+        @Override
+        public Set<ScriptContext<?>> getSupportedContexts() {
+            return Set.of(FilterScript.CONTEXT);
+        }
+
+        private static class ContainsMultipleFactory implements FilterScript.Factory,
+                                                      ScriptFactory {
+            @Override
+            public boolean isResultDeterministic() {
+                return true;
+            }
+
+            @Override
+            public LeafFactory newFactory(
+                Map<String, Object> params,
+                SearchLookup lookup
+            ) {
+                return new ContainsMultipleLeafFactory(params, lookup);
+            }
+        }
+
+        private static class ContainsMultipleLeafFactory implements LeafFactory {
+            private final Map<String, Object> params;
+            private final SearchLookup lookup;
+            private final String field;
+            private final String term;
+            private final int count;
+
+            private ContainsMultipleLeafFactory(
+                        Map<String, Object> params, SearchLookup lookup) {
+                if (params.containsKey("field") == false) {
+                    throw new IllegalArgumentException(
+                            "Missing parameter [field]");
+                }
+                if (params.containsKey("term") == false) {
+                    throw new IllegalArgumentException(
+                            "Missing parameter [term]");
+                }
+                if (params.containsKey("count") == false) {
+                    throw new IllegalArgumentException(
+                            "Missing parameter [count]");
+                }
+                this.params = params;
+                this.lookup = lookup;
+                field = params.get("field").toString();
+                term = params.get("term").toString();
+                count = Integer.parseInt(params.get("count").toString());
+            }
+
+            @Override
+            public FilterScript newInstance(LeafReaderContext context)
+                    throws IOException {
+                PostingsEnum postings = context.reader().postings(
+                        new Term(field, term));
+                if (postings == null) {
+                    /*
+                     * the field and/or term don't exist in this segment,
+                     * so always return 0
+                     */
+                    return new FilterScript(params, lookup, context) {
+                        @Override
+                        public boolean execute() {
+                            return false;
+                        }
+                    };
+                }
+                return new FilterScript(params, lookup, context) {
+                    int currentDocid = -1;
+                    @Override
+                    public void setDocument(int docid) {
+                        /*
+                         * advance has undefined behavior calling with
+                         * a docid <= its current docid
+                         */
+                        if (postings.docID() < docid) {
+                            try {
+                                postings.advance(docid);
+                            } catch (IOException e) {
+                                throw new UncheckedIOException(e);
+                            }
+                        }
+                        currentDocid = docid;
+                    }
+                    @Override
+                    public boolean execute() {
+                        if (postings.docID() != currentDocid) {
+                            /*
+                             * advance moved past the current doc, so this
+                             * doc has no occurrences of the term
+                             */
+                            return false;
+                        }
+                        try {
+                            return postings.freq() >= count;
+                        } catch (IOException e) {
+                            throw new UncheckedIOException(e);
+                        }
+                    }
+                };
+            }
+        }
+    }
+    // end::contains_multiple
+}
+```
+
+[]{#4-add-it-to-elasticSearch}
+
+## 4. Add it to ElasticSearch [\#](#4-add-it-to-elasticSearch "Direct link to 4. Add it to ElasticSearch"){.anchor aria-label="Direct link to 4. Add it to ElasticSearch"}
+
+With our code in place (and synced into our Docker container with a
+mounted volume), it\'s time to compile it. In the Docker shell you
+started up in step #1, build your plugin:
+
+`> gradle build`
+
+Assuming that works, you should now see a `build` directory with a bunch
+of stuff in it. The file you care about is
+`build/distributions/contains-multiple-0.0.1.zip` (though that\'ll
+obviously change if you call your plugin something different or give it
+a different version number). Grab that file and copy it to where you
+plan to actually run ElasticSearch. For me, I placed it in a folder
+called `.docker/elastic` in the main project repo. In that same
+directory, create a new `Dockerfile` that\'ll actually run Elastic:
+
+``` {.code-block .line-numbers}
+FROM docker.elastic.co/elasticsearch/elasticsearch:7.11.1
+
+COPY .docker/elastic/contains-multiple-0.0.1.zip /plugins/contains-multiple-0.0.1.zip
+
+RUN elasticsearch-plugin install 
+  file:///plugins/contains-multiple-0.0.1.zip
+```
+
+Then, in your project root, create the following `docker-compose.yml`:
+
+``` {.code-block .line-numbers}
+version: '3.2'
+
+services: elasticsearch:
+    image: projectname_elasticsearch
+    build:
+      context: .
+      dockerfile: ./.docker/elastic/Dockerfile
+      ports:
+        - 9200:9200
+      environment:
+        - discovery.type=single-node
+        - script.allowed_types=inline
+        - script.allowed_contexts=filter
+```
+
+Those last couple lines are pretty important and your script won\'t work
+without them. Build your image with `docker-compose build` and then
+start Elastic with `docker-compose up`.
+
+[]{#5-use-your-plugin}
+
+## 5. Use your plugin [\#](#5-use-your-plugin "Direct link to 5. Use your plugin"){.anchor aria-label="Direct link to 5. Use your plugin"}
+
+To actually see the plugin in action, first create an index and add some
+documents (I\'ll assume you\'re able to do this if you\'ve read this far
+into this post). Then, make a query with `curl` (or your Elastic wrapper
+of choice), substituting `full_text`, `yabba` and `index_name` with
+whatever makes sense for you:
+
+``` {.code-block .line-numbers}
+> curl -H "content-type: application/json" 
+-d '
+{
+  "query": {
+    "bool": {
+      "filter": {
+        "script": {
+          "script": {
+            "source": "contains_multiple",
+            "lang": "expert_scripts",
+            "params": {
+              "field": "full_text",
+              "term": "yabba",
+              "count": 3
+            }
+          }
+        }
+      }
+    }
+  }
+}' 
+"localhost:9200/index_name/_search?pretty"
+```
+
+The result should be something like:
+
+``` {.code-block .line-numbers}
+{
+  "took" : 6,
+  "timed_out" : false,
+  "_shards" : {
+    "total" : 1,
+    "successful" : 1,
+    "skipped" : 0,
+    "failed" : 0
+  },
+  "hits" : {
+    "total" : {
+      "value" : 1,
+      "relation" : "eq"
+    },
+    "max_score" : 0.0,
+    "hits" : [
+      {
+        "_index" : "index_name",
+        "_type" : "_doc",
+        "_id" : "10",
+        ...
+```
+
+So that\'s that, an ElasticSearch plugin from start-to-finish. I\'m sure
+there are better ways to do some of this stuff, and if you\'re aware of
+any, let us know in the comments or write your own dang blog.