Pull in Viget posts
This commit is contained in:
@@ -0,0 +1,427 @@
|
||||
---
|
||||
title: "Let’s Write a Dang ElasticSearch Plugin"
|
||||
date: 2021-03-15T00:00:00+00:00
|
||||
draft: false
|
||||
needs_review: true
|
||||
canonical_url: https://www.viget.com/articles/lets-write-a-dang-elasticsearch-plugin/
|
||||
---
|
||||
|
||||
One of our current projects involves a complex interactive query builder
|
||||
to search a large collection of news items. Some of the conditionals
|
||||
fall outside of the sweet spot of Postgres (e.g. word X must appear
|
||||
within Y words of word Z), and so we opted to pull in
|
||||
[ElasticSearch](https://www.elastic.co/elasticsearch/) alongside it.
|
||||
It\'s worked perfectly, hitting all of our condition and grouping needs
|
||||
with one exception: we need to be able to filter for articles that
|
||||
contain a term a minimum number of times (so \"Apple\" must appear in
|
||||
the article 3 times, for example). Frustratingly, Elastic *totally* has
|
||||
this information via its
|
||||
[`term_vector`](https://www.elastic.co/guide/en/elasticsearch/reference/current/term-vector.html)
|
||||
feature, but you can\'t use that data inside a query, as least as far as
|
||||
I can tell.
|
||||
|
||||
The solution, it seems, is to write a custom plugin. I figured it out,
|
||||
eventually, but it was a lot of trial-and-error as the documentation I
|
||||
was able to find is largely outdated or incomplete. So I figured I\'d
|
||||
take what I learned while it\'s still fresh in my mind in the hopes that
|
||||
someone else might have an easier time of it. That\'s what internet
|
||||
friends are for, after all.
|
||||
|
||||
Quick note before we start: all the version numbers you see are current
|
||||
and working as of February 25, 2021. Hopefully this post ages well, but
|
||||
if you try this out and hit issues, bumping the versions of Elastic,
|
||||
Gradle, and maybe even Java is probably a good place to start. Also, I
|
||||
use `projectname` a lot in the code examples --- that\'s not a special
|
||||
word and you should change it to something that makes sense for you.
|
||||
|
||||
[]{#1-set-up-a-java-development-environment}
|
||||
|
||||
## 1. Set up a Java development environment [\#](#1-set-up-a-java-development-environment "Direct link to 1. Set up a Java development environment"){.anchor aria-label="Direct link to 1. Set up a Java development environment"}
|
||||
|
||||
First off, you\'re gonna be writing some Java. That\'s not my usual
|
||||
thing, so the first step was to get a working environment to compile my
|
||||
code. To do that, we\'ll use [Docker](https://www.docker.com/). Here\'s
|
||||
a `Dockerfile`:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
FROM adoptopenjdk/openjdk12:jdk-12.0.2_10-ubuntu
|
||||
|
||||
RUN apt-get update &&
|
||||
apt-get install -y zip unzip &&
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
SHELL ["/bin/bash", "-c"]
|
||||
|
||||
RUN curl -s "https://get.sdkman.io" | bash &&
|
||||
source "/root/.sdkman/bin/sdkman-init.sh" &&
|
||||
sdk install gradle 6.8.2
|
||||
|
||||
WORKDIR /plugin
|
||||
```
|
||||
|
||||
We use a base image with all the Java stuff but also a working Ubuntu
|
||||
install so that we can do normal Linux-y things inside our container.
|
||||
From your terminal, build the image:
|
||||
|
||||
`> docker build . -t projectname-java`
|
||||
|
||||
Then, spin up the container and start an interactive shell, mounting
|
||||
your local working directory into `/plugin`:
|
||||
|
||||
`> docker run --rm -it -v ${PWD}:/plugin projectname-java bash`
|
||||
|
||||
[]{#2-configure-gradle}
|
||||
|
||||
## 2. Configure Gradle [\#](#2-configure-gradle "Direct link to 2. Configure Gradle"){.anchor aria-label="Direct link to 2. Configure Gradle"}
|
||||
|
||||
[Gradle](https://gradle.org/) is a \"build automation tool for
|
||||
multi-language software development,\" and what Elastic recommends for
|
||||
plugin development. Configuring Gradle to build the plugin properly was
|
||||
the hardest part of this whole endeavor. Throw this into `build.gradle`
|
||||
in your project root:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
buildscript {
|
||||
repositories {
|
||||
mavenLocal()
|
||||
mavenCentral()
|
||||
jcenter()
|
||||
}
|
||||
|
||||
dependencies {
|
||||
classpath "org.elasticsearch.gradle:build-tools:7.11.1"
|
||||
}
|
||||
}
|
||||
|
||||
apply plugin: 'java'
|
||||
|
||||
compileJava {
|
||||
sourceCompatibility = JavaVersion.VERSION_12
|
||||
targetCompatibility = JavaVersion.VERSION_12
|
||||
}
|
||||
|
||||
apply plugin: 'elasticsearch.esplugin'
|
||||
|
||||
group = "com.projectname"
|
||||
version = "0.0.1"
|
||||
|
||||
esplugin {
|
||||
name 'contains-multiple'
|
||||
description 'A script for finding documents that match a term a certain number of times'
|
||||
classname 'com.projectname.containsmultiple.ContainsMultiplePlugin'
|
||||
licenseFile rootProject.file('LICENSE.txt')
|
||||
noticeFile rootProject.file('NOTICE.txt')
|
||||
}
|
||||
|
||||
validateNebulaPom.enabled = false
|
||||
```
|
||||
|
||||
You\'ll also need files named `LICENSE.txt` and `NOTICE.txt` --- mine
|
||||
are empty, since the plugin is for internal use only. If you\'re going
|
||||
to be releasing your plugin in some public way, maybe talk to a lawyer
|
||||
about what to put in those files.
|
||||
|
||||
[]{#3-write-the-dang-plugin}
|
||||
|
||||
## 3. Write the dang plugin [\#](#3-write-the-dang-plugin "Direct link to 3. Write the dang plugin"){.anchor aria-label="Direct link to 3. Write the dang plugin"}
|
||||
|
||||
To write the actual plugin, I started with [this example
|
||||
plugin](https://github.com/elastic/elasticsearch/blob/master/plugins/examples/script-expert-scoring/src/main/java/org/elasticsearch/example/expertscript/ExpertScriptPlugin.java)
|
||||
which scores a document based on the frequency of a given term. My use
|
||||
case was fortunately quite similar, though I\'m using a `filter` query,
|
||||
meaning I just want a boolean, i.e. does this document contain this term
|
||||
the requisite number of times? As such, I implemented a
|
||||
[`FilterScript`](https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/latest/org/elasticsearch/script/FilterScript.html)
|
||||
rather than the `ScoreScript` implemented in the example code.
|
||||
|
||||
This file lives in (deep breath)
|
||||
`src/main/java/com/projectname/containsmultiple/ContainsMultiplePlugin.java`:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
package com.projectname.containsmultiple;
|
||||
|
||||
import org.apache.lucene.index.LeafReaderContext;
|
||||
import org.apache.lucene.index.PostingsEnum;
|
||||
import org.apache.lucene.index.Term;
|
||||
import org.elasticsearch.common.settings.Settings;
|
||||
import org.elasticsearch.plugins.Plugin;
|
||||
import org.elasticsearch.plugins.ScriptPlugin;
|
||||
import org.elasticsearch.script.FilterScript;
|
||||
import org.elasticsearch.script.FilterScript.LeafFactory;
|
||||
import org.elasticsearch.script.ScriptContext;
|
||||
import org.elasticsearch.script.ScriptEngine;
|
||||
import org.elasticsearch.script.ScriptFactory;
|
||||
import org.elasticsearch.search.lookup.SearchLookup;
|
||||
|
||||
import java.io.IOException;
|
||||
import java.io.UncheckedIOException;
|
||||
import java.util.Collection;
|
||||
import java.util.Map;
|
||||
import java.util.Set;
|
||||
|
||||
/**
|
||||
* A script for finding documents that match a term a certain number of times
|
||||
*/
|
||||
public class ContainsMultiplePlugin extends Plugin implements ScriptPlugin {
|
||||
|
||||
@Override
|
||||
public ScriptEngine getScriptEngine(
|
||||
Settings settings,
|
||||
Collection<ScriptContext<?>> contexts
|
||||
) {
|
||||
return new ContainsMultipleEngine();
|
||||
}
|
||||
|
||||
// tag::contains_multiple
|
||||
private static class ContainsMultipleEngine implements ScriptEngine {
|
||||
@Override
|
||||
public String getType() {
|
||||
return "expert_scripts";
|
||||
}
|
||||
|
||||
@Override
|
||||
public <T> T compile(
|
||||
String scriptName,
|
||||
String scriptSource,
|
||||
ScriptContext<T> context,
|
||||
Map<String, String> params
|
||||
) {
|
||||
if (context.equals(FilterScript.CONTEXT) == false) {
|
||||
throw new IllegalArgumentException(getType()
|
||||
+ " scripts cannot be used for context ["
|
||||
+ context.name + "]");
|
||||
}
|
||||
// we use the script "source" as the script identifier
|
||||
if ("contains_multiple".equals(scriptSource)) {
|
||||
FilterScript.Factory factory = new ContainsMultipleFactory();
|
||||
return context.factoryClazz.cast(factory);
|
||||
}
|
||||
throw new IllegalArgumentException("Unknown script name "
|
||||
+ scriptSource);
|
||||
}
|
||||
|
||||
@Override
|
||||
public void close() {
|
||||
// optionally close resources
|
||||
}
|
||||
|
||||
@Override
|
||||
public Set<ScriptContext<?>> getSupportedContexts() {
|
||||
return Set.of(FilterScript.CONTEXT);
|
||||
}
|
||||
|
||||
private static class ContainsMultipleFactory implements FilterScript.Factory,
|
||||
ScriptFactory {
|
||||
@Override
|
||||
public boolean isResultDeterministic() {
|
||||
return true;
|
||||
}
|
||||
|
||||
@Override
|
||||
public LeafFactory newFactory(
|
||||
Map<String, Object> params,
|
||||
SearchLookup lookup
|
||||
) {
|
||||
return new ContainsMultipleLeafFactory(params, lookup);
|
||||
}
|
||||
}
|
||||
|
||||
private static class ContainsMultipleLeafFactory implements LeafFactory {
|
||||
private final Map<String, Object> params;
|
||||
private final SearchLookup lookup;
|
||||
private final String field;
|
||||
private final String term;
|
||||
private final int count;
|
||||
|
||||
private ContainsMultipleLeafFactory(
|
||||
Map<String, Object> params, SearchLookup lookup) {
|
||||
if (params.containsKey("field") == false) {
|
||||
throw new IllegalArgumentException(
|
||||
"Missing parameter [field]");
|
||||
}
|
||||
if (params.containsKey("term") == false) {
|
||||
throw new IllegalArgumentException(
|
||||
"Missing parameter [term]");
|
||||
}
|
||||
if (params.containsKey("count") == false) {
|
||||
throw new IllegalArgumentException(
|
||||
"Missing parameter [count]");
|
||||
}
|
||||
this.params = params;
|
||||
this.lookup = lookup;
|
||||
field = params.get("field").toString();
|
||||
term = params.get("term").toString();
|
||||
count = Integer.parseInt(params.get("count").toString());
|
||||
}
|
||||
|
||||
@Override
|
||||
public FilterScript newInstance(LeafReaderContext context)
|
||||
throws IOException {
|
||||
PostingsEnum postings = context.reader().postings(
|
||||
new Term(field, term));
|
||||
if (postings == null) {
|
||||
/*
|
||||
* the field and/or term don't exist in this segment,
|
||||
* so always return 0
|
||||
*/
|
||||
return new FilterScript(params, lookup, context) {
|
||||
@Override
|
||||
public boolean execute() {
|
||||
return false;
|
||||
}
|
||||
};
|
||||
}
|
||||
return new FilterScript(params, lookup, context) {
|
||||
int currentDocid = -1;
|
||||
@Override
|
||||
public void setDocument(int docid) {
|
||||
/*
|
||||
* advance has undefined behavior calling with
|
||||
* a docid <= its current docid
|
||||
*/
|
||||
if (postings.docID() < docid) {
|
||||
try {
|
||||
postings.advance(docid);
|
||||
} catch (IOException e) {
|
||||
throw new UncheckedIOException(e);
|
||||
}
|
||||
}
|
||||
currentDocid = docid;
|
||||
}
|
||||
@Override
|
||||
public boolean execute() {
|
||||
if (postings.docID() != currentDocid) {
|
||||
/*
|
||||
* advance moved past the current doc, so this
|
||||
* doc has no occurrences of the term
|
||||
*/
|
||||
return false;
|
||||
}
|
||||
try {
|
||||
return postings.freq() >= count;
|
||||
} catch (IOException e) {
|
||||
throw new UncheckedIOException(e);
|
||||
}
|
||||
}
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
// end::contains_multiple
|
||||
}
|
||||
```
|
||||
|
||||
[]{#4-add-it-to-elasticSearch}
|
||||
|
||||
## 4. Add it to ElasticSearch [\#](#4-add-it-to-elasticSearch "Direct link to 4. Add it to ElasticSearch"){.anchor aria-label="Direct link to 4. Add it to ElasticSearch"}
|
||||
|
||||
With our code in place (and synced into our Docker container with a
|
||||
mounted volume), it\'s time to compile it. In the Docker shell you
|
||||
started up in step #1, build your plugin:
|
||||
|
||||
`> gradle build`
|
||||
|
||||
Assuming that works, you should now see a `build` directory with a bunch
|
||||
of stuff in it. The file you care about is
|
||||
`build/distributions/contains-multiple-0.0.1.zip` (though that\'ll
|
||||
obviously change if you call your plugin something different or give it
|
||||
a different version number). Grab that file and copy it to where you
|
||||
plan to actually run ElasticSearch. For me, I placed it in a folder
|
||||
called `.docker/elastic` in the main project repo. In that same
|
||||
directory, create a new `Dockerfile` that\'ll actually run Elastic:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
FROM docker.elastic.co/elasticsearch/elasticsearch:7.11.1
|
||||
|
||||
COPY .docker/elastic/contains-multiple-0.0.1.zip /plugins/contains-multiple-0.0.1.zip
|
||||
|
||||
RUN elasticsearch-plugin install
|
||||
file:///plugins/contains-multiple-0.0.1.zip
|
||||
```
|
||||
|
||||
Then, in your project root, create the following `docker-compose.yml`:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
version: '3.2'
|
||||
|
||||
services: elasticsearch:
|
||||
image: projectname_elasticsearch
|
||||
build:
|
||||
context: .
|
||||
dockerfile: ./.docker/elastic/Dockerfile
|
||||
ports:
|
||||
- 9200:9200
|
||||
environment:
|
||||
- discovery.type=single-node
|
||||
- script.allowed_types=inline
|
||||
- script.allowed_contexts=filter
|
||||
```
|
||||
|
||||
Those last couple lines are pretty important and your script won\'t work
|
||||
without them. Build your image with `docker-compose build` and then
|
||||
start Elastic with `docker-compose up`.
|
||||
|
||||
[]{#5-use-your-plugin}
|
||||
|
||||
## 5. Use your plugin [\#](#5-use-your-plugin "Direct link to 5. Use your plugin"){.anchor aria-label="Direct link to 5. Use your plugin"}
|
||||
|
||||
To actually see the plugin in action, first create an index and add some
|
||||
documents (I\'ll assume you\'re able to do this if you\'ve read this far
|
||||
into this post). Then, make a query with `curl` (or your Elastic wrapper
|
||||
of choice), substituting `full_text`, `yabba` and `index_name` with
|
||||
whatever makes sense for you:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
> curl -H "content-type: application/json"
|
||||
-d '
|
||||
{
|
||||
"query": {
|
||||
"bool": {
|
||||
"filter": {
|
||||
"script": {
|
||||
"script": {
|
||||
"source": "contains_multiple",
|
||||
"lang": "expert_scripts",
|
||||
"params": {
|
||||
"field": "full_text",
|
||||
"term": "yabba",
|
||||
"count": 3
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}'
|
||||
"localhost:9200/index_name/_search?pretty"
|
||||
```
|
||||
|
||||
The result should be something like:
|
||||
|
||||
``` {.code-block .line-numbers}
|
||||
{
|
||||
"took" : 6,
|
||||
"timed_out" : false,
|
||||
"_shards" : {
|
||||
"total" : 1,
|
||||
"successful" : 1,
|
||||
"skipped" : 0,
|
||||
"failed" : 0
|
||||
},
|
||||
"hits" : {
|
||||
"total" : {
|
||||
"value" : 1,
|
||||
"relation" : "eq"
|
||||
},
|
||||
"max_score" : 0.0,
|
||||
"hits" : [
|
||||
{
|
||||
"_index" : "index_name",
|
||||
"_type" : "_doc",
|
||||
"_id" : "10",
|
||||
...
|
||||
```
|
||||
|
||||
So that\'s that, an ElasticSearch plugin from start-to-finish. I\'m sure
|
||||
there are better ways to do some of this stuff, and if you\'re aware of
|
||||
any, let us know in the comments or write your own dang blog.
|
||||
Reference in New Issue
Block a user