Pull in Viget posts

This commit is contained in:
David Eisinger
2023-10-22 23:52:56 -04:00
parent 625d374135
commit 0438a6d828
77 changed files with 8219 additions and 5 deletions

@@ -0,0 +1,427 @@
---
title: "Let's Write a Dang ElasticSearch Plugin"
date: 2021-03-15T00:00:00+00:00
draft: false
needs_review: true
canonical_url: https://www.viget.com/articles/lets-write-a-dang-elasticsearch-plugin/
---
One of our current projects involves a complex interactive query builder
to search a large collection of news items. Some of the conditionals
fall outside of the sweet spot of Postgres (e.g. word X must appear
within Y words of word Z), and so we opted to pull in
[ElasticSearch](https://www.elastic.co/elasticsearch/) alongside it.
It's worked perfectly, hitting all of our condition and grouping needs
with one exception: we need to be able to filter for articles that
contain a term a minimum number of times (so "Apple" must appear in
the article 3 times, for example). Frustratingly, Elastic *totally* has
this information via its
[`term_vector`](https://www.elastic.co/guide/en/elasticsearch/reference/current/term-vector.html)
feature, but you can't use that data inside a query, at least as far as
I can tell.
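
To make the requirement concrete, here's a minimal plain-Java sketch of the check we want Elastic to run for us: count how often a term shows up in a text and compare that to a minimum. Nothing here is part of the plugin. The names (`TermCountSketch`, `termFrequency`, `containsAtLeast`) are made up for illustration, and the lowercase-and-split tokenizer is an assumption standing in for whatever analyzer your index actually uses.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of the plugin.
final class TermCountSketch {

    // Counts occurrences of `term` in `text` using a naive tokenizer
    // (lowercase, then split on anything that is not a letter or digit).
    static int termFrequency(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    // The filter we are after: does the term appear at least `minCount` times?
    static boolean containsAtLeast(String text, String term, int minCount) {
        return termFrequency(text, term) >= minCount;
    }

    public static void main(String[] args) {
        String article = "Apple announced a new phone. Analysts say Apple is ahead; Apple disagrees.";
        System.out.println(containsAtLeast(article, "apple", 3));
        System.out.println(containsAtLeast(article, "phone", 2));
    }
}
```

Doing this in application code means pulling every candidate document out of the index first, which is exactly what we'd like to avoid by pushing the check into the query itself.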

The solution, it seems, is to write a custom plugin. I figured it out,
eventually, but it took a lot of trial and error, as the documentation I
was able to find is largely outdated or incomplete. So I figured I'd
write up what I learned while it's still fresh in my mind, in the hopes
that someone else might have an easier time of it. That's what internet
friends are for, after all.

Quick note before we start: all the version numbers you see are current
and working as of February 25, 2021. Hopefully this post ages well, but
if you try this out and hit issues, bumping the versions of Elastic,
Gradle, and maybe even Java is probably a good place to start. Also, I
use `projectname` a lot in the code examples; that's not a special
word and you should change it to something that makes sense for you.
[]{#1-set-up-a-java-development-environment}
## 1. Set up a Java development environment
First off, you're gonna be writing some Java. That's not my usual
thing, so the first step was to get a working environment to compile my
code. To do that, we'll use [Docker](https://www.docker.com/). Here's
a `Dockerfile`:
``` {.code-block .line-numbers}
FROM adoptopenjdk/openjdk12:jdk-12.0.2_10-ubuntu

RUN apt-get update && \
    apt-get install -y zip unzip && \
    rm -rf /var/lib/apt/lists/*

SHELL ["/bin/bash", "-c"]

RUN curl -s "https://get.sdkman.io" | bash && \
    source "/root/.sdkman/bin/sdkman-init.sh" && \
    sdk install gradle 6.8.2

WORKDIR /plugin
```
We use a base image with all the Java stuff but also a working Ubuntu
install so that we can do normal Linux-y things inside our container.
From your terminal, build the image:
`> docker build . -t projectname-java`
Then, spin up the container and start an interactive shell, mounting
your local working directory into `/plugin`:
`> docker run --rm -it -v ${PWD}:/plugin projectname-java bash`
[]{#2-configure-gradle}
## 2. Configure Gradle
[Gradle](https://gradle.org/) is a "build automation tool for
multi-language software development," and what Elastic recommends for
plugin development. Configuring Gradle to build the plugin properly was
the hardest part of this whole endeavor. Throw this into `build.gradle`
in your project root:
``` {.code-block .line-numbers}
buildscript {
  repositories {
    mavenLocal()
    mavenCentral()
    jcenter()
  }

  dependencies {
    classpath "org.elasticsearch.gradle:build-tools:7.11.1"
  }
}

apply plugin: 'java'

compileJava {
  sourceCompatibility = JavaVersion.VERSION_12
  targetCompatibility = JavaVersion.VERSION_12
}

apply plugin: 'elasticsearch.esplugin'

group = "com.projectname"
version = "0.0.1"

esplugin {
  name 'contains-multiple'
  description 'A script for finding documents that match a term a certain number of times'
  classname 'com.projectname.containsmultiple.ContainsMultiplePlugin'
  licenseFile rootProject.file('LICENSE.txt')
  noticeFile rootProject.file('NOTICE.txt')
}

validateNebulaPom.enabled = false
```
You'll also need files named `LICENSE.txt` and `NOTICE.txt`; mine
are empty, since the plugin is for internal use only. If you're going
to be releasing your plugin in some public way, maybe talk to a lawyer
about what to put in those files.
[]{#3-write-the-dang-plugin}
## 3. Write the dang plugin
To write the actual plugin, I started with [this example
plugin](https://github.com/elastic/elasticsearch/blob/master/plugins/examples/script-expert-scoring/src/main/java/org/elasticsearch/example/expertscript/ExpertScriptPlugin.java)
which scores a document based on the frequency of a given term. My use
case was fortunately quite similar, though I'm using a `filter` query,
meaning I just want a boolean, i.e. does this document contain this term
the requisite number of times? As such, I implemented a
[`FilterScript`](https://www.javadoc.io/doc/org.elasticsearch/elasticsearch/latest/org/elasticsearch/script/FilterScript.html)
rather than the `ScoreScript` implemented in the example code.
This file lives in (deep breath)
`src/main/java/com/projectname/containsmultiple/ContainsMultiplePlugin.java`:
``` {.code-block .line-numbers}
package com.projectname.containsmultiple;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.ScriptPlugin;
import org.elasticsearch.script.FilterScript;
import org.elasticsearch.script.FilterScript.LeafFactory;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import org.elasticsearch.script.ScriptFactory;
import org.elasticsearch.search.lookup.SearchLookup;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collection;
import java.util.Map;
import java.util.Set;

/**
 * A script for finding documents that match a term a certain number of times
 */
public class ContainsMultiplePlugin extends Plugin implements ScriptPlugin {

    @Override
    public ScriptEngine getScriptEngine(
        Settings settings,
        Collection<ScriptContext<?>> contexts
    ) {
        return new ContainsMultipleEngine();
    }

    // tag::contains_multiple
    private static class ContainsMultipleEngine implements ScriptEngine {
        @Override
        public String getType() {
            return "expert_scripts";
        }

        @Override
        public <T> T compile(
            String scriptName,
            String scriptSource,
            ScriptContext<T> context,
            Map<String, String> params
        ) {
            if (context.equals(FilterScript.CONTEXT) == false) {
                throw new IllegalArgumentException(getType()
                        + " scripts cannot be used for context ["
                        + context.name + "]");
            }
            // we use the script "source" as the script identifier
            if ("contains_multiple".equals(scriptSource)) {
                FilterScript.Factory factory = new ContainsMultipleFactory();
                return context.factoryClazz.cast(factory);
            }
            throw new IllegalArgumentException("Unknown script name "
                    + scriptSource);
        }

        @Override
        public void close() {
            // optionally close resources
        }

        @Override
        public Set<ScriptContext<?>> getSupportedContexts() {
            return Set.of(FilterScript.CONTEXT);
        }

        private static class ContainsMultipleFactory implements FilterScript.Factory,
                                                      ScriptFactory {
            @Override
            public boolean isResultDeterministic() {
                return true;
            }

            @Override
            public LeafFactory newFactory(
                Map<String, Object> params,
                SearchLookup lookup
            ) {
                return new ContainsMultipleLeafFactory(params, lookup);
            }
        }

        private static class ContainsMultipleLeafFactory implements LeafFactory {
            private final Map<String, Object> params;
            private final SearchLookup lookup;
            private final String field;
            private final String term;
            private final int count;

            private ContainsMultipleLeafFactory(
                        Map<String, Object> params, SearchLookup lookup) {
                if (params.containsKey("field") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [field]");
                }
                if (params.containsKey("term") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [term]");
                }
                if (params.containsKey("count") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [count]");
                }
                this.params = params;
                this.lookup = lookup;
                field = params.get("field").toString();
                term = params.get("term").toString();
                count = Integer.parseInt(params.get("count").toString());
            }

            @Override
            public FilterScript newInstance(LeafReaderContext context)
                    throws IOException {
                PostingsEnum postings = context.reader().postings(
                        new Term(field, term));
                if (postings == null) {
                    /*
                     * the field and/or term don't exist in this segment,
                     * so no document in it can match: always return false
                     */
                    return new FilterScript(params, lookup, context) {
                        @Override
                        public boolean execute() {
                            return false;
                        }
                    };
                }
                return new FilterScript(params, lookup, context) {
                    int currentDocid = -1;

                    @Override
                    public void setDocument(int docid) {
                        /*
                         * advance has undefined behavior calling with
                         * a docid <= its current docid
                         */
                        if (postings.docID() < docid) {
                            try {
                                postings.advance(docid);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        }
                        currentDocid = docid;
                    }

                    @Override
                    public boolean execute() {
                        if (postings.docID() != currentDocid) {
                            /*
                             * advance moved past the current doc, so this
                             * doc has no occurrences of the term
                             */
                            return false;
                        }
                        try {
                            return postings.freq() >= count;
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                };
            }
        }
    }
    // end::contains_multiple
}
```
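
The `setDocument`/`execute` pair is the least obvious part of that file, so here's a small stand-alone sketch of the same forward-only walk with no Lucene or Elastic on the classpath. `PostingsSketch` is a hypothetical stand-in that I made up for illustration; it only imitates the three `PostingsEnum` calls the plugin uses (`docID()`, `advance()`, `freq()`), and the doc ids and frequencies in `main` are invented sample data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Lucene's PostingsEnum: the sorted doc ids that
// contain ONE term, plus how often the term occurs in each of them.
final class PostingsSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docIds; // ascending
    private final int[] freqs;  // freqs[i] belongs to docIds[i]
    private int position = -1;  // -1 means "before the first posting"

    PostingsSketch(int[] docIds, int[] freqs) {
        this.docIds = docIds;
        this.freqs = freqs;
    }

    int docID() {
        if (position < 0) {
            return -1;
        }
        return position < docIds.length ? docIds[position] : NO_MORE_DOCS;
    }

    // Moves forward to the first posting whose doc id is >= target.
    // It always steps at least once, which is why callers must not
    // call it with a target at or below the current doc id.
    int advance(int target) {
        do {
            position++;
        } while (position < docIds.length && docIds[position] < target);
        return docID();
    }

    int freq() {
        return freqs[position];
    }

    // Mirrors the plugin's setDocument + execute for every doc in a segment.
    static List<Integer> matchingDocs(PostingsSketch postings, int maxDoc, int minCount) {
        List<Integer> matches = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            // setDocument: only advance when the postings are behind this doc
            if (postings.docID() < docId) {
                postings.advance(docId);
            }
            // execute: the term is in this doc only if advance landed on it
            if (postings.docID() == docId && postings.freq() >= minCount) {
                matches.add(docId);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // The term occurs in docs 1, 4 and 7, with 2, 3 and 5 occurrences.
        PostingsSketch postings = new PostingsSketch(new int[] {1, 4, 7}, new int[] {2, 3, 5});
        System.out.println(matchingDocs(postings, 10, 3)); // prints [4, 7]
    }
}
```

The point of the guard is that the postings cursor only ever moves forward. If it has already skipped past the document being checked, that document doesn't contain the term at all, so there's no frequency to compare.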
[]{#4-add-it-to-elasticSearch}
## 4. Add it to ElasticSearch
With our code in place (and synced into our Docker container with a
mounted volume), it's time to compile it. In the Docker shell you
started up in step #1, build your plugin:
`> gradle build`
Assuming that works, you should now see a `build` directory with a bunch
of stuff in it. The file you care about is
`build/distributions/contains-multiple-0.0.1.zip` (though that'll
obviously change if you call your plugin something different or give it
a different version number). Grab that file and copy it to where you
plan to actually run ElasticSearch. I placed it in a folder
called `.docker/elastic` in the main project repo. In that same
directory, create a new `Dockerfile` that'll actually run Elastic:
``` {.code-block .line-numbers}
FROM docker.elastic.co/elasticsearch/elasticsearch:7.11.1

COPY .docker/elastic/contains-multiple-0.0.1.zip /plugins/contains-multiple-0.0.1.zip

RUN elasticsearch-plugin install \
    file:///plugins/contains-multiple-0.0.1.zip
```
Then, in your project root, create the following `docker-compose.yml`:
``` {.code-block .line-numbers}
version: '3.2'

services:
  elasticsearch:
    image: projectname_elasticsearch
    build:
      context: .
      dockerfile: ./.docker/elastic/Dockerfile
    ports:
      - 9200:9200
    environment:
      - discovery.type=single-node
      - script.allowed_types=inline
      - script.allowed_contexts=filter
```
Those last couple lines are pretty important and your script won't work
without them. Build your image with `docker-compose build` and then
start Elastic with `docker-compose up`.
[]{#5-use-your-plugin}
## 5. Use your plugin
To actually see the plugin in action, first create an index and add some
documents (I'll assume you're able to do this if you've read this far
into this post). Then, make a query with `curl` (or your Elastic wrapper
of choice), substituting `full_text`, `yabba` and `index_name` with
whatever makes sense for you:
``` {.code-block .line-numbers}
> curl -H "content-type: application/json" \
  -d '
{
  "query": {
    "bool": {
      "filter": {
        "script": {
          "script": {
            "source": "contains_multiple",
            "lang": "expert_scripts",
            "params": {
              "field": "full_text",
              "term": "yabba",
              "count": 3
            }
          }
        }
      }
    }
  }
}' \
  "localhost:9200/index_name/_search?pretty"
```
The result should be something like:
``` {.code-block .line-numbers}
{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "index_name",
        "_type" : "_doc",
        "_id" : "10",
        ...
```
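
If you'd rather fire that query from Java than from `curl`, something like the following should get you close. Treat it as a rough sketch under a few assumptions: it uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) rather than any official Elastic client, the class and method names (`ContainsMultipleQuerySketch`, `buildQuery`, `search`, `jsonString`) are made up for illustration, and the hand-rolled JSON escaping only covers quotes, backslashes, and control characters.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; not part of the plugin or of any Elastic client.
final class ContainsMultipleQuerySketch {

    // Wraps a value in quotes, escaping what would break a JSON string.
    static String jsonString(String value) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : value.toCharArray()) {
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c < 0x20) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.append('"').toString();
    }

    // Builds the same request body as the curl example.
    static String buildQuery(String field, String term, int count) {
        return "{\"query\":{\"bool\":{\"filter\":{\"script\":{\"script\":{"
            + "\"source\":\"contains_multiple\","
            + "\"lang\":\"expert_scripts\","
            + "\"params\":{"
            + "\"field\":" + jsonString(field) + ","
            + "\"term\":" + jsonString(term) + ","
            + "\"count\":" + count
            + "}}}}}}}";
    }

    // POSTs the body to <baseUrl>/<index>/_search and returns the raw response.
    static String search(String baseUrl, String index, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/" + index + "/_search?pretty"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        String body = buildQuery("full_text", "yabba", 3);
        System.out.println(body);
        // Only talk to Elastic when a base URL and index name are supplied,
        // e.g. http://localhost:9200 index_name
        if (args.length == 2) {
            System.out.println(search(args[0], args[1], body));
        }
    }
}
```

Note that `count` goes out as a JSON number here, just like in the `curl` example. The plugin parses it with `Integer.parseInt(params.get("count").toString())`, so a whole number is what it expects.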
So that's that, an ElasticSearch plugin from start-to-finish. I'm sure
there are better ways to do some of this stuff, and if you're aware of
any, let us know in the comments or write your own dang blog.