Class ElasticsearchCommitter

All Implemented Interfaces:
IBatchConsumer, ICommitter, IXMLConfigurable, AutoCloseable

public class ElasticsearchCommitter extends AbstractBatchCommitter

Commits documents to Elasticsearch. This committer relies on the Elasticsearch REST API.

"_id" field

Elasticsearch expects a field named "_id" that uniquely identifies each document. You can provide that field yourself in the documents you submit. If you do not, this committer creates one for you, using the document reference as the identifier value.
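
As a rough illustration of the fallback described above, the following sketch picks the "_id" value for a document. It is not the committer's code: the class IdResolver, the method resolveId, and the use of a plain Map for document fields are assumptions made for this example. The sourceIdField parameter mirrors the configuration option of the same name, and falling back to the reference when that field is missing from a document is also an assumption.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of the committer API.
final class IdResolver {

    // Returns the value to use as "_id": the value of sourceIdField when
    // that field is configured and present, else the document reference.
    static String resolveId(
            String reference, Map<String, String> fields, String sourceIdField) {
        if (sourceIdField != null) {
            String value = fields.get(sourceIdField);
            if (value != null && !value.isEmpty()) {
                return value;
            }
        }
        // Assumed fallback: no usable source field, use the reference.
        return reference;
    }

    public static void main(String[] args) {
        Map<String, String> fields = Map.of("sku", "ABC-123");
        System.out.println(resolveId("http://example.com/a", fields, "sku"));
        System.out.println(resolveId("http://example.com/a", fields, null));
    }
}
```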

"content" field

By default the "body" of a document is read as an input stream and stored in a "content" field. You can change that target field name with setTargetContentField(String). If you set the target content field to null, it will effectively skip storing the content stream.

Dots (.) in field names

Your Elasticsearch installation may interpret dots in field names as representing "objects", which may not always be what you want. If dots are causing you issues, either avoid submitting fields with dots, or use setDotReplacement(String) to replace dots with a character of your choice (e.g., underscore). If your dots do represent nested objects, keep reading.
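
The effect of a dot replacement can be sketched in a few lines. This is an illustration under stated assumptions, not the committer's implementation: the class DotReplacementSketch, the method replaceDots, and the Map-based document are hypothetical, and a null replacement is taken to mean "leave field names untouched", in line with the documented default.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the committer API.
final class DotReplacementSketch {

    // Returns a copy of the fields with every dot in a field name replaced.
    // A null replacement returns the fields unchanged.
    static Map<String, String> replaceDots(
            Map<String, String> fields, String dotReplacement) {
        if (dotReplacement == null) {
            return fields;
        }
        Map<String, String> renamed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            // String.replace performs a literal (non-regex) replacement.
            renamed.put(e.getKey().replace(".", dotReplacement), e.getValue());
        }
        return renamed;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("author.name", "Jane");
        fields.put("title", "Hello");
        // prints {author_name=Jane, title=Hello}
        System.out.println(replaceDots(fields, "_"));
    }
}
```

Note that in this sketch two distinct names such as "a.b" and "a_b" collapse into one key once replaced, so choose a replacement that does not already appear in your field names.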

JSON Objects

You can provide a regular expression identifying one or more fields that contain a JSON object rather than a regular string (setJsonFieldsPattern(String)). This is a useful way to store nested objects, for example. While very flexible, this approach can make it challenging to come up with the JSON structure, so you may want to consider custom code. For this to work properly, make sure you define your Elasticsearch field mappings on your index beforehand.
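
To make the distinction concrete, here is a minimal sketch of how a field matching the pattern could be written as raw JSON while other fields are written as quoted strings. Everything in it is an assumption for illustration: the class JsonFieldsPatternSketch and its methods are hypothetical, the pattern is assumed to match the whole field name, and the string escaping is deliberately simplified (quotes and backslashes only).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the committer API.
final class JsonFieldsPatternSketch {

    // True when a pattern is set and matches the entire field name
    // (full-match semantics are an assumption of this sketch).
    static boolean isJsonField(String fieldName, String jsonFieldsPattern) {
        return jsonFieldsPattern != null
                && Pattern.matches(jsonFieldsPattern, fieldName);
    }

    // Renders one JSON member. Values of matching fields are emitted as-is
    // (they are expected to already be valid JSON); others are quoted.
    static String toJsonMember(
            String name, String value, String jsonFieldsPattern) {
        String rendered = isJsonField(name, jsonFieldsPattern)
                ? value
                : "\"" + escape(value) + "\"";
        return "\"" + escape(name) + "\":" + rendered;
    }

    // Simplified escaping: backslashes and double quotes only.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        String pattern = "address|meta_.*";
        System.out.println(
                toJsonMember("address", "{\"city\":\"Paris\"}", pattern));
        System.out.println(toJsonMember("title", "He said \"hi\"", pattern));
    }
}
```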

Elasticsearch ID limitations

As of this writing, Elasticsearch 5 and higher limit the "_id" field to 512 bytes. By default, trying to submit a document with an invalid ID results in an error from Elasticsearch. You can get around this by calling setFixBadIds(boolean) with true. References that are too long are then truncated, and a hash code representing the truncated part is appended. This approach is not 100% collision-free, but it should safely cover the vast majority of cases.
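
The truncate-and-hash idea can be sketched as follows. This is not the committer's actual algorithm: the class IdTruncationSketch, the "!" separator, and the use of String.hashCode() rendered as hexadecimal are all assumptions chosen to keep the example short. The sketch measures length in UTF-8 bytes and avoids cutting a surrogate pair in half.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the committer API.
final class IdTruncationSketch {

    static final int MAX_ID_BYTES = 512;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Returns the id unchanged when it fits. Otherwise keeps the longest
    // prefix for which prefix + "!" + hash(removed tail) fits in 512 bytes.
    static String fixId(String id) {
        if (utf8Length(id) <= MAX_ID_BYTES) {
            return id;
        }
        for (int end = id.length() - 1; end >= 0; end--) {
            if (end > 0 && Character.isHighSurrogate(id.charAt(end - 1))) {
                continue; // do not split a surrogate pair
            }
            // Hash of the part being cut off (assumed hash function).
            String hash = Integer.toHexString(id.substring(end).hashCode());
            String candidate = id.substring(0, end) + "!" + hash;
            if (utf8Length(candidate) <= MAX_ID_BYTES) {
                return candidate;
            }
        }
        // Not reachable: an empty prefix plus "!" and 8 hex digits fits.
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        String longId = "http://example.com/" + "a".repeat(600);
        String fixed = fixId(longId);
        System.out.println(utf8Length(fixed) + " bytes: " + fixed);
    }
}
```

Because a 32-bit hash stands in for an arbitrarily long tail, two different references sharing the same kept prefix can in rare cases produce the same ID, which is the collision risk mentioned above.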

Type Name

As of Elasticsearch 7.0, index types are deprecated. If you are using Elasticsearch 7.0 or higher, do not configure the typeName, as doing so may cause errors. The typeName is available only for backward compatibility, for those using this committer with older versions of Elasticsearch.

Authentication

Basic authentication is supported for password-protected clusters. Alternatively, API Key authentication can be used by providing the encoded API key value via the setApiKey(String) method or the <apiKey> XML configuration element. When an API key is set, it takes precedence over basic credentials. The API key value should be the Base64-encoded string as provided by Elasticsearch (i.e., the value sent in the Authorization: ApiKey ... header).
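
For reference, the encoded value Elasticsearch returns when creating an API key (the "encoded" property of the response) is the Base64 encoding of the key id and secret joined by a colon. The sketch below shows that encoding and the resulting header value using only the JDK. The class and method names are hypothetical and the id and secret are placeholder values, not real credentials.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of the committer API.
final class ApiKeyHeaderSketch {

    // Base64 of "id:api_key" in UTF-8, the form expected by setApiKey.
    static String encodeApiKey(String id, String apiKey) {
        byte[] raw = (id + ":" + apiKey).getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(raw);
    }

    // Value of the HTTP Authorization header for API Key authentication.
    static String authorizationHeader(String encodedApiKey) {
        return "ApiKey " + encodedApiKey;
    }

    public static void main(String[] args) {
        // Placeholder id and secret, for demonstration only.
        String encoded = encodeApiKey("my-key-id", "my-key-secret");
        System.out.println("Authorization: " + authorizationHeader(encoded));
    }
}
```

If you already have the encoded value from Elasticsearch, pass it as-is. Encoding it a second time would produce an invalid key.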

Timeouts

You can specify connection and socket timeout values for when this committer sends documents to Elasticsearch (see setConnectionTimeout(int) and setSocketTimeout(int)).

XML configuration usage:


<committer
    class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <nodes>
    (Comma-separated list of Elasticsearch node URLs.
    Defaults to http://localhost:9200)
  </nodes>
  <indexName>(Name of the index to use)</indexName>
  <typeName>
    (Name of the type to use. Deprecated since Elasticsearch v7.)
  </typeName>
  <ignoreResponseErrors>[false|true]</ignoreResponseErrors>
  <discoverNodes>[false|true]</discoverNodes>
  <dotReplacement>
    (Optional value replacing dots in field names)
  </dotReplacement>
  <jsonFieldsPattern>
    (Optional regular expression to identify fields containing JSON
    objects instead of regular strings)
  </jsonFieldsPattern>
  <connectionTimeout>(milliseconds)</connectionTimeout>
  <socketTimeout>(milliseconds)</socketTimeout>
  <fixBadIds>
    [false|true](Forces references to fit into Elasticsearch _id field.)
  </fixBadIds>
  <!-- Use "credentials" for basic auth, or "apiKey" for API Key auth. -->
  <credentials/>
  <apiKey>
    (Base64-encoded API key for Elasticsearch API Key authentication.
     When set, takes precedence over basic credentials.)
  </apiKey>
  <sourceIdField>
    (Optional document field name containing the value that will be stored
    in Elasticsearch "_id" field. Default is the document reference.)
  </sourceIdField>
  <targetContentField>
    (Optional Elasticsearch field name to store the document
    content/body. Default is "content".)
  </targetContentField>
</committer>

XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
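
As a worked example of the equivalence, "5 minutes and 30 seconds" corresponds to 330000 milliseconds. The sketch below computes that value with the JDK's java.time.Duration. It does not use DurationParser itself, and the class and method names are hypothetical. It can be handy when you set timeouts through the int-based setters rather than through XML.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not part of the committer API.
final class TimeoutMillisSketch {

    // Converts minutes and seconds to a millisecond count.
    static long toMillis(long minutes, long seconds) {
        return Duration.ofMinutes(minutes).plusSeconds(seconds).toMillis();
    }

    public static void main(String[] args) {
        // 5 minutes and 30 seconds = 330 seconds; prints 330000
        System.out.println(toMillis(5, 30));
    }
}
```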

XML usage example:


<committer
    class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
  <indexName>some_index</indexName>
</committer>

The above example uses the minimum required settings, on the local host.

Author:
Pascal Essiembre
  • Constructor Details

    • ElasticsearchCommitter

      public ElasticsearchCommitter()
  • Method Details

    • getNodes

      public List<String> getNodes()
      Gets an unmodifiable list of Elasticsearch cluster node URLs. Defaults to "http://localhost:9200".
      Returns:
      Elasticsearch nodes
    • setNodes

      public void setNodes(String... nodes)
      Sets cluster node URLs. Node URLs with no port are assumed to be using port 80.
      Parameters:
      nodes - Elasticsearch cluster nodes
    • setNodes

      public void setNodes(List<String> nodes)
      Sets cluster node URLs. Node URLs with no port are assumed to be using port 80.
      Parameters:
      nodes - Elasticsearch cluster nodes
    • getTargetContentField

      public String getTargetContentField()
      Gets the name of the Elasticsearch field where content will be stored. Default is "content".
      Returns:
      field name
    • setTargetContentField

      public void setTargetContentField(String targetContentField)
      Sets the name of the Elasticsearch field where content will be stored. Specifying a null value will disable storing the content.
      Parameters:
      targetContentField - field name
    • getSourceIdField

      public String getSourceIdField()
      Gets the document field name containing the value to be stored in the Elasticsearch "_id" field. Default is not a field, but rather the document reference.
      Returns:
      name of field containing id value
    • setSourceIdField

      public void setSourceIdField(String sourceIdField)
      Sets the document field name containing the value to be stored in the Elasticsearch "_id" field. Set to null to use the document reference instead of a field (default).
      Parameters:
      sourceIdField - name of field containing id value, or null
    • getIndexName

      public String getIndexName()
      Gets the index name.
      Returns:
      index name
    • setIndexName

      public void setIndexName(String indexName)
      Sets the index name.
      Parameters:
      indexName - the index name
    • getTypeName

      public String getTypeName()
      Gets the type name. The type name is deprecated as of Elasticsearch 7.0 and should be null when using that version or higher.
      Returns:
      type name
    • setTypeName

      public void setTypeName(String typeName)
      Sets the type name. The type name is deprecated as of Elasticsearch 7.0 and should be null when using that version or higher.
      Parameters:
      typeName - type name
    • getJsonFieldsPattern

      public String getJsonFieldsPattern()
      Gets the regular expression matching fields that contain a JSON object as their value (as opposed to a regular string). Default is null.
      Returns:
      regular expression
      Since:
      4.1.0
    • setJsonFieldsPattern

      public void setJsonFieldsPattern(String jsonFieldsPattern)
      Sets the regular expression matching fields that contain a JSON object as their value (as opposed to a regular string).
      Parameters:
      jsonFieldsPattern - regular expression
      Since:
      4.1.0
    • isIgnoreResponseErrors

      public boolean isIgnoreResponseErrors()
      Gets whether to ignore response errors. By default, an exception is thrown if the Elasticsearch response contains an error. When true, the errors are logged instead.
      Returns:
      true when ignoring response errors
    • setIgnoreResponseErrors

      public void setIgnoreResponseErrors(boolean ignoreResponseErrors)
      Sets whether to ignore response errors. When false, an exception is thrown if the Elasticsearch response contains an error. When true, the errors are logged instead.
      Parameters:
      ignoreResponseErrors - true when ignoring response errors
    • isDiscoverNodes

      public boolean isDiscoverNodes()
      Whether automatic discovery of Elasticsearch cluster nodes should be enabled.
      Returns:
      true if enabled
    • setDiscoverNodes

      public void setDiscoverNodes(boolean discoverNodes)
      Sets whether automatic discovery of Elasticsearch cluster nodes should be enabled.
      Parameters:
      discoverNodes - true if enabled
    • getCredentials

      public Credentials getCredentials()
      Gets Elasticsearch authentication credentials.
      Returns:
      credentials
      Since:
      5.0.0
    • setCredentials

      public void setCredentials(Credentials credentials)
      Sets Elasticsearch authentication credentials.
      Parameters:
      credentials - the credentials
      Since:
      5.0.0
    • getApiKey

      public String getApiKey()
      Gets the API key for Elasticsearch API Key authentication.
      Returns:
      the Base64-encoded API key, or null
      Since:
      5.0.0
    • setApiKey

      public void setApiKey(String apiKey)
      Sets the API key for Elasticsearch API Key authentication. When set, this takes precedence over basic credentials. The value should be the Base64-encoded API key as provided by Elasticsearch.
      Parameters:
      apiKey - the Base64-encoded API key
      Since:
      5.0.0
    • getDotReplacement

      public String getDotReplacement()
      Gets the character used to replace dots in field names. Default is null (does not replace dots).
      Returns:
      replacement character or null
    • setDotReplacement

      public void setDotReplacement(String dotReplacement)
      Sets the character used to replace dots in field names.
      Parameters:
      dotReplacement - replacement character or null
    • getConnectionTimeout

      public int getConnectionTimeout()
      Gets Elasticsearch connection timeout.
      Returns:
      milliseconds
      Since:
      4.1.0
    • setConnectionTimeout

      public void setConnectionTimeout(int connectionTimeout)
      Sets Elasticsearch connection timeout.
      Parameters:
      connectionTimeout - milliseconds
      Since:
      4.1.0
    • getSocketTimeout

      public int getSocketTimeout()
      Gets Elasticsearch socket timeout.
      Returns:
      milliseconds
      Since:
      4.1.0
    • setSocketTimeout

      public void setSocketTimeout(int socketTimeout)
      Sets Elasticsearch socket timeout.
      Parameters:
      socketTimeout - milliseconds
      Since:
      4.1.0
    • isFixBadIds

      public boolean isFixBadIds()
      Gets whether to fix IDs that exceed the Elasticsearch ID limit (512 bytes max). If true, long IDs are truncated and a hash code representing the truncated part is appended.
      Returns:
      true to fix IDs that are too long
      Since:
      4.1.0
    • setFixBadIds

      public void setFixBadIds(boolean fixBadIds)
      Sets whether to fix IDs that exceed the Elasticsearch ID limit (512 bytes max). If true, long IDs are truncated and a hash code representing the truncated part is appended.
      Parameters:
      fixBadIds - true to fix IDs that are too long
      Since:
      4.1.0
    • initBatchCommitter

      protected void initBatchCommitter() throws CommitterException
      Overrides:
      initBatchCommitter in class AbstractBatchCommitter
      Throws:
      CommitterException
    • commitBatch

      protected void commitBatch(Iterator<ICommitterRequest> it) throws CommitterException
      Specified by:
      commitBatch in class AbstractBatchCommitter
      Throws:
      CommitterException
    • closeBatchCommitter

      protected void closeBatchCommitter() throws CommitterException
      Overrides:
      closeBatchCommitter in class AbstractBatchCommitter
      Throws:
      CommitterException
    • createRestClient

      protected org.elasticsearch.client.RestClient createRestClient()
    • createSniffer

      protected org.elasticsearch.client.sniff.Sniffer createSniffer(org.elasticsearch.client.RestClient client)
    • saveBatchCommitterToXML

      protected void saveBatchCommitterToXML(XML xml)
      Specified by:
      saveBatchCommitterToXML in class AbstractBatchCommitter
    • loadBatchCommitterFromXML

      protected void loadBatchCommitterFromXML(XML xml)
      Specified by:
      loadBatchCommitterFromXML in class AbstractBatchCommitter
    • equals

      public boolean equals(Object other)
      Overrides:
      equals in class AbstractBatchCommitter
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class AbstractBatchCommitter
    • toString

      public String toString()
      Overrides:
      toString in class AbstractBatchCommitter