Searchindex Configuration - Enterprise Java Content management system - Hippo CMS

Searchindex Configuration

Introduction

The SearchIndex element is part of the Workspace configuration. Also see the Jackrabbit Wiki about Search and Indexing Configuration.

This page elaborates on "best practice" SearchIndex params.

A minimal SearchIndex configuration looks like the following:

<SearchIndex class="org.hippoecm.repository.FacetedNavigationEngineImpl">
   <param name="indexingConfiguration" value="indexing_configuration.xml"/>
   <param name="indexingConfigurationClass" value="org.hippoecm.repository.query.lucene.ServicingIndexingConfigurationImpl"/>
</SearchIndex> 

Best practice

The Jackrabbit documentation describes all available <SearchIndex> params. Below are the Hippo Repository default settings for a combination of params that has been found to work well together.

<param name="useCompoundFile" value="true"/>
<param name="minMergeDocs" value="1000"/>
<param name="volatileIdleTime" value="10"/>
<param name="maxMergeDocs" value="1000000000"/>
<param name="mergeFactor" value="5"/> 

Setting maxMergeDocs too low or mergeFactor too high results in many Lucene indexes, which in turn slows down Lucene queries severely. The volatileIdleTime is the idle time in seconds after which the volatile part of the index is moved to a persistent index, even if minMergeDocs has not been reached.

Another interesting SearchIndex param is:

<param name="analyzer"
       value="org.hippoecm.repository.query.lucene.StandardHippoAnalyzer"/> 

analyzer: Hippo ships with org.hippoecm.repository.query.lucene.StandardHippoAnalyzer, which is used as the default text analyzer. If needed, this analyzer can be replaced, for example by org.apache.lucene.analysis.de.GermanAnalyzer for German texts.
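
As an illustration of such a replacement, the param for German texts might look like the sketch below. It assumes that the Lucene version bundled with your repository provides GermanAnalyzer in the org.apache.lucene.analysis.de package and that the class is on the repository classpath; verify both before using it.

<!-- Sketch only: the package and availability of GermanAnalyzer depend on your Lucene version -->
<param name="analyzer"
       value="org.apache.lucene.analysis.de.GermanAnalyzer"/>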

The parameters forceConsistencyCheck, enableConsistencyCheck and autoRepair are all set to true by default. See Checking and fixing search index inconsistencies for more information about this.

<param name="forceConsistencyCheck" value="true"/>
<param name="enableConsistencyCheck" value="true"/>
<param name="autoRepair" value="true"/> 

When highlighting is not used in search results, which is the case by default, it is best not to support highlighting at all, since this reduces the size of the Lucene index. This is done through:

<param name="supportHighlighting" value="false"/> 

Similarity on text between documents is supported by default, whereas similarity on text between binaries is switched off by default. This is set through:

<param name="supportSimilarityOnStrings" value="true"/>
<param name="supportSimilarityOnBinaries" value="false"/>

Hippo Repository supports authorized queries. Because authorization is checked against Lucene, it can report exact query result sizes very fast. Some authorization setups, however, cannot be checked entirely against Lucene. In that case, the size returned by #getSize of the QueryResult might be larger than the number of results that can actually be retrieved. (An unauthorized node is never returned, because the standard authorization checks are still performed when the node is fetched for the query result.) If your authorization setup makes #getSize imprecise and you prefer a correct size over performance, change the default below to true.

<param name="slowAlwaysExactSizedQueryResult" value="false"/>
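
For reference, the settings discussed on this page could be combined into a single SearchIndex element, as sketched below. All values are the defaults quoted on this page. The order of the params is arbitrary, and the element is assumed to sit inside the Workspace configuration as described in the introduction.

<SearchIndex class="org.hippoecm.repository.FacetedNavigationEngineImpl">
   <!-- Minimal configuration -->
   <param name="indexingConfiguration" value="indexing_configuration.xml"/>
   <param name="indexingConfigurationClass" value="org.hippoecm.repository.query.lucene.ServicingIndexingConfigurationImpl"/>
   <!-- Index merging and volatile index handling -->
   <param name="useCompoundFile" value="true"/>
   <param name="minMergeDocs" value="1000"/>
   <param name="volatileIdleTime" value="10"/>
   <param name="maxMergeDocs" value="1000000000"/>
   <param name="mergeFactor" value="5"/>
   <!-- Text analysis -->
   <param name="analyzer"
          value="org.hippoecm.repository.query.lucene.StandardHippoAnalyzer"/>
   <!-- Consistency checking -->
   <param name="forceConsistencyCheck" value="true"/>
   <param name="enableConsistencyCheck" value="true"/>
   <param name="autoRepair" value="true"/>
   <!-- Highlighting and similarity -->
   <param name="supportHighlighting" value="false"/>
   <param name="supportSimilarityOnStrings" value="true"/>
   <param name="supportSimilarityOnBinaries" value="false"/>
   <!-- Set to true only if exact result sizes matter more than performance -->
   <param name="slowAlwaysExactSizedQueryResult" value="false"/>
</SearchIndex>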

 
