Searchindex Configuration

Configure the search index for content that is shared repository-wide (the /jcr:system tree, which mainly contains versions).

Introduction

The SearchIndex element is part of the Workspace configuration. Also see the Jackrabbit Wiki about Search and Indexing Configuration.

This page elaborates on "best practice" SearchIndex params, the custom Hippo Repository ServicingSearchIndex, and the custom Hippo Repository ServicingIndexingConfigurationImpl.

Minimal configuration

A minimal SearchIndex configuration looks like the following:

<SearchIndex class="org.hippoecm.repository.FacetedNavigationEngineThirdImpl">
  <param name="indexingConfiguration" value="indexing_configuration.xml"/>
  <param name="indexingConfigurationClass" value="org.hippoecm.repository.query.lucene.ServicingIndexingConfigurationImpl"/>
</SearchIndex>

Best practice

The Jackrabbit documentation describes all available <SearchIndex> params. Below are the Hippo Repository default settings for a combination of params; these values have been found to work well together.

<param name="useCompoundFile" value="true"/>
<param name="minMergeDocs" value="1000"/>
<param name="volatileIdleTime" value="10"/>
<param name="maxMergeDocs" value="1000000000"/>
<param name="mergeFactor" value="5"/>

Setting maxMergeDocs too low or mergeFactor too high results in many Lucene indexes, which in turn slows down Lucene queries severely. The volatileIdleTime is the idle time in seconds after which the volatile part of the index is moved to a persistent index, even if minMergeDocs has not been reached. Also see the Jackrabbit Lucene implementation: http://jackrabbit.apache.org/doc/arch/operate/index-readers.html

Two other interesting SearchIndex params are:

  1. textFilterClasses: Sets the list of text filters (and text extractors) to use for extracting text content from binary properties. The list must be comma (or whitespace) separated and contain the fully qualified class names of the TextFilter (and, since 1.3, TextExtractor) classes to be used. The configured classes must all have a public default constructor.

  2. analyzer: By default, Jackrabbit ships with a JackrabbitAnalyzer. The analyzer configured in <param name="analyzer" value="..."/> is used as the default text analyzer. If needed, it can be replaced, for example by org.apache.lucene.analysis.de.GermanAnalyzer for German texts. See IndexingConfiguration at the bottom for how to set different analyzers per property.
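
As an illustration of the two params above, the fragment below is a minimal sketch rather than a recommended setup. It assumes that the Jackrabbit text extractor classes org.apache.jackrabbit.extractor.PlainTextExtractor and org.apache.jackrabbit.extractor.PdfTextExtractor, and the Lucene class org.apache.lucene.analysis.de.GermanAnalyzer, are available on the repository classpath; which extractors and which analyzer you actually need depends on your content.

```xml
<!-- Assumption: both extractor classes are on the classpath and have a public default constructor. -->
<param name="textFilterClasses" value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor"/>
<!-- Assumption: German content only; omit this param to keep the default JackrabbitAnalyzer. -->
<param name="analyzer" value="org.apache.lucene.analysis.de.GermanAnalyzer"/>
```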

To enable search highlighting and spell checking, the following parameters are configured:

<param name="excerptProviderClass" value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
<param name="supportHighlighting" value="true"/>
<param name="spellCheckerClass" value="org.hippoecm.repository.query.lucene.LuceneSpellChecker$OneMinuteRefreshInterval"/>

The parameters forceConsistencyCheck, enableConsistencyCheck and autoRepair are all set to true.

<param name="forceConsistencyCheck" value="true"/>
<param name="enableConsistencyCheck" value="true"/>
<param name="autoRepair" value="true"/>
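
Putting the fragments above together, a complete SearchIndex element might look like the sketch below. It only combines params already listed on this page. The order of the params is an arbitrary choice, and a real repository.xml may need further params that this page does not cover.

```xml
<SearchIndex class="org.hippoecm.repository.FacetedNavigationEngineThirdImpl">
  <!-- Hippo-specific indexing configuration -->
  <param name="indexingConfiguration" value="indexing_configuration.xml"/>
  <param name="indexingConfigurationClass" value="org.hippoecm.repository.query.lucene.ServicingIndexingConfigurationImpl"/>
  <!-- Index merge behaviour -->
  <param name="useCompoundFile" value="true"/>
  <param name="minMergeDocs" value="1000"/>
  <param name="volatileIdleTime" value="10"/>
  <param name="maxMergeDocs" value="1000000000"/>
  <param name="mergeFactor" value="5"/>
  <!-- Highlighting and spell checking -->
  <param name="excerptProviderClass" value="org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt"/>
  <param name="supportHighlighting" value="true"/>
  <param name="spellCheckerClass" value="org.hippoecm.repository.query.lucene.LuceneSpellChecker$OneMinuteRefreshInterval"/>
  <!-- Consistency checking -->
  <param name="forceConsistencyCheck" value="true"/>
  <param name="enableConsistencyCheck" value="true"/>
  <param name="autoRepair" value="true"/>
</SearchIndex>
```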

ServicingSearchIndex

The ServicingSearchIndex extends org.apache.jackrabbit.core.query.lucene.SearchIndex to enable:

  1. loading the indexing_configuration from a jar on the classpath, whereas the current Jackrabbit implementation can only load an indexing_configuration from the filesystem.

  2. extending createDocument, the method responsible for creating the Lucene Document that is added to the Lucene index. This createDocument uses the Hippo ServicingNodeIndexer, which indexes documents in a way that enables fast faceted searching on their properties.

ServicingIndexingConfigurationImpl

The ServicingIndexingConfigurationImpl only serves to make the http://www.hippoecm.org/nt/1.2 namespace available, and to determine whether a property is a Hippo Facet or Hippo Path and needs to be indexed as such. The indexing_configuration is ignored.

Configuration can look like:

<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0" xmlns:hippo="http://www.hippoecm.org/nt/1.2">
  <facets>
    <property name="author"/>
    <property name="date"/>
    <property name="published"/>
  </facets>
</configuration>

This means that only the properties named 'author', 'date' and 'published' are indexed as facets, and other properties are not available as facets. The advantage is that fewer properties need to be indexed and have their TermVector stored. For example, there is hardly any use in indexing the 'content' of a document as a facet value. Some JCR property types are indexed as facets and some never are. See below:

property                             facet indexed
javax.jcr.PropertyType.BINARY        NO
javax.jcr.PropertyType.BOOLEAN       YES
javax.jcr.PropertyType.DATE          YES
javax.jcr.PropertyType.DOUBLE        YES
javax.jcr.PropertyType.LONG          YES
javax.jcr.PropertyType.REFERENCE     NO
javax.jcr.PropertyType.PATH          NO
javax.jcr.PropertyType.STRING        YES
javax.jcr.PropertyType.NAME          NO
