Default Hippo CMS Functionality - Derived Data

Derived data is the concept where additional information for a document is generated from the information in the document itself.

Description of the problem

For example, if you store some (binary) content in a document, you can add a property to the document that indicates the size of the content stored.

In principle, this is useless and even somewhat dangerous, because one can simply use a function that computes the size of the content on the fly. When retrieving the document, there is no fundamental reason why reading the size property is better than calling the function that computes the size of the content. The same holds for querying: a query might as well be written as SELECT * FROM documents WHERE length(content) > 0.

The danger is that when modifying the content, you can forget to update the size property to reflect the new size of the content. The content and the size property are then inconsistent, which is generally considered a bigger problem than leaving out the data altogether!
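The inconsistency described above can be made concrete with a minimal Java sketch. The class StoredDocument and its members are hypothetical stand-ins, not part of Hippo CMS; they only illustrate how a hand-maintained size property can drift away from the content it describes.

```java
// Hypothetical stand-in for a document that stores content plus a hand-maintained size.
final class StoredDocument {
    byte[] content = new byte[0];
    long size = 0; // redundant copy of content.length, maintained by the caller

    // Correct update: content and size change together.
    void setContentAndSize(byte[] newContent) {
        content = newContent;
        size = newContent.length;
    }

    // Buggy update: the caller forgot to refresh the size property.
    void setContentOnly(byte[] newContent) {
        content = newContent;
    }

    // The on-the-fly alternative can never be stale.
    long computeSize() {
        return content.length;
    }

    // True when the stored size still matches the actual content.
    boolean isConsistent() {
        return size == computeSize();
    }
}

class StaleSizeDemo {
    public static void main(String[] args) {
        StoredDocument doc = new StoredDocument();
        doc.setContentAndSize(new byte[5]);
        System.out.println("consistent after full update: " + doc.isConsistent());
        doc.setContentOnly(new byte[12]);
        System.out.println("consistent after partial update: " + doc.isConsistent());
    }
}
```

After the partial update the stored size still describes the old content, which is exactly the situation derived data is meant to rule out.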

There are, however, technical reasons why such functionality is still needed. For one, the compute function might be very expensive, and it could be executed many times, especially in a query over all documents. Another, even more pressing, need for this functionality is that the available query languages do not allow you to express all kinds of realistic queries. For example, an XPath query does not allow you to query for a document that contains two properties, let's call them a and b, whose values are equal. Naively this could be written as //*[@a=@b], but this yields no results even though logically there should be some. Certain other queries are possible but have a huge performance impact. These are deliberate deficiencies in the query languages XPath and JCR-SQL, not bugs.
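One way a derived property could work around the a = b limitation is sketched below. It is an illustration only: nodes are modelled as plain Java maps, and the property name aEqualsB is a hypothetical choice that does not come from Hippo CMS. The comparison between the two properties is done once, when the data is stored, so that a query only has to compare a single property with a constant.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical in-memory stand-in for nodes with string properties.
class EqualityFlagSketch {

    // Computed at store time: records whether properties "a" and "b" are equal.
    static void deriveEqualityFlag(Map<String, String> node) {
        boolean equal = node.containsKey("a")
                && Objects.equals(node.get("a"), node.get("b"));
        node.put("aEqualsB", Boolean.toString(equal));
    }

    // The "query" only compares one property with a constant value.
    static List<Map<String, String>> findWhereAEqualsB(List<Map<String, String>> nodes) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> node : nodes) {
            if ("true".equals(node.get("aEqualsB"))) {
                result.add(node);
            }
        }
        return result;
    }

    // Convenience factory: builds a node and derives the flag immediately.
    static Map<String, String> node(String a, String b) {
        Map<String, String> n = new HashMap<>();
        n.put("a", a);
        n.put("b", b);
        deriveEqualityFlag(n);
        return n;
    }
}
```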

Facility offered

As a solution for expressing efficient queries, and for accessing information about the content without having to know or execute the procedure by which it is obtained, the Hippo CMS repository offers the facility of "derived data".
As the term suggests, it computes properties of a document that are derived from other properties of the document. Derived properties may be put on, and computed from, the JCR node that represents a document or any descendant node in a document.

When entering a document that should contain such a derived property, you should not set the value of the derived property yourself. Instead, the repository automatically computes the value of the property upon a save() call. Because the repository guarantees to recompute the property upon a save, the problem of inconsistent data has nearly vanished.

In order for the repository to do this, the repository must be informed when and how to compute the properties.

  • The "when" is determined by the JCR nodetype of the data. The repository can be configured to compute a property of a certain node type.

    The logic behind this is that a JCR nodetype defines the valid properties a piece of a document is allowed to contain, i.e. the structure of the data. Whether or not a property is to be derived depends, at the very least, on whether a piece of data can contain the property in question. It is therefore proper to use the content model to determine when to compute a property as well.

  • The "how" is given as a method in a class. The method receives the document's properties as input and should compute the derived property as a pure function, i.e. the implementing method must always return the same result when given the same input parameters and must not access any resources other than those given as parameters.

Because derived properties are recomputed upon every modification that could lead to a different value, and because the compute method must be written as a pure function, this methodology is functionally equivalent to never computing the properties when storing data but computing them on the fly when retrieving or querying the data.
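The equivalence claimed above can be illustrated with a small, self-contained Java sketch. The class SaveTimeDerivation is a hypothetical in-memory stand-in, not the Hippo repository API: its save() method recomputes a derived value with a pure function, so the stored value always equals what an on-the-fly computation would return.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory stand-in for a repository node; not the Hippo API.
class SaveTimeDerivation {
    private final Map<String, Double> properties = new HashMap<>();
    private final Function<Map<String, Double>, Double> pureFunction;

    SaveTimeDerivation(Function<Map<String, Double>, Double> pureFunction) {
        this.pureFunction = pureFunction;
    }

    void set(String name, double value) {
        properties.put(name, value);
    }

    // Mirrors the idea of save(): the derived value is recomputed on every save.
    void save() {
        properties.put("derived", compute());
    }

    // The value that was stored by the last save().
    double storedDerived() {
        return properties.get("derived");
    }

    // The value a caller would get by computing at read time instead.
    double computedOnTheFly() {
        return compute();
    }

    // The function only sees a read-only snapshot of the properties.
    private double compute() {
        return pureFunction.apply(
                Collections.unmodifiableMap(new HashMap<>(properties)));
    }
}
```

Because the function is pure and runs on every save, comparing storedDerived() with computedOnTheFly() after any save() gives equal values in this sketch.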

How to define

We will outline how to define, configure and use derived data with a simple example that applies the Pythagorean theorem.

Defining the data for which to compute properties

We define a document type that is a core shape definition:

    [sample:shape] > hippo:document
    - sample:a (double)
    - sample:b (double)

Subsequently, we define a type that can be added as a mixin to the shape definition to indicate that the shape is a triangle:

    [sample:triangle] > hippo:derived mixin
    - sample:c (double)

To indicate that certain properties of the type sample:triangle are to be computed using the derived data procedure, the type must extend the hippo:derived mixin node type.

Configuring the repository to compute derived properties for this data

Now we need to configure in the repository how to compute the derived property of sample:triangle. These procedures are defined in the JCR repository under /hippo:configuration/hippo:derivatives. To compute the c property we can enter the following JCR definition:

    [repository root]
    `-- hippo:configuration
        `-- hippo:derivatives [hippo:derivativesfolder]
            `-- pythagorean [hippo:deriveddefinition]
                + hippo:nodetype = sample:triangle
                + hippo:classname = sample.PythagoreanTheorem
                + hippo:serialver = 1
                |-- hippo:accessed [hippo:propertyreferences]
                |   |-- a [hippo:relativepropertyreference]
                |   |   + hippo:relPath = sample:a
                |   `-- b [hippo:relativepropertyreference]
                |       + hippo:relPath = sample:b
                `-- hippo:derived [hippo:propertyreferences]
                    `-- c [hippo:relativepropertyreference]
                        + hippo:relPath = sample:c

First, the hippo:nodetype property defines the nodetype which contains the properties that should be derived. For any change to nodes of this type, this derived data definition indicates the function to be executed.

The hippo:classname property contains the name of the class, which should extend the base class org.hippoecm.repository.ext.DerivedDataFunction and implement the compute method as a pure function. The class PythagoreanTheorem must have a public no-argument constructor. The number stated in the hippo:serialver property should match the serialVersionUID field in the implementing class sample.PythagoreanTheorem.
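The two class requirements mentioned above (a public no-argument constructor and a serialVersionUID equal to the configured hippo:serialver) could be checked before deployment with plain Java reflection. The sketch below is an illustrative sanity check only; it is not how the repository itself validates classes, and the names DerivedFunctionChecks and SampleFunction are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Illustrative pre-deployment checks; not part of the Hippo repository.
class DerivedFunctionChecks {

    // Returns true when cls has a public no-argument constructor.
    static boolean hasPublicNoArgConstructor(Class<?> cls) {
        try {
            cls.getConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Returns true when cls declares a static serialVersionUID equal to expected.
    static boolean serialVersionMatches(Class<?> cls, long expected) {
        try {
            Field field = cls.getDeclaredField("serialVersionUID");
            if (!Modifier.isStatic(field.getModifiers())) {
                return false;
            }
            field.setAccessible(true);
            return field.getLong(null) == expected;
        } catch (ReflectiveOperationException e) {
            // Field is missing or cannot be read.
            return false;
        }
    }
}

// Hypothetical stand-in for a function class configured with hippo:serialver = 1.
class SampleFunction {
    static final long serialVersionUID = 1;

    public SampleFunction() {
    }
}
```

With these definitions, serialVersionMatches(SampleFunction.class, 1L) holds, while passing 2L would flag a mismatch with the configured hippo:serialver value.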

The definitions in the hippo:accessed and hippo:derived node structures indicate the input and output parameters of the derived data function.

Here we indicate that relative to the node of type sample:triangle there are two input properties: sample:a and sample:b. The hippo:relPath properties indicate the path of each property relative to the subject node for which the computation takes place. The values of these two properties are entered under the keys "a" and "b" (the names of the hippo:relativepropertyreference nodes) in a Map that the compute method implemented by PythagoreanTheorem takes as input:

public Map<String,Value[]> compute(Map<String,Value[]> parameters);

As a result, the compute method should return a map in which the value for the derived property sample:c can be found under the key "c". The definition also states the (possibly multiple) results computed by the function as nodes under hippo:derived. The hippo:relPath again indicates the relative path to the property.

The hippo:relPath may indicate any property below the document for which properties are computed. It may not contain references to other documents.
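The map-in, map-out contract described above can be tried out without a repository. In the following sketch, double[] stands in for javax.jcr.Value[] and the class ComputeContractSketch is hypothetical; only the key names "a", "b" and "c" are taken from the configuration shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch: double[] replaces javax.jcr.Value[] so it runs without a repository.
class ComputeContractSketch {

    // Keys "a" and "b" are the node names under hippo:accessed;
    // key "c" is the node name under hippo:derived.
    static Map<String, double[]> compute(Map<String, double[]> parameters) {
        double a = parameters.get("a")[0];
        double b = parameters.get("b")[0];
        Map<String, double[]> result = new HashMap<>(parameters);
        result.put("c", new double[] { Math.sqrt(a * a + b * b) });
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> input = new HashMap<>();
        input.put("a", new double[] { 3.0 });
        input.put("b", new double[] { 4.0 });
        // The value under "c" is what would be written to sample:c.
        System.out.println(compute(input).get("c")[0]);
    }
}
```

Each entry is an array because a property may hold multiple values; this sketch only reads and writes the first element.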

Supplying the method that computes the derived property

The configuration indicates which class should be used to compute the data. This class must extend the org.hippoecm.repository.ext.DerivedDataFunction base class and implement the compute method. It cannot be stressed enough that it is a requirement that this is a pure function.

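    package sample;

    import java.util.Map;

    import javax.jcr.RepositoryException;
    import javax.jcr.Value;

    import org.hippoecm.repository.ext.DerivedDataFunction;

    public class PythagoreanTheorem extends DerivedDataFunction {
        static final long serialVersionUID = 1;

        // Pure function: the result depends only on the "a" and "b" inputs.
        public Map<String,Value[]> compute(Map<String,Value[]> parameters) {
            try {
                double a = parameters.get("a")[0].getDouble();
                double b = parameters.get("b")[0].getDouble();
                double c = Math.sqrt(a * a + b * b);
                parameters.put("c", new Value[] { getValueFactory().createValue(c) });
            } catch (RepositoryException e) {
                // Value.getDouble() declares this checked exception;
                // in that case "c" is simply not set.
            }
            return parameters;
        }
    }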

This class can be packaged in a normal plug-in. The properties will be computed upon any change. Current limitations, however, impose one exception: imported data is not recomputed and must already be correct.
