Using the Updater Editor

Introduction

Goal

Perform bulk content operations through scripting.

Background 

The Updater Editor allows developers to create, manage and run updater scripts against a running repository all from within the CMS UI. Updater scripts can bulk-change existing content. For example, they can create a thumbnail of a particular size for every image of a certain type, or rename a certain field for all documents of type news. 

Functionality-wise, this tool has many similarities with the JCR Runner forge project. However, the Updater Editor has three important advantages:

  1. Unlike with JCR Runners, you don't need console access to the server to run your updater scripts. 
  2. The script runs within the same JVM as the repository and does not have to connect over RMI, which increases performance.
  3. A lot of information is logged about an updater run, including information that allows updates to be automatically undone.

Using the Updater Editor it is possible to control which JCR nodes the updater script visits by specifying either a query or a path. It is also possible to control the speed with which the script iterates over the JCR nodes to ensure that the script does not drain the resources of the running repository. The script has access to the full JCR API.

With Great Power Comes Great Responsibility

Updater scripts can modify large parts of your repository. Use them with care.

Security

The scripts are executed via a custom Groovy ClassLoader which protects against obvious and trivial mistakes and misuse (for example, invoking System.exit()). However, it is not intended to provide a fully protected Groovy sandbox. Technically, Groovy updater scripts can still be used to execute external programs, possibly compromising the server environment.
Therefore, protection against incorrect usage of Groovy updater scripts must be enforced by limiting access and usage to trusted developers and administrators only.

Managing Updater Scripts

The left side of the Updater Editor consists of three parts:

  • Registry
    Contains all created updater scripts. Add a new script by clicking the 'New' button. Select a script to edit it. Scripts can be saved to and deleted from the repository. 
  • Queue
    Contains all scripts that are being executed or are waiting to be executed. The scripts are executed in the order in which they were added to the queue. Only one script is executed at a time, even in a clustered environment. You can stop the currently executing script, and delete queued scripts from the queue. Stopping a script will finish the current NodeUpdateVisitor#doUpdate call before actually stopping. The output of the script is available in the bottom part of the screen and is refreshed every few seconds.
  • History
    Contains all scripts that have been (fully or partially) executed. Scripts that have been executed can be reverted from here, provided they support this feature by implementing undoUpdate.

Executing Updater Scripts

The updater engine uses the visitor pattern. Which nodes are visited is specified by supplying either a path or a query.

  • path is an absolute path in the repository, e.g. '/content/documents', '/hst:hst/hst:configurations', etc. All nodes below the path will be visited, including the node specified by the path itself.
  • query is an XPath query that selects the nodes to visit. Example queries are:
      //element(*, hippo:document)
          all nodes of type 'hippo:document'
      /jcr:root/hst:hst/hst:configurations//element(*, hst:sitemapitem)
          all nodes of type 'hst:sitemapitem' below /hst:hst/hst:configurations
      //*[@example:title='foo']
          all nodes that have the property 'example:title' set to the value 'foo'
Try out XPath queries in the repository servlet at http://localhost:8080/cms/repository
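
The example queries above follow two simple shapes, which can be composed from a path, a node type, or a property value. The following standalone Java sketch shows one way to build such query strings before pasting them into the Updater Editor. The class XPathQuerySketch and its methods are hypothetical helpers, not part of the CMS or the JCR API. The sketch assumes that single quotes inside a value are escaped by doubling them, and it does not apply the ISO 9075 encoding that path segments starting with a digit may require.

```java
/** Hypothetical helper for composing queries like the examples above; not part of the CMS or JCR API. */
final class XPathQuerySketch {

    private XPathQuerySketch() {
    }

    /**
     * Builds "/jcr:root<basePath>//element(*, <nodeType>)".
     * A null, empty or "/" base path yields the repository-wide form "//element(*, <nodeType>)".
     */
    static String elementsOfType(String basePath, String nodeType) {
        boolean wholeRepository = basePath == null || basePath.isEmpty() || basePath.equals("/");
        String scope = wholeRepository ? "" : "/jcr:root" + stripTrailingSlash(basePath);
        return scope + "//element(*, " + nodeType + ")";
    }

    /** Builds "//*[@<property>='<value>']"; assumes quotes in the value are escaped by doubling. */
    static String nodesWithProperty(String property, String value) {
        String escaped = value.replace("'", "''");
        return "//*[@" + property + "='" + escaped + "']";
    }

    private static String stripTrailingSlash(String path) {
        return path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
    }

    public static void main(String[] args) {
        System.out.println(elementsOfType("/", "hippo:document"));
        System.out.println(elementsOfType("/hst:hst/hst:configurations", "hst:sitemapitem"));
        System.out.println(nodesWithProperty("example:title", "foo"));
    }
}
```

Running main prints the three example queries listed above.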

For each visited node the updater engine will call the script method doUpdate. When the script modifies the node in any way, it should notify the updater engine by returning true from that method.

Changes to nodes are saved in batches. Each executed script can specify a batch size and a throttle value:

  • batch size is the number of nodes that have to be modified before changes are written to the repository. The engine counts updated nodes by checking the outcome of each #doUpdate(Node) call: a return value of true counts as updated, false as skipped, and a thrown exception or error as failed. Keep the batch size reasonably low, say fifty or a hundred, to avoid large changesets that consume a lot of memory.
    See the Reporting of Execution section below for more detail.
  • throttle is the number of milliseconds to wait after each batch. This prevents a running repository from being swamped with changes and becoming unresponsive to other users.

There are two ways to execute a script:

  • execute will visit all specified nodes and save the changes to the repository after each batch. The UUIDs of all modified nodes are logged in case the script has to be undone later.
  • dry run will also visit all specified nodes, but never write any changes to the repository (i.e. the engine calls Session.refresh(false) after each batch).

Use dry run to try out new scripts without risk.
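
To make the interplay of batch size, throttle, and the two execution modes more concrete, here is a simplified standalone Java model of the loop described above. It is only a sketch and not the actual engine code: the class BatchRunSketch, its Visitor interface, and the Report counters are hypothetical, a plain String path stands in for javax.jcr.Node, and saving or discarding a batch is reduced to incrementing a counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of the batching described above; not the actual engine code. */
final class BatchRunSketch {

    /** Stand-in for a script's doUpdate method; a String path replaces javax.jcr.Node. */
    interface Visitor {
        boolean doUpdate(String nodePath) throws Exception;
    }

    /** Counters comparable to the ones in the execution report. */
    static final class Report {
        int visited, updated, skipped, failed;
        int batchesSaved, batchesDiscarded;
    }

    static Report run(List<String> paths, Visitor visitor, int batchSize,
                      long throttleMillis, boolean dryRun) {
        Report report = new Report();
        int pending = 0; // modified nodes that have not been written yet
        for (String path : paths) {
            report.visited++;
            try {
                if (visitor.doUpdate(path)) {
                    report.updated++;
                    pending++;
                } else {
                    report.skipped++;
                }
            } catch (Exception e) {
                report.failed++;
            }
            if (pending >= batchSize) {
                finishBatch(report, dryRun);
                pending = 0;
                pause(throttleMillis); // give other repository users some room
            }
        }
        if (pending > 0) {
            finishBatch(report, dryRun); // flush the last, partial batch
        }
        return report;
    }

    private static void finishBatch(Report report, boolean dryRun) {
        if (dryRun) {
            report.batchesDiscarded++; // the real engine calls Session.refresh(false) here
        } else {
            report.batchesSaved++;     // the real engine saves the session here
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            paths.add("/content/documents/doc" + i);
        }
        // Skip one node, update the rest, with a batch size of 4 and a 10 ms throttle.
        Report r = run(paths, p -> !p.endsWith("3"), 4, 10L, false);
        System.out.println("visited=" + r.visited + " updated=" + r.updated
                + " skipped=" + r.skipped + " failed=" + r.failed
                + " batchesSaved=" + r.batchesSaved);
    }
}
```

In this model only nodes for which doUpdate returns true count towards the batch, so a script that skips most nodes produces few, small batches regardless of how many nodes it visits.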

Parameterizing Updater Scripts

If your updater script can be reused multiple times without modification of the source, it is useful to set parameters and let your script read the parameters instead of using hard-coded values.

For this purpose, you can define parameters:

  • parameters can be specified with a valid JSON string which defines a map of parameter name (String) and parameter value (Object) pairs.

In your script, you can access the parameters through the parametersMap variable. For example, if you set parameters to { "basePath": "/content/documents/myhippoproject/news", "tag" : "gogreen" }, then you can access those parameters anywhere in your updater script (e.g., in the #initialize(Session) or #doUpdate(Node) method) as follows:

        def basePath = parametersMap["basePath"]
        def tag = parametersMap["tag"]
        log.debug "basePath: ${basePath}, tag: ${tag}"
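
Because parameter values arrive as plain objects parsed from JSON, a reusable script may want to fall back to a default when a parameter is missing and fail early when it has the wrong type. The following standalone Java sketch illustrates that idea. The class ParametersSketch and its stringParam method are hypothetical and not part of the updater API, and treating a missing map as "no parameters" is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for reading script parameters defensively; not part of the updater API. */
final class ParametersSketch {

    private ParametersSketch() {
    }

    /** Returns the named string parameter, or defaultValue when the map or the entry is missing. */
    static String stringParam(Map<String, Object> parametersMap, String name, String defaultValue) {
        Object value = parametersMap == null ? null : parametersMap.get(name);
        if (value == null) {
            return defaultValue;
        }
        if (!(value instanceof String)) {
            throw new IllegalArgumentException("Parameter '" + name + "' must be a string but was "
                    + value.getClass().getSimpleName());
        }
        return (String) value;
    }

    public static void main(String[] args) {
        // Same content as the JSON example above, built by hand for this demo.
        Map<String, Object> parametersMap = new HashMap<>();
        parametersMap.put("basePath", "/content/documents/myhippoproject/news");
        parametersMap.put("tag", "gogreen");

        System.out.println(stringParam(parametersMap, "basePath", "/content/documents"));
        System.out.println(stringParam(parametersMap, "author", "unknown"));
    }
}
```

Checking parameters once, for example in initialize, keeps a misconfigured run from visiting any nodes before the mistake surfaces.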

Automatically Executing Updater Scripts on Startup

It's possible to automatically execute scripts on startup by using the repository-data-application module to add the scripts as content definitions to /hippo:configuration/hippo:update/hippo:queue. Once the application has started, it will execute any scripts in the queue.

Note: there can be additional limitations with respect to the accessible classpath for an automatically executing updater script, depending on the environment in which it is executed.
In a delivery-tier only environment, only the functionality provided by the Hippo Repository might be available on the classpath.

Undoing Updates

An updater script can support easy undo of its modifications by implementing the undoUpdate method. That method should revert a node back to the state before doUpdate was called.

Scripts in the History that have been executed can be undone by clicking the 'Undo' button. The updater engine will then revisit only those nodes that were previously modified by the doUpdate method, and call the method undoUpdate for each of them.

Items in the History that were dry runs or were the result of an undo run cannot be undone.
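
The relationship between an executed run and its undo run can be modelled in a few lines. The standalone Java sketch below is a hypothetical simplification and not the engine's implementation: a map of property maps stands in for the repository, string identifiers stand in for the logged node UUIDs, and the property name and default value are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of an executed run followed by an undo run; not the engine's code. */
final class UndoRunSketch {

    static final String PROPERTY = "example:flag";  // made-up property name
    static final String DEFAULT_VALUE = "default";  // made-up default value

    /** Execute: visit every node and log the identifiers of the nodes that were modified. */
    static List<String> execute(Map<String, Map<String, String>> nodes) {
        List<String> modified = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> node : nodes.entrySet()) {
            if (doUpdate(node.getValue())) {
                modified.add(node.getKey());
            }
        }
        return modified;
    }

    /** Undo: revisit only the logged nodes and return how many were reverted. */
    static int undo(Map<String, Map<String, String>> nodes, List<String> modified) {
        int reverted = 0;
        for (String id : modified) {
            Map<String, String> properties = nodes.get(id);
            if (properties != null && undoUpdate(properties)) {
                reverted++;
            }
        }
        return reverted;
    }

    /** Adds the property when it is missing; returns true only if the node changed. */
    static boolean doUpdate(Map<String, String> properties) {
        if (properties.containsKey(PROPERTY)) {
            return false;
        }
        properties.put(PROPERTY, DEFAULT_VALUE);
        return true;
    }

    /** Removes the property again; returns true only if the node changed. */
    static boolean undoUpdate(Map<String, String> properties) {
        return properties.remove(PROPERTY) != null;
    }

    /** Two sample nodes: one without the property, one that already has a custom value. */
    static Map<String, Map<String, String>> sampleNodes() {
        Map<String, Map<String, String>> nodes = new LinkedHashMap<>();
        nodes.put("id-1", new HashMap<>());
        Map<String, String> custom = new HashMap<>();
        custom.put(PROPERTY, "custom");
        nodes.put("id-2", custom);
        return nodes;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> nodes = sampleNodes();
        List<String> modified = execute(nodes);
        System.out.println("modified: " + modified);
        System.out.println("reverted: " + undo(nodes, modified));
        System.out.println("untouched value: " + nodes.get("id-2").get(PROPERTY));
    }
}
```

The point of the log is visible in the second sample node: it already had a value, was never modified, and is therefore never passed to undoUpdate, so its custom value survives the undo run.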

Writing Updater Scripts

Updater scripts are written in Groovy. Each script must implement the interface NodeUpdateVisitor:

/**
 * Visitor for updating repository content. Replaces
 * {@link org.hippoecm.repository.ext.UpdaterModule}s for all update tasks
 * except backward incompatible node type changes.
 */
public interface NodeUpdateVisitor {

    /**
     * Allows initialization of this updater. Called before any other method is
     * called.
     *
     * @param session a JCR {@link Session} with system credentials
     * @throws RepositoryException when thrown, the updater will not be run by
     *         the framework
     */
    void initialize(Session session) throws RepositoryException;

    /**
     * Update the given node.
     *
     * @param node  the {@link Node} to be updated
     * @return  <code>true</code> if the node was changed, <code>false</code>
     *          if not
     * @throws RepositoryException  if an exception occurred while updating
     *         the node
     */
    boolean doUpdate(Node node) throws RepositoryException;

    /**
     * Revert the given node. This method is intended to be the reverse of the
     * {@link #doUpdate} method.
     * It allows update runs to be reverted in case a problem arises due to the
     * update. The method should throw an {@link UnsupportedOperationException}
     * when it is not implemented.
     *
     * @param node  the node to be reverted.
     * @return  <code>true</code> if the node was changed, <code>false</code>
     *          if not
     * @throws RepositoryException  if an exception occurred while reverting
     *         the node
     * @throws UnsupportedOperationException if the method is not implemented
     */
    boolean undoUpdate(Node node) throws RepositoryException,
                                         UnsupportedOperationException;

    /**
     * Allows cleanup of resources held by this updater. Called after an
     * updater run was completed.
     */
    void destroy();

}

Most scripts will extend the base class BaseNodeUpdateVisitor, which provides a logger and default (no-op) implementations of the methods initialize and destroy.

The default updater script only logs the paths of all visited nodes:

package org.hippoecm.frontend.plugins.cms.dev.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node

class UpdaterTemplate extends BaseNodeUpdateVisitor {

  boolean doUpdate(Node node) {
    log.debug "Updating node ${node.path}"
    return false
  }

  boolean undoUpdate(Node node) {
    throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
  }

}

The node parameter is a javax.jcr.Node object, which gives the script full JCR access to the repository.

Example Script: Add a Property 

The following more elaborate updater script example adds a property 'gettingstarted:copyright' to all documents of type 'gettingstarted:newsdocument'. The script would be useful in the following situation:

  1. create a new project from the archetype 
  2. edit the 'gettingstarted:newsdocument' type and add a String field called 'copyright' with the default value "(c) Hippo"
  3. commit the type
  4. create some more news documents

All newly created news documents will now get the 'copyright' field with the default value "(c) Hippo". However, existing news documents still have an empty 'copyright' field, and need to be updated. The following updater script adds the default copyright statement where it does not exist yet. It also supports undoing the operation.

package org.hippoecm.frontend.plugins.cms.dev.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import javax.jcr.Node

class CopyrightUpdater extends BaseNodeUpdateVisitor {

  private static final PROPERTY_COPYRIGHT = 'gettingstarted:copyright'
  private static final DEFAULT_COPYRIGHT = '(c) Hippo'

  boolean doUpdate(Node node) {
    if (!node.hasProperty(PROPERTY_COPYRIGHT)) {
      log.debug "Adding copyright to node ${node.path}"
      node.setProperty(PROPERTY_COPYRIGHT, DEFAULT_COPYRIGHT)
      return true
    }
    return false
  }

  boolean undoUpdate(Node node) {
    if (node.hasProperty(PROPERTY_COPYRIGHT)) {
      node.getProperty(PROPERTY_COPYRIGHT).remove()
      return true
    }
    return false
  }

}

Execute this script with the following query:

/jcr:root/content/documents//element(*, gettingstarted:newsdocument) 

After the updater script is executed, all three existing news documents will have the default copyright value too. When the script is undone, the default copyright value will be removed again from only those news documents that were changed by the updater. The other news documents will not be touched.

Default Imports

By default, all of the main JCR API packages are already imported by the script classloader: javax.jcr, javax.jcr.nodetype, javax.jcr.security, and javax.jcr.version. You should not have to import members of these packages explicitly.

Restrictions

Some basic restrictions apply to the calls you can make and the classes you can use from your script. Interaction with the local filesystem has been disabled, so the following classes cannot be used: java.io.File, java.io.FileDescriptor, java.io.FileInputStream, java.io.FileOutputStream, java.io.FileWriter, and java.io.FileReader. The following packages are off limits as well: java.nio.file, java.net, javax.net, and javax.net.ssl. It is also not possible to use reflection: calling Class.forName is illegal and you can't use the package java.lang.reflect. Calling System.exit is also prevented.

There can also be additional limitations with respect to the accessible classpath when automatically executing an updater script at startup (see above), depending on the environment in which it is executed.
In a delivery-tier only environment, only the functionality provided by the Hippo Repository might be available on the classpath.

Portability

The scripts, when executed from within the Updater Editor, use a classloader in the CMS application context. Therefore, all libraries packaged with your CMS application are available to your script. If, however, you wish to develop scripts that can be reused in multiple projects, take care not to use libraries that are only packaged with one particular project. The safest bet is to use only libraries and APIs that are available in the shared class loader, although the availability of libraries such as commons-collections and guava can be depended on with some confidence as well.
Furthermore, for scripts that are executed automatically during startup (see above), possibly only the classes in the Repository context are available in a delivery-tier only environment.

Reporting of Execution

After an updater script has run, the updater engine logs a report of the execution result: how many nodes were visited and how many were updated, skipped, or failed. As explained above, whenever the updated count reaches the batch size, the updater engine either saves the session (on 'Execute') or discards the changes (on 'Dry run').

By default, the updater engine automatically records the updated, skipped, or failed count on every invocation of the #doUpdate(Node) method. So, if each unit of work in your updater script corresponds to one node in the iteration defined by the path or query configuration, this automatic recording and batch processing should be good enough.

However, if your updater script does not follow the node iteration defined by the path or query configuration, but instead runs its own query and iterates over nodes manually, the generated report will not reflect what the script really did. The script may not be able to take advantage of the 'Dry run' option, and its execution is not controlled by the updater engine's batch processing and batch size configuration either. Even worse, it may cause significant system overhead (e.g., consuming too much memory) due to uncontrolled batch updates.

As of Hippo CMS 10.2, to address this potential problem, an updater script may report the updated, skipped, and failed nodes manually by using the visitorContext variable (of type org.onehippo.repository.update.NodeUpdateVisitorContext).

Here's an example using visitorContext to report the updated news document count after changing a field in a manual node iteration:

/**
 * ExampleNewsDocumentDateFieldUpdateDemoVisitor is a script that does manual node iteration
 * in an original iteration cycle and reports updated node manually in order to be aligned
 * with the built-in batch commit/revert feature of the updater engine for demonstration purpose.
 */
package org.hippoecm.frontend.plugins.cms.admin.updater

import org.onehippo.repository.update.BaseNodeUpdateVisitor
import java.util.*
import javax.jcr.*
import javax.jcr.query.*

class ExampleNewsDocumentDateFieldUpdateDemoVisitor extends BaseNodeUpdateVisitor {

  boolean doUpdate(Node node) {
    log.debug "Visiting node at ${node.path} just as an entry point in this demo."
    
    // new date field value from the current time
    def now = Calendar.getInstance()
    
    // do manual query and node iteration
    def query = node.session.workspace.queryManager.createQuery("//element(*,demosite:newsdocument)", "xpath")
    def result = query.execute()
    
    for (NodeIterator nodeIt = result.getNodes(); nodeIt.hasNext(); ) {
      def newsNode = nodeIt.nextNode()
      newsNode.setProperty("demosite:date", now)
      // report updated to the engine manually here.
      visitorContext.reportUpdated(newsNode.path)
    }
    
    return false
  }

  boolean undoUpdate(Node node) {
    throw new UnsupportedOperationException('Updater does not implement undoUpdate method')
  }

}

The example above invokes the visitorContext.reportUpdated(path) method after setting the "demosite:date" property. As a result, the updater engine knows how many nodes were updated and can do the batch processing (either saving or discarding the session) properly, based on the batch size configuration.
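
The effect of manual reporting on memory use can be illustrated with a small standalone Java model. This is a hypothetical sketch, not the real NodeUpdateVisitorContext: the Context class below only counts pending changes, and it assumes that the engine checks the batch size each time an update is reported.

```java
/** Hypothetical model of manual reporting during a script's own node iteration. */
final class ManualReportSketch {

    /** Stand-in for the engine side; it only counts, nothing is written anywhere. */
    static final class Context {
        final int batchSize;
        int updated;
        int batchesClosed;
        int pendingChanges;
        int maxPendingChanges;

        Context(int batchSize) {
            this.batchSize = batchSize;
        }

        /** A change made by the script, whether or not it is reported afterwards. */
        void change() {
            pendingChanges++;
            maxPendingChanges = Math.max(maxPendingChanges, pendingChanges);
        }

        /** Assumption: the engine closes a batch as soon as enough updates were reported. */
        void reportUpdated(String path) {
            updated++;
            if (updated % batchSize == 0) {
                batchesClosed++;    // save on 'Execute', discard on 'Dry run'
                pendingChanges = 0;
            }
        }
    }

    /** Simulates one doUpdate call that changes nodeCount nodes in its own loop. */
    static Context iterate(int nodeCount, int batchSize, boolean report) {
        Context context = new Context(batchSize);
        for (int i = 0; i < nodeCount; i++) {
            context.change();
            if (report) {
                context.reportUpdated("/content/documents/news/doc" + i);
            }
        }
        return context;
    }

    public static void main(String[] args) {
        Context silent = iterate(250, 100, false);
        Context reporting = iterate(250, 100, true);
        System.out.println("without reports: updated=" + silent.updated
                + " maxPending=" + silent.maxPendingChanges);
        System.out.println("with reports:    updated=" + reporting.updated
                + " maxPending=" + reporting.maxPendingChanges);
    }
}
```

In this model, the run without reports accumulates all 250 changes in a single pending changeset while the report shows no updated nodes, whereas the reporting run never holds more than one batch of 100 pending changes.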