Monday, November 17, 2014

Nuxeo 6.0 and Elasticsearch

Apache Lucene and Solr, commercially backed by Lucidworks, have been the dominant open source search options for the last decade.  Solr is now widely used and is tightly integrated with many products, like those from Alfresco and PTC, and it's used by a wide variety of companies and organizations like Comcast, Disney, Goldman Sachs and the FCC.  Here at Formtek, we've integrated it into Formtek Orion software too.

But Solr isn't the only viable open-source search option anymore.  For example, it got my attention earlier this year when ECM vendor Nuxeo upgraded the search capabilities of their core product to use Elasticsearch in their 5.9.3 fast track release.  That Elasticsearch integration is now officially available in the Nuxeo 6.0 long term support (LTS) release, which came out this week.

Solr Versus Elasticsearch

What makes Elasticsearch attractive as a technology?

There are actually a lot of similarities between Solr and Elasticsearch. Both are Java-based, Apache-licensed open source software, and their feature sets are very comparable, largely because both are built on top of Lucene.  Both technologies offer:
  • Java API and REST
  • Faceting
  • Highlighting
  • Replication
  • Distribution
But despite the similarities, or maybe because of them, Elasticsearch has seen tremendous growth in mindshare over the last two years. Google Trends shows that Elasticsearch interest surpassed interest in Solr in 2014.  So, at this point, while the Solr community is significantly bigger and Solr is more mature, Elasticsearch is growing quickly and is expected to grow even faster, especially now after Elasticsearch received $70 million of venture funding in June 2014.

Compared to Solr, Elasticsearch is often said to be simpler to configure and administer, its use of REST and JSON is considered more intuitive, and it is built on an architecture that was designed from the ground up for distributed scaling.

Nuxeo Implementation of Elasticsearch

Some of the benefits of Elasticsearch derived by Nuxeo in their 6.0 release include:

  • Faster full text search
  • Query features like facets, geo location, and "more results like this"
  • Consistency with Nuxeo's NXQL query language
  • Ability to aggregate data for running reports and generating statistics
  • High horizontal scalability by adding Elasticsearch nodes
Eric Barroca, Nuxeo CEO, commented that "with Elasticsearch, we have separated the query engine from the database, which has major implications for architectural flexibility and performance.  Because Elasticsearch scales horizontally, the Nuxeo Platform now has virtually infinite scalability."


Eventual Consistency


I first ran into the problem of 'eventual consistency' when working with Alfresco and Solr implementations.  I like Nuxeo's solution for this with their Elasticsearch implementation.

In short, the problem is that a repository which uses an external search engine often takes time to update the search indexes after any changes are made in the repository.  As a way to make client software seem more responsive, repositories like Alfresco and Nuxeo separate out the process of updating the search index from the database transaction. 

'Eventual consistency' or 'asynchronous indexing' refers to the small gap of time, often just seconds, between when a database operation completes and when the queued request to update the search index with those changes is finally processed.  Ultimately the database and the search index will be consistent.

In Alfresco 4.0 you had to choose a search engine: either Lucene or Solr.  Lucene searches were 'in transaction' so that database and search indexes were always consistent, while Solr searches would use 'eventual consistency'.  Depending on your use case, it was possible to choose either the Solr or Lucene implementation, and that one engine would then be used for all queries.  But with Alfresco 5.0 Lucene is no longer available, so 'in transaction' consistency is no longer an option.

For most use cases, eventual consistency doesn't cause a problem.  But it means that if a query were to fire off immediately after a database update, the search results may not be totally consistent with what's actually in the database.
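The gap can be illustrated with a minimal Java sketch.  This is not Alfresco or Nuxeo code: the class `EventuallyConsistentRepo` and its methods are hypothetical stand-ins, the "database" and "index" are in-memory collections, and the background indexer is simulated by an explicit method call.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical stand-in for a repository whose search index is updated asynchronously.
class EventuallyConsistentRepo {
    private final Map<String, String> database = new HashMap<>();  // updated immediately
    private final Set<String> searchIndex = new HashSet<>();       // updated later
    private final Queue<String> indexQueue = new ArrayDeque<>();   // pending index requests

    // The database write completes at once; the index update is only queued.
    void save(String id, String content) {
        database.put(id, content);
        indexQueue.add(id);
    }

    // Stands in for the background indexer draining its queue.
    void processIndexQueue() {
        String id;
        while ((id = indexQueue.poll()) != null) {
            searchIndex.add(id);
        }
    }

    boolean foundInDatabase(String id) { return database.containsKey(id); }

    boolean foundInIndex(String id) { return searchIndex.contains(id); }
}

class EventualConsistencyDemo {
    public static void main(String[] args) {
        EventuallyConsistentRepo repo = new EventuallyConsistentRepo();
        repo.save("doc-1", "pump assembly drawing");

        // Immediately after the save, only the database knows about the document.
        System.out.println("in database:           " + repo.foundInDatabase("doc-1"));
        System.out.println("in index (before run): " + repo.foundInIndex("doc-1"));

        // Once the queued request is processed, the two are consistent again.
        repo.processIndexQueue();
        System.out.println("in index (after run):  " + repo.foundInIndex("doc-1"));
    }
}
```

A search that runs between `save` and `processIndexQueue` misses the new document, which is exactly the window described above.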

With Nuxeo 6.0 there are two ways to search data in the repository:
  • Elasticsearch index query, and
  • Direct relational or NoSQL database query
Based on your use case, with Nuxeo, you can control which of these types of queries to run.  Elasticsearch queries will be fast but use 'eventual consistency'.  Queries made directly to the database will likely be slower, but provide assurance that the results are totally accurate.

The choice between a database query and an Elasticsearch query is not global: both query types can be used at different points in the same client application.
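As a rough sketch of what a per-query choice can look like in client code, the following Java example routes each query to one of two backends.  The names `QueryTarget` and `QueryRouter` are hypothetical and are not part of the Nuxeo API; the two backends are plain functions returning canned results, with the index deliberately one document behind the database.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout; this is not the Nuxeo API.
enum QueryTarget { INDEX, DATABASE }

class QueryRouter {
    private final Function<String, List<String>> indexSearch;
    private final Function<String, List<String>> databaseSearch;

    QueryRouter(Function<String, List<String>> indexSearch,
                Function<String, List<String>> databaseSearch) {
        this.indexSearch = indexSearch;
        this.databaseSearch = databaseSearch;
    }

    // The caller decides, query by query, which backend answers.
    List<String> query(String nxql, QueryTarget target) {
        return target == QueryTarget.INDEX
                ? indexSearch.apply(nxql)
                : databaseSearch.apply(nxql);
    }
}

class QueryRouterDemo {
    public static void main(String[] args) {
        // Canned results: the index has not yet caught up with doc-2.
        QueryRouter router = new QueryRouter(
                nxql -> Arrays.asList("doc-1"),
                nxql -> Arrays.asList("doc-1", "doc-2"));

        String nxql = "SELECT * FROM Document";
        // Fast path for browsing and full text search.
        System.out.println(router.query(nxql, QueryTarget.INDEX));
        // Slower path for a step that must see the latest committed data.
        System.out.println(router.query(nxql, QueryTarget.DATABASE));
    }
}
```

A reasonable split is to send interactive searches to the index and reserve database queries for steps that run right after a write and depend on its result.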


Wednesday, November 12, 2014

Nuxeo Platform 6.0 is Released

Version 6.0 of the Nuxeo platform was officially released today.

Nuxeo has been pretty busy over the last year and they've added some innovative features to their enterprise content management (ECM) platform that really set them apart from other ECM vendors.

While a number of the big features in the new Nuxeo release have been available via 'Fast Track' preview releases made periodically since last December, those features will now all officially roll up and become part of the fully-supported Nuxeo product feature set going forward.

Some of the major highlights of the Nuxeo platform 6.0 release include:

Elasticsearch - extremely scalable and distributed search engine.  Enables hierarchical faceted search.

Collections - a light-weight folder-like object for grouping documents.  Bulk operations like export and download can then be applied to the collection.

MongoDB - optional NoSQL-backend storage offering high flexibility, easy sharding and replication

Mule Connector - enables Nuxeo Automation operations to be inserted inside a Mule Flow, allowing easy integration with other software platforms like Salesforce, Marketo, SAP, and Magento

User Interface Enhancements - including a spreadsheet editor and lightbox support.

CMIS - supports the CMIS 1.1 specification, including the new JSON browser binding

Mobile APIs - includes native client SDKs for iOS and Android, including offline sync

JavaScript API - includes two implementations, one for Node.js and another for jQuery

SAML2 and OAuth 2.0 - enables secure authentication for client applications

AES Encryption - encrypts content with an AES algorithm before moving into the store

A complete list that documents the changes and new features of the Nuxeo platform 6.0 release can be found in the product release notes here.

Nuxeo's Josh Fletcher will also be giving an overview next week on the Nuxeo 6.0 release in a webinar on November 18th.

You can also test drive the latest release here (login: Administrator/Administrator).

The next step in Nuxeo's open product roadmap is just two months away with the 7.1 Fast Track release planned for mid-January 2015.

Monday, November 10, 2014

CMIS Document Migration with Apache Chemistry and Camel


The Headache of Data Migration 

Migrating data between different content repositories can be difficult.  The primary goal of a migration project is to move the stored files, associated metadata and filing hierarchy from one system into another as losslessly as possible.

Migrations typically require that an analyst first create a detailed map for how document types and properties will be transferred between the two systems, and then a developer implements that strategy by writing a migration script.  The actual migration process can be tedious and involve a sequence of imports and exports and things like parallel intermediate files or databases which hold normalized property data.

Something Easier: The Apache Camel camel-cmis Component

Recently while looking at how to migrate content stored in an Alfresco repository into a Nuxeo repository, I came across a blog article by Bilgin Ibryam about the Apache Camel project connector for CMIS, a component he contributed to the Camel project.  I was impressed by how he was able to define in just two lines of Java code a program that could move all the data from an Alfresco repository into Nuxeo by recursively iterating through the folder hierarchy starting at the repository root node, and preserving the hierarchy in the move.

While an indiscriminate migration of all content from one repository into another wasn't exactly what I was looking for, I did find that the camel-cmis component was a good starting point for creating a simple migration tool that could move content easily between CMIS compliant repositories.

Besides the repo-to-repo copy, the camel-cmis component also has the ability to identify groups of documents by using a CMIS query and can then pipe the document data from the result set into the next processing step of a Camel route.
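To make the shape of a camel-cmis endpoint more concrete, here is a small Java sketch that assembles an endpoint URI from a repository URL and a set of options.  The helper class `CmisEndpointUri` is hypothetical and is not part of Camel.  The `query` option name reflects my understanding of the stock component's consumer options and should be checked against the camel-cmis documentation for the Camel version in use.  Values are appended as-is, without any URL encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Camel: assembles a "cmis://" endpoint URI string.
class CmisEndpointUri {
    private final String repositoryUrl;
    private final Map<String, String> options = new LinkedHashMap<>();

    CmisEndpointUri(String repositoryUrl) {
        this.repositoryUrl = repositoryUrl;
    }

    // Adds one endpoint option; insertion order is preserved in the output.
    CmisEndpointUri option(String name, String value) {
        options.put(name, value);
        return this;
    }

    // Produces cmis://<repository url>?name=value&name=value...
    String build() {
        StringBuilder uri = new StringBuilder("cmis://").append(repositoryUrl);
        char separator = '?';
        for (Map.Entry<String, String> option : options.entrySet()) {
            uri.append(separator)
               .append(option.getKey())
               .append('=')
               .append(option.getValue());
            separator = '&';
        }
        return uri.toString();
    }
}

class CmisEndpointUriDemo {
    public static void main(String[] args) {
        // "query" is assumed to be the consumer option that selects documents by CMIS query.
        String uri = new CmisEndpointUri("http://localhost:8080/nuxeo/atom/cmis")
                .option("username", "Administrator")
                .option("password", "Administrator")
                .option("query", "SELECT * FROM cmis:document")
                .build();
        System.out.println(uri);
    }
}
```

The resulting string would be passed to `from(...)` in a Camel route; a query-based endpoint selects a result set rather than walking the whole folder tree.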

Migrating Engineering Documents from Alfresco to Nuxeo

My goal was to be able to successfully migrate into Nuxeo engineering documents which were stored in Alfresco and defined by a content model and document type based on Alfresco aspects.

To do that, I tweaked the camel-cmis component to accept source and target folders, rather than migrate all documents from the repository starting at the repository root.

I also modified the camel-cmis component to accept custom metadata properties.  By using CMIS 1.1 'secondary-types', Alfresco aspect data can be handled as well.  Both Nuxeo and Alfresco understand CMIS 1.1.

And finally, I created a simple Camel Message Translator (Java bean) that maps the names of the document types and properties extracted from Alfresco to the names in the content model that are used by Nuxeo.  In this case, the property name translations were defined in a simple key-value property file which, when applied, maps the extracted property names before passing them into Nuxeo.
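The core of such a translator can be sketched in a few lines of Java.  This is an illustration rather than the actual bean: the class name `PropertyNameTranslator` and the property names in the example mapping are hypothetical, and it assumes the extracted properties arrive as a `Map<String, Object>`.  Names without a mapping are passed through unchanged.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical translator: renames property keys using a key-value mapping.
class PropertyNameTranslator {
    private final Properties nameMap = new Properties();

    // mappingText uses java.util.Properties syntax, one "source=target" pair per line.
    PropertyNameTranslator(String mappingText) {
        try {
            nameMap.load(new StringReader(mappingText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a new map with mapped names; unmapped names are kept as they are.
    Map<String, Object> translate(Map<String, Object> source) {
        Map<String, Object> target = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            String name = nameMap.getProperty(entry.getKey(), entry.getKey());
            target.put(name, entry.getValue());
        }
        return target;
    }
}

class PropertyNameTranslatorDemo {
    public static void main(String[] args) {
        // Hypothetical names. The colon in a key must be escaped as "\:" because
        // java.util.Properties otherwise treats ':' as a key-value separator.
        String mapping = "eng\\:drawingNumber=engdoc:drawingNumber\n";

        Map<String, Object> extracted = new LinkedHashMap<>();
        extracted.put("eng:drawingNumber", "D-100");
        extracted.put("cmis:name", "pump.dwg");

        System.out.println(new PropertyNameTranslator(mapping).translate(extracted));
    }
}
```

One detail worth watching when the mapping lives in a `.properties` file: CMIS property names contain colons, so the colon in each key has to be escaped or the key will be cut short at that point.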



With those pieces in place, it's possible to write a simple Camel route that defines a migration of data under an Alfresco folder to a Nuxeo folder:
    
from("cmis://http://54.198.64.173/alfresco/api/-default-/public/cmis/versions/1.1/atom?username=admin&password=admin&folderId=744385f3-27fd-4096-a29a-e6108d35cfa0")
    .to("bean:translate")
    .to("cmis://http://localhost:8080/nuxeo/atom/cmis?username=Administrator&password=Administrator&folderId=66d138e4-b0e6-41ee-91c2-aa6fc5991c5e");

This Camel route recursively copies the contents of a specified Alfresco folder and its children to a folder in the Nuxeo repository, maintaining the folder hierarchy.  The following screenshots show how documents and folder structure were moved from an Alfresco Share folder into Nuxeo.



Documents in Alfresco Share

Documents Migrated to Nuxeo

You can see that the documents moved from Alfresco were all engineering AutoCAD DWG files.  The files, custom metadata, and foldering hierarchy were copied into Nuxeo.  Then within Nuxeo we can see the migrated documents.  Also, through a configuration of Nuxeo, we are able to display the engineering metadata and render the AutoCAD file content as both thumbnails and preview images.

Using CMIS tools, and software plug-ins for engineering data management and AutoCAD document management, Formtek can assist organizations with ECM migration to the Nuxeo platform.

Footnotes on CMIS and Camel

The use of CMIS makes it easy to interact with compliant content repositories in a standard way.  It enables the easy sharing of content between repositories from different vendors.  CMIS is based on a web services interface that can be accessed using either REST or SOAP.

The Apache Chemistry project provides an open source implementation of the CMIS standard.  Both the Alfresco and Nuxeo implementations of CMIS are based on the Chemistry libraries.  Chemistry's CMIS server libraries are available only for Java.  CMIS client libraries exist for Java, Python, PHP, .NET and Objective-C, but the Java libraries are the most complete and best tested.

Apache Camel is an open source framework for implementing Enterprise Integration Patterns (EIP).  It lets you use messaging and transport models like HTTP, ActiveMQ, JMS, JBI, SCA, and CXF to grab data, transform and move it to different end points.