Monday, November 17, 2014

Nuxeo 6.0 and Elasticsearch

Apache Lucene and Solr, backed commercially by Lucidworks, have been the dominant open source search options for the last decade.  Solr is now widely used and tightly integrated with many products, such as those from Alfresco and PTC, and it is used by a wide variety of companies and organizations, including Comcast, Disney, Goldman Sachs and the FCC.  Here at Formtek, we've integrated it into our Formtek Orion software too.

But Solr isn't the only viable open source search option anymore.  For example, it got my attention earlier this year when ECM vendor Nuxeo upgraded the search capabilities of its core product to use Elasticsearch in the 5.9.3 Fast Track release.  That Elasticsearch integration is now officially available in the Nuxeo 6.0 long term support (LTS) release, which came out this week.

Solr Versus Elasticsearch

What makes Elasticsearch attractive as a technology?

There are actually a lot of similarities between Solr and Elasticsearch. Both are built on top of Lucene, and both are Java-based, Apache-licensed open source software. Their feature sets are very comparable, partly because of that shared Lucene foundation.  Both technologies offer:
  • Java API and REST
  • Faceting
  • Highlighting
  • Replication
  • Distribution
But despite the similarities, or maybe because of them, Elasticsearch has seen tremendous growth in mindshare over the last two years. Google Trends shows that interest in Elasticsearch surpassed interest in Solr in 2014.  So, at this point, while the Solr community is significantly bigger and Solr is more mature, Elasticsearch is growing quickly and is expected to grow even faster, especially now that Elasticsearch has received $70 million of venture funding in June 2014.

Compared to Solr, Elasticsearch is often described as simpler to configure and administer, its use of REST and JSON as more intuitive, and its architecture as designed from the ground up for distributed scaling.
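
To give a feel for the REST and JSON style, here is a minimal sketch of how a full text query might be prepared for Elasticsearch's `_search` endpoint. The host `localhost:9200`, the index name `documents`, and the field `title` are assumptions chosen for illustration, and the request is only built, not sent.

```python
import json
import urllib.request

# Assumed values for illustration only: a local node and a hypothetical index.
ES_HOST = "http://localhost:9200"
INDEX = "documents"


def build_search_request(text, field="title"):
    """Build (but do not send) a full text search request for Elasticsearch."""
    # The query itself is plain JSON: a "match" query on a single field.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{ES_HOST}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("quarterly report")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen(request)` would return the matching documents as a JSON response, assuming a node is actually running at that address.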

Nuxeo Implementation of Elasticsearch

Some of the benefits of Elasticsearch derived by Nuxeo in their 6.0 release include:

  • Faster full text search
  • Query features like facets, geo location, and "more results like this"
  • Consistency with Nuxeo's NXQL query language
  • Ability to aggregate data for running reports and generating statistics
  • High horizontal scalability by adding Elasticsearch nodes
Eric Barroca, Nuxeo CEO, commented that "with Elasticsearch, we have separated the query engine from the database, which has major implications for architectural flexibility and performance.  Because Elasticsearch scales horizontally, the Nuxeo Platform now has virtually infinite scalability."


Eventual Consistency


When working with Alfresco and Solr implementations I first ran into the problem of 'eventual consistency'.  I like Nuxeo's solution for this with their Elasticsearch implementation.

In short, the problem is that a repository which uses an external search engine often takes time to update the search indexes after any changes are made in the repository.  As a way to make client software seem more responsive, repositories like Alfresco and Nuxeo separate out the process of updating the search index from the database transaction. 

'Eventual consistency' or 'asynchronous indexing' refers to the small gap of time, often just seconds, between when a database operation completes and when the queued request to update the search index is finally processed.  During that gap the index lags behind the database; once the request is processed, the two are consistent again.
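
The gap can be illustrated with a small toy model. This is only a sketch under simplifying assumptions: the `ToyRepository` class and its method names are hypothetical, plain dicts stand in for the database and the search index, and the background indexer is reduced to an explicit method call.

```python
from collections import deque


class ToyRepository:
    """Toy model: a database plus a search index that is updated asynchronously."""

    def __init__(self):
        self.database = {}        # source of truth, updated in the transaction
        self.search_index = {}    # updated later, outside the transaction
        self.pending = deque()    # queued index update requests

    def save(self, doc_id, text):
        # The database write completes immediately; indexing is only queued.
        self.database[doc_id] = text
        self.pending.append((doc_id, text))

    def process_index_queue(self):
        # Stand-in for the background indexing work.
        while self.pending:
            doc_id, text = self.pending.popleft()
            self.search_index[doc_id] = text

    def search(self, word):
        # Searches only see what has already been indexed.
        return sorted(d for d, text in self.search_index.items() if word in text)


repo = ToyRepository()
repo.save("doc-1", "annual budget report")

before = repo.search("budget")   # index has not caught up yet
repo.process_index_queue()
after = repo.search("budget")    # index and database now agree

print(before, after)  # prints: [] ['doc-1']
```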

In Alfresco 4.0 you had to choose a single search engine, either Lucene or Solr, and that one engine was then used for all queries.  Lucene searches were 'in transaction', so the database and search indexes were always consistent, while Solr searches used 'eventual consistency'.  But with Alfresco 5.0 Lucene is no longer available, so 'in transaction' consistency is no longer an option.

For most use cases, eventual consistency doesn't cause a problem.  But it means that if a query were to fire off immediately after a database update, the search results may not be totally consistent with what's actually in the database.

With Nuxeo 6.0 there are two ways to search data in the repository:
  • Elasticsearch index query, and
  • Direct relational or NoSQL database query
With Nuxeo, you can choose which of these query types to run based on your use case.  Elasticsearch queries will be fast but use 'eventual consistency'.  Queries made directly to the database will likely be slower, but provide assurance that the results reflect the current state of the repository.

The two query types are not mutually exclusive: both can be used at different points in the same client application.
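
The per-query choice can be pictured with a small toy model. This is not Nuxeo's API: the `ToyStore` class and its `consistent` flag are hypothetical names invented for illustration, with one dict standing in for the database and a second dict standing in for an index that lags behind. The speed difference between the two paths is not modeled.

```python
class ToyStore:
    """Hypothetical store offering two query paths over the same documents."""

    def __init__(self):
        self.database = {}   # always current
        self.index = {}      # may lag until refresh_index() runs

    def save(self, doc_id, title):
        self.database[doc_id] = title

    def refresh_index(self):
        # Stand-in for asynchronous indexing catching up.
        self.index = dict(self.database)

    def query(self, word, consistent=False):
        # consistent=True reads the database; otherwise the index is used.
        source = self.database if consistent else self.index
        return sorted(d for d, title in source.items() if word in title)


store = ToyStore()
store.save("doc-1", "contract draft")
store.refresh_index()
store.save("doc-2", "contract final")   # saved but not yet indexed

listing = store.query("contract")                    # index path
verified = store.query("contract", consistent=True)  # database path
print(listing, verified)  # prints: ['doc-1'] ['doc-1', 'doc-2']
```

A client might use the index path for a browse or search screen and the database path right after a save, where a stale result would be confusing.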

