Wednesday, November 21, 2012

Synchronization of File Properties with Alfresco Metadata Properties

At Formtek we recently had a request to customize Alfresco Share for synchronizing document header properties with corresponding metadata for documents stored in Share.

I'm not able to share the code from the project, but I thought that outlining the basic concept here would serve as an example of the kinds of things that can be implemented within Share.

There were a number of requirements for this project, but two of the core ones were:

  1. On uploads and metadata updates, synchronize the properties of Microsoft Office (Word/Excel/PowerPoint, all versions) and PDF files with the corresponding metadata for the document in Alfresco.
  2. Provide a method for 'publishing' a synchronized document into another location as a PDF.  The file header of the published document should carry the values of the Alfresco metadata at the time the document was published.

When I talk about file content header properties, I'm referring to the types of properties that can be set in the header of Microsoft Office and PDF files.  For example, the next figure is a screenshot from Microsoft Word 2010 for setting the standard properties (Title, Author, Keywords, and Subject) and custom properties.


Properties and custom properties can be similarly defined in PDF files.

Property Extraction on File Upload
When one of these files with properties/custom properties is uploaded into Alfresco, the document that is created captures this additional data based on a mapping properties file that is configured.

Within Share, the metadata would get mapped to something similar to the following panel view of the property data in the Share document detail window.


In this case, the mapping file that specifies how properties in the content file map to the Alfresco metadata properties is as follows:


Property Updates on Alfresco Metadata Edits
This mapping of properties on upload resembles standard Alfresco property extraction, or a special version of it that also accepts and knows how to map the custom property values.  What is different is that two-way synchronization with the properties in the file also occurs.  Note that the mapping also correctly handles datatypes such as boolean, text, number, and date.
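
To make the idea of a typed mapping concrete, here is a minimal Python sketch. It is not the project's code: the mapping table, the `ex:` property names, and the accepted boolean and date formats are all assumptions made for illustration. Only `cm:title` and `cm:author` are properties from the standard Alfresco content model.

```python
from datetime import date

# Hypothetical mapping: file header property -> (Alfresco property, datatype).
# The "ex:" names are invented for this sketch.
PROPERTY_MAP = {
    "Title": ("cm:title", "text"),
    "Author": ("cm:author", "text"),
    "Reviewed": ("ex:reviewed", "boolean"),
    "PageCount": ("ex:pageCount", "number"),
    "ReviewDate": ("ex:reviewDate", "date"),
}


def convert(raw, datatype):
    """Convert a raw header string into a value of the given datatype."""
    if datatype == "boolean":
        # Assumed spellings of "true"; anything else is treated as False.
        return raw.strip().lower() in ("true", "yes", "1")
    if datatype == "number":
        return int(raw)
    if datatype == "date":
        # Assumes ISO dates (YYYY-MM-DD) in the file header.
        return date.fromisoformat(raw)
    return raw  # text is passed through unchanged


def extract_metadata(header):
    """Map header properties onto metadata properties, skipping unmapped ones."""
    metadata = {}
    for name, raw in header.items():
        if name in PROPERTY_MAP:
            target, datatype = PROPERTY_MAP[name]
            metadata[target] = convert(raw, datatype)
    return metadata
```

Calling `extract_metadata({"Title": "Spec", "Reviewed": "Yes"})` returns a dictionary keyed by the Alfresco-side names, with the boolean already converted.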

The property mapping is bi-directional so that when properties are updated in Alfresco, the electronic file associated with the document will be rewritten.  That means that the next time the document is downloaded, the properties in the file will be consistent with the corresponding properties in Alfresco.
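
The reverse direction can be sketched in the same spirit. The Python below is only an illustration under assumed names (the mapping table and the `ex:department` property are hypothetical): it inverts the mapping and works out which header properties would need to be rewritten after a metadata edit. Actually rewriting the Office or PDF file is left out.

```python
# Hypothetical mapping from file header property to Alfresco property.
HEADER_TO_ALFRESCO = {
    "Title": "cm:title",
    "Author": "cm:author",
    "Department": "ex:department",
}

# Inverting the mapping gives the write-back direction.
ALFRESCO_TO_HEADER = {prop: name for name, prop in HEADER_TO_ALFRESCO.items()}


def header_updates(old_metadata, new_metadata):
    """Return the header properties to rewrite after a metadata edit.

    Only mapped properties whose value actually changed are included.
    """
    updates = {}
    for prop, value in new_metadata.items():
        header_name = ALFRESCO_TO_HEADER.get(prop)
        if header_name is not None and old_metadata.get(prop) != value:
            updates[header_name] = str(value)
    return updates
```

Computing only the changed, mapped properties keeps the file from being rewritten when an edit touches metadata that has no counterpart in the file header.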

Publishing to PDF
When a synchronized file is rendered as a PDF file and 'published', the user can select a target folder in the current Share site or a different one.  In our customization, we actually call the 'publish' action 'transfer' to avoid confusion with the 'Publish' action already available in Share.

The user clicks on the 'Transfer to...' action for the document to start the process.


After that the user selects the target location of the published PDF document using a re-engineered Copy/Move to dialog from Share:


The rendered PDF file is then available as a new document in the target location.


When we download the file associated with this document and open it in Adobe Reader, we can examine the settings of the file properties.

The standard properties in the newly created PDF file are shown as:


And custom properties are seen here:


Tracking Published Documents

Within the original document, we also keep track of when the document has been published.  A panel in the document details page in Share for the original document now shows how many times the document has been published/transferred and to where.


Monday, November 19, 2012

Book Review: "Intelligent Document Capture with Ephesoft" by Pat Myers and Ike Kavas


Intelligent Document Capture with Ephesoft is a new book from Packt Publishing.  The primary authors of the book are Pat Myers, executive vice president of Zia, and Ike Kavas, founder and CTO of Ephesoft and a former Kofax employee.  Myers and Kavas together developed the Ephesoft training program.

What is Ephesoft?  Ephesoft software is used to process and capture paper, email and fax documents for use within ECM, ERP and other enterprise software systems.  ECM systems supported by Ephesoft include Alfresco, FileNet, SharePoint, and generic CMIS repositories.  Ephesoft's capabilities include document classification, separation, and data extraction.

Ephesoft is Open Source software and similar in functionality to proprietary systems like IBM Datacap, EMC Captiva, Kofax, and Athento.  It is built from Open Source components like Spring DM, Hibernate, Lucene, and jBPM.

At only 161 pages, this book on Ephesoft uses a format that's considerably shorter than many other technical books, and because of the large number of screenshots it contains, it is a relatively quick read.

The book provides a high-level overview of Ephesoft and describes a path that users can take to get an Ephesoft document capture system up and running quickly.  After finishing this book, the reader will have enough background to get started with building their own capture projects based on Ephesoft.   But that's not to say that this book is a definitive reference for Ephesoft.  Actually, there is much more detailed documentation available on-line that can be found in the Ephesoft wiki pages.  Free on-line training is also available from Ephesoft via the YouTube-based Ephesoft University.

The book consists of the following chapters:
  1. Introduction
    Discusses document capture history, benefits of capture, and a description of some typical high-ROI document capture use cases like mortgage loan processing, claims processing, and the handling of invoices and sales orders.
    At a high level, and in a way not specific to Ephesoft, the book describes different document classification methods like the use of barcodes, image layout classification, keywords, and content analysis.
    Similarly the book explains different types of extraction methods, like zonal OCR (optical character recognition), keywords, position information, and the look up of supplemental information from databases and other systems.
  2. A Quick Tour of Ephesoft
    This chapter describes each of the five tabs in the Ephesoft administrative user interface [see also the on-line Ephesoft Admin Manual]:
        - Batch Class Management
        - Batch Instance Management
        - Workflow Management
        - Folder Management
        - Reports
    It also describes the four tabs of the Operator User Interface [see also the on-line Ephesoft User Manual]:
        - Home/Batch List
        - Batch Details
        - Web Scanner
        - Batch Upload
    The description for each tab is based on a screenshot followed up with details about how to use the features available on the tab.
    This chapter is made available for free by Packt as a sample of the book and can be found online here.
  3. Creating a Batch Class
    This chapter gives an example of how to create a new batch class from the Ephesoft administrative user interface.
    The standard Ephesoft mailroom automation batch template is copied and modified to create a new custom batch class.  Then a new document type for that batch is added and configured.  With training, Ephesoft is able to recognize the document type for automatic classification and separation.
    With configuration, Ephesoft can extract content from scanned images and map the extracted data as key/value pairs to fields for the document type.  Field data can also be validated with validation rules using regular expressions.
  4. Processing a Batch
    This chapter uses the batch class created in chapter 3 and shows how incoming documents for this batch class can be processed.  Batch processing is performed from the Operator's interface.
    This is the shortest chapter in the book.  It shows how a batch is started, and from the Operator's interface, how the review and verification steps are performed.
  5. Core Ephesoft Features
    I found the book more interesting from this point on, because starting in this chapter the examples are a bit more detailed.
    For example, there is information here about the different types of document classification and how to configure them: Search, Image, Barcode, Automatic, and Programmatic.
    Also discussed is how to export document and field data, once captured, into a repository (primarily via CMIS) or a database.
  6. Ephesoft Extended Features
    This chapter gets into more advanced features available in Ephesoft.  For example, it describes some features of classification based on image and barcode recognition that are a bit more advanced than the techniques described in chapter 5.
    The Enterprise version of Ephesoft includes an integration with OpenText's RecoStar OCR engine -- this chapter describes how to enable and configure the option.
    Discussed here are product extension points where the user can write Java 'scripts' which customize and change standard product behavior.
    The chapter also talks about how the base Ephesoft product can be extended with plugins and how to write new custom plugins.
  7. Tips
    The final chapter collects a variety of general tips and pieces of information to optimize your use of Ephesoft.  It contains troubleshooting hints like how to configure logging and how to monitor batch processes.  It also discusses how to configure Ephesoft to use authentication with LDAP and Active Directory.
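
As a side note on the regular-expression field validation mentioned under chapter 3, the general idea can be sketched in a few lines of Python. This is not Ephesoft's configuration format or API; the field names and patterns are hypothetical and only illustrate checking extracted key/value pairs against per-field patterns.

```python
import re

# Hypothetical field names and patterns, for illustration only.
FIELD_PATTERNS = {
    "InvoiceNumber": r"INV-\d{6}",
    "InvoiceDate": r"\d{4}-\d{2}-\d{2}",
    "Total": r"\d+\.\d{2}",
}


def invalid_fields(fields):
    """Return the names of fields whose extracted value fails its pattern.

    A missing field is treated as an empty value and therefore as invalid.
    """
    return [
        name
        for name, pattern in FIELD_PATTERNS.items()
        if not re.fullmatch(pattern, fields.get(name, ""))
    ]
```

Fields reported as invalid are the ones an operator would then need to correct during a review step.
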
Would I recommend this book?  I'd highly recommend it to someone who is not currently familiar with Ephesoft and who wants to jump-start their use of the product.  But existing users of Ephesoft probably won't find much new information here.

Again, while almost all the information presented in the book can be found elsewhere on-line, the advantage of the book is that the information is presented here in a directed and easy-to-consume format.  What's missing from the book though are more in-depth examples and perhaps more information about reporting and working with scanners.


Support for the Ephesoft Enterprise edition is available via an annual subscription. [Assistance with Ephesoft is also available from partners.  Formtek is an Ephesoft Platinum partner and we have a number of successful Ephesoft implementations.]