This whitepaper describes our experience developing a connector between an eDiscovery product and Symantec's Enterprise Vault email archiving software. The connector was developed following the JSR 170 standard to separate infrastructural services from application services.
Enterprise Vault Connector
Enterprise Vault is a leading email and content archiving solution from Symantec Corporation. It enables companies to store, manage, and discover unstructured information across the organization. It is primarily used to archive emails within an enterprise, and is usually tightly integrated with MS Exchange Server and IBM Lotus, two of the leading email management platforms.
Enterprise Vault manages content using automated processes that are defined and controlled by archiving and retention policies. Within a very short span of time, large organisations find that their data stores are in the range of several terabytes.
About eDiscovery and Early Case Assessment
Over the years, companies have gathered and stored immense volumes of data that require sophisticated storage solutions. As a result, many leading vendors came up with their own approaches to storing both the structured and unstructured data generated in an enterprise. With the growing amount of unstructured data, large enterprises have to support multiple storage solutions that hold different formats such as emails, documents, web content, etc. Once the data is stored in various content management or archiving solutions, enterprises have to spend huge amounts of money when they need to discover, search and port data from these sources for legal or other requirements.
Not all of these sources lend themselves well to enterprise search. The IT teams in companies facing litigation are often faced with demands from the legal teams such as "Give all the emails exchanged between Person A and company X" or "Give all the files that were stored by User Y across all systems with such-and-such keywords". With the limited discovery and search features in different IT systems, the IT teams end up handing over large volumes of unfiltered data. This increases the overall litigation costs for companies, as legal firms charge based on the volume of data that IT provides to them.
The Enterprise Vault API : Technical Details
Enterprise Vault provides API support for customization and integration. The various Enterprise Vault (EV) APIs are implemented as COM or .NET objects, which expose task-specific interfaces. At a high level, these APIs are categorized by the tasks they perform:
- Content Management API
- NSF Manager API
- Search API
- Retention API
- Filtering and Migrating APIs
The EV runtime installer registers the COM objects for the Content Management, Search and Retention APIs, and provides interoperability libraries so that .NET managed code can bind to them. The two recommended ways of deploying .NET applications that use EV components are:
- Build against the set of Primary Interop Assemblies (PIAs) provided in the EV SDK
- Generate interop libraries – indirectly as part of the build process, or directly by using the .NET Type Library Importer tool (tlbimp.exe)
The JSR 170 Specification : Technical Details
The connector framework was also JSR 170 Level 2 compliant. JSR-170 defines "a standard, independent way to access content bi-directionally on a granular level within a content repository," and goes on to define a content repository as "a high-level information management system that is a superset of traditional data repositories, [which] implements 'content services' such as: author based versioning, full textual searching, fine grained access control, content categorization and content event monitoring."
The Java Content Repository API (JSR-170) is an attempt to standardize an API for accessing a content repository. It defines a programmatic interface for connecting to a content repository and can be thought of as a JDBC-like API for content repositories, allowing us to develop programs independently of any particular content repository implementation.
Level 1 compliance defines a read-only repository and includes functionality for reading repository content, exporting content to XML, and searching. Level 2 compliance defines a writable repository. In addition to Level 1's functionality, it defines methods for writing content and importing content from XML.
<<IMAGE>>
Figure 2: JSR Compliance
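The difference between the two compliance levels can be illustrated with a small sketch. This is not the javax.jcr API (which is a Java API built around types such as Repository, Session, Node and Property); the class and method names below are hypothetical and only mirror the capabilities listed for each level: reading, searching and XML export for Level 1, plus writing and XML import for Level 2.

```python
import xml.etree.ElementTree as ET


class ReadOnlyRepository:
    """Level 1 style capabilities: read content, search, export to XML."""

    def __init__(self, nodes):
        # nodes maps a path such as "/archive/msg1" to a dict of properties
        self._nodes = {path: dict(props) for path, props in nodes.items()}

    def get_properties(self, path):
        return dict(self._nodes[path])

    def search(self, text):
        # Naive case-insensitive full-text search over property values
        needle = text.lower()
        return sorted(
            path
            for path, props in self._nodes.items()
            if any(needle in str(value).lower() for value in props.values())
        )

    def export_xml(self):
        root = ET.Element("repository")
        for path in sorted(self._nodes):
            node = ET.SubElement(root, "node", path=path)
            for name, value in sorted(self._nodes[path].items()):
                ET.SubElement(node, "property", name=name).text = str(value)
        return ET.tostring(root, encoding="unicode")


class WritableRepository(ReadOnlyRepository):
    """Level 2 style capabilities: everything above, plus writing and XML import."""

    def set_property(self, path, name, value):
        self._nodes.setdefault(path, {})[name] = value

    def import_xml(self, xml_text):
        for node in ET.fromstring(xml_text).findall("node"):
            for prop in node.findall("property"):
                self.set_property(node.get("path"), prop.get("name"), prop.text or "")
```

A Level 2 connector has to offer the whole surface of WritableRepository, which is why the connector described here supports both Import and Export against Enterprise Vault.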
To be JSR 170 Level 2 compliant, the connector implemented the relevant interfaces from the following javax.jcr packages:
- javax.jcr : Interfaces and classes for content repository
- javax.jcr.nodetype : Interfaces and classes for content repository node type functionality
- javax.jcr.util : Utility classes for the content repository API
The connector provides a UI layer that allows users to configure and manage the connector. It exposes the following functions:
- Validate : Function to validate the connection and user permissions for EV
- Discover : Function to discover and explore all the vault stores, archives and items in EV
- Scan : Function to scan all the items and create a manifest file with unique identifiers for each item
- Import : Function to import data from EV based on the manifest file created using scan.
- Export : Function to write data into EV from external source
- Activate : Function to activate the connection with pre-existing configuration
- Monitor : Function to monitor connection between connector and EV
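The Discover and Scan functions above both walk the same hierarchy of vault stores, archives and items. A minimal sketch of how a scan could turn that hierarchy into manifest entries follows. The nested-dict input, the field names, and the hash-based identifier are all assumptions made for illustration; a real connector would more likely use identifiers supplied by Enterprise Vault itself.

```python
import hashlib


def scan(vault_stores):
    """Walk vault store -> archive -> item and build manifest entries.

    vault_stores is a hypothetical nested dict of the form
    {store_name: {archive_name: [item_name, ...]}}.
    """
    manifest = []
    for store in sorted(vault_stores):
        for archive in sorted(vault_stores[store]):
            for item in sorted(vault_stores[store][archive]):
                location = f"{store}/{archive}/{item}"
                # Assumption: derive a stable identifier from the item's location
                item_id = hashlib.sha256(location.encode("utf-8")).hexdigest()[:16]
                manifest.append({"id": item_id, "location": location})
    return manifest
```

Because the identifier depends only on the location, rescanning an unchanged repository yields the same manifest, which makes successive manifests easy to compare.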
Architecture Overview for the High-Performance Connector
The eDiscovery product architecture was implemented in J2EE, whereas the Symantec Enterprise Vault client APIs are implemented as COM objects. Hence, one of the primary design criteria was interoperability between the two. The other design considerations were:
- The connector should be JSR 170 Level 2 compliant
- The design should provide all the connector functions listed above
- The design should be capable of handling large amounts of data for discovery and import
Since the EV APIs are implemented as COM or .NET objects, a .NET Adapter was implemented to support the functionality required by the connector. The .NET Adapter interacts with the EV COM components for functions such as:
- Import
- Export Validation
- Export
The .NET Adapter was built against the set of Primary Interop Assemblies (PIAs) provided by the EV SDK and used the EV APIs for these tasks.
For performance reasons and the way data is stored in EV, the following functionalities were implemented using direct SQL queries:
- Validate
- Discover
- Scan
The standard design patterns of Façade, Delegate, DTO, and DAO were used to design the classes for the .NET application running on the EV client. The features of the .NET Adapter were implemented as method calls and exposed via WCF Web Services. WCF was chosen over ASP.NET Web Services for the following reasons:
- Flexibility to expose in different forms – hosting options (Windows service, IIS, Windows client), protocol options (HTTP, TCP, named pipes, MSMQ)
- Interoperability with Java
- Ease of setting up instances – configured as singleton, per call, per session
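The way the Façade, DTO and DAO patterns mentioned above fit together can be sketched briefly. The actual adapter was written in .NET; Python is used here only for illustration, and every class, field and method name is hypothetical. The DAO hides how rows are fetched (for example by direct SQL), the DTO carries plain data across the service boundary, and the façade is the single entry point that a service layer would expose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveDTO:
    """Plain data carrier returned to callers; it has no behaviour."""
    archive_id: str
    name: str
    item_count: int


class ArchiveDAO:
    """Data access object; hides how archive rows are fetched."""

    def __init__(self, rows):
        self._rows = rows  # stand-in for a SQL result set

    def find_all(self):
        return [ArchiveDTO(*row) for row in self._rows]


class ConnectorFacade:
    """Single entry point that a service layer would expose."""

    def __init__(self, archive_dao):
        self._archive_dao = archive_dao

    def discover(self):
        return self._archive_dao.find_all()

    def validate(self):
        # Hypothetical check: the connection is usable if archives can be listed
        try:
            self._archive_dao.find_all()
            return True
        except Exception:
            return False
```

Keeping callers dependent only on the façade and the DTOs means the data access strategy can change (API calls versus direct SQL) without affecting the Java side.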
A high level representation of this architecture is shown below:
The Connector Lifecycle
Once the connector is installed and deployed at the customer location and the connection is set up and validated, a user can define a Collection, which is a data area used by the eDiscovery framework for all future search, discovery or analytics operations. After defining a collection, the user can choose to import files into it. The import process takes place in the following steps:
- Scanning – As the import is initiated, the eDiscovery product sends a scan request to the connector. The connector scans all the data areas in Enterprise Vault and returns a manifest file that contains the location of all the items that can be imported into the platform. The manifest file also captures other metadata useful for eDiscovery and processing.
- Importing – After receiving the manifest file, the eDiscovery product starts a parallel process using multiple threads to import (copy) the files into the local data area in the framework.
- Processing – After importing, the data is processed and metadata is created that the eDiscovery platform can later use for all search, discovery or analytics operations.
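The importing and processing steps above can be sketched as a manifest-driven parallel copy. This is a minimal illustration under stated assumptions, not the product's implementation: the manifest entry fields, the fetch callable and the "size" metadata are hypothetical, and a thread pool stands in for the product's own threading.

```python
from concurrent.futures import ThreadPoolExecutor


def import_item(entry, fetch):
    """Copy one item; fetch is a hypothetical callable returning the item's bytes."""
    return entry["id"], fetch(entry["location"])


def run_import(manifest, fetch, workers=4):
    """Import every manifest entry in parallel, then derive simple metadata."""
    collection = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda entry: import_item(entry, fetch), manifest)
        for item_id, data in results:
            # Processing step: record metadata that later searches could use
            collection[item_id] = {"size": len(data), "data": data}
    return collection
```

Because each manifest entry is independent, the worker count can be tuned to the available hardware without changing the rest of the flow.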
Performance Considerations
There were two primary performance criteria that had to be kept in mind while designing the connector: the speed at which the connector could scan the entire repository and send the metadata to the eDiscovery product, and the rate at which data could be imported from the repository into the eDiscovery platform. The first challenge was solved in two steps. First, a background process was created that ran at a fixed interval, built a manifest file with the metadata, and stored it locally. Second, when a request for metadata was made, the connector read the latest manifest file from local storage and sent it as the response.
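The two-step approach to scan performance amounts to a periodically refreshed cache. The sketch below keeps the manifest in memory rather than in a local file, and the class name, the build_manifest callable and the interval parameter are assumptions made for illustration.

```python
import threading


class ManifestCache:
    """Rebuilds the manifest in the background and serves the latest copy."""

    def __init__(self, build_manifest, interval_seconds):
        self._build = build_manifest          # the slow repository scan
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._latest = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.refresh()                        # ensure a manifest exists before serving
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def refresh(self):
        manifest = self._build()              # scan happens outside the lock
        with self._lock:
            self._latest = manifest

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called
        while not self._stop.wait(self._interval):
            self.refresh()

    def latest(self):
        with self._lock:
            return self._latest
```

A request for metadata then costs only a call to latest(), regardless of how long a full scan takes. The trade-off is that the response can be stale by up to one refresh interval.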
The second performance challenge was a twofold problem. The first part was how to import large files across the wire. This was solved by sending the data in small chunks of a configurable size. Large files could therefore be sent in multiple chunks, and users had the flexibility to define the chunk size based on their hardware setup. The second part was sending a very large number of items across as quickly as possible. The eDiscovery product was built on grid computing technology, which gave it the ability to spin up multiple threads and run import operations simultaneously.
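The chunked transfer described above can be sketched as follows. The function names and the sequence-number scheme are assumptions for illustration; the source only states that data was sent in chunks of a configurable size.

```python
def iter_chunks(data, chunk_size):
    """Yield (sequence_number, chunk) pairs of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for seq, offset in enumerate(range(0, len(data), chunk_size)):
        yield seq, data[offset:offset + chunk_size]


def reassemble(chunks):
    """Rebuild the payload on the receiving side, tolerating out-of-order arrival."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return b"".join(chunk for _, chunk in ordered)
```

For example, a 2,560-byte payload with a chunk size of 1,000 is sent as three chunks of 1,000, 1,000 and 560 bytes. A smaller chunk size lowers the memory needed per request, while a larger one reduces the number of round trips.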


