This functionality was modified in an update. For more information, see Extract data from multiple documents (modified in an update).
The following procedure extracts content from multiple documents that are in a working
or current revision state. It extracts tags and their relationships from those documents.
To extract content using pre-processed content files, you must ensure that the
corresponding file is attached to the document along with ContentFile.xml. You
must also attach GraphicsMapFile.xml if the file type supports graphical navigation.
How can I configure DCOM to allow content extraction from Microsoft Office files?
You must enable DCOM permissions before HxGN SDx can access Microsoft Office applications.
This is mandatory if you want to extract content from:
To set the DCOM configuration for the respective file type application, complete the
following steps:
- Click Start > Administrative Tools > Component Services.
- In the tree view, expand Component Services > Computers > My Computer > DCOM Config.
- Based on the Microsoft Office file type, locate and right-click the respective DCOM Config component service:
- On the shortcut menu, click Properties.
- In the General tab, set the Authentication Level to None.
- In the Identity tab, select The Launching User option.
- In the Security tab, set the Launch and Activation Permissions to Customize, and click Edit.
- Add the Administrators created by Server Manager.
- Select the Allow check box for the following items:
  - Local Launch
  - Remote Launch
  - Local Activation
  - Remote Activation
  - Read
  - Special permissions
What happens when content extraction from multiple documents fails?
When content extraction from multiple documents fails, a content discovery task is
automatically created to find the problem. It does this by re-processing the large
document set in progressively smaller batches: first in batches of 100, then in
batches of 10, and finally in batches of 1. For each batch, a child content discovery
task is created under the master content discovery task.
For example, to re-process 1000 documents, 10 content discovery tasks are created,
each with 100 documents. Each batch of 100 documents that fails is then re-processed
as 10 child content discovery tasks, each with a batch of 10 documents. Finally,
each batch of 10 documents that fails is re-processed as 10 child content discovery
tasks, each with 1 document, to find the failed document.
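The drill-down described above can be illustrated with a small simulation. This is only a sketch of the batching idea, not the product's implementation: `find_failed_documents` and the `batch_fails` callback are hypothetical names, and `batch_fails` stands in for running extraction on a batch and reporting whether it failed.

```python
from typing import Callable, List, Sequence

BATCH_SIZES = (100, 10, 1)  # batch sizes named in the text


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size`."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def find_failed_documents(
    documents: Sequence[str],
    batch_fails: Callable[[Sequence[str]], bool],
    level: int = 0,
) -> List[str]:
    """Return the documents that still fail once isolated in a batch of 1.

    `batch_fails` is a hypothetical stand-in for extraction on one batch;
    it returns True when the batch fails.
    """
    failed: List[str] = []
    for batch in chunk(documents, BATCH_SIZES[level]):
        if not batch_fails(batch):
            continue  # a passing batch is not split any further
        if BATCH_SIZES[level] == 1:
            failed.extend(batch)  # single-document batch: culprit found
        else:
            # Re-process the failing batch at the next smaller batch size.
            failed.extend(find_failed_documents(batch, batch_fails, level + 1))
    return failed


# Illustrative data: 1000 documents, two of which cause extraction to fail.
docs = [f"DOC-{n:04d}" for n in range(1000)]
bad = {"DOC-0042", "DOC-0777"}
result = find_failed_documents(docs, lambda batch: any(d in bad for d in batch))
print(result)  # ['DOC-0042', 'DOC-0777']
```

Only the batches that fail are split further, so the number of child tasks grows with the number of failing batches rather than with the size of the document set.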
Click Notifications to view the issue in the failed document that corresponds to the failed child content
discovery task.
How is the content extracted from multiple documents when multiple files are attached
to one or more documents?
When processing multiple documents, where one or more documents have more than one
file attached, the following scenarios are considered:
- If different file types are attached, the software first checks for a file with the ISPFNMasterFile interface and, if one is found, extracts content from that file.
- If a file with the ISPFNMasterFile interface is not found, content is extracted from the file with the highest priority. By default, files with the .dwg file extension are set as the highest priority. However, the priority of the file can be changed in the Data Capture Central settings module in the Desktop Client. For more information, see Manage file types and prioritize them for content extraction.
- If more than one file of the highest priority is attached, the software cannot select a file, and content extraction fails.
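The selection rules above can be sketched as follows. This is a hedged illustration rather than the software's actual logic: `AttachedFile`, `select_file`, and the `DEFAULT_PRIORITY` table are hypothetical, the `is_master` flag stands in for "has the ISPFNMasterFile interface", and only the .dwg default comes from the text. The sketch also assumes that "highest priority" means the highest priority among the attached files.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AttachedFile:
    name: str
    is_master: bool = False  # stands in for the ISPFNMasterFile interface


# Hypothetical priority table: a lower number means a higher priority.
# Only ".dwg" as the default highest priority is taken from the text.
DEFAULT_PRIORITY: Dict[str, int] = {".dwg": 1, ".pdf": 2}


def select_file(files: List[AttachedFile],
                priority: Dict[str, int] = DEFAULT_PRIORITY) -> AttachedFile:
    """Pick the file to extract from, or raise if no single file can be chosen."""
    # Rule 1: a master file wins regardless of file type.
    masters = [f for f in files if f.is_master]
    if masters:
        return masters[0]  # assumption: at most one master file is attached

    # Rule 2: otherwise take the file whose type has the highest priority.
    def rank(f: AttachedFile) -> int:
        return priority[os.path.splitext(f.name)[1].lower()]

    ranked = [f for f in files
              if os.path.splitext(f.name)[1].lower() in priority]
    if not ranked:
        raise ValueError("no attached file has a configured priority")
    best = min(rank(f) for f in ranked)
    top = [f for f in ranked if rank(f) == best]

    # Rule 3: several files at the highest priority cannot be disambiguated.
    if len(top) > 1:
        raise ValueError("more than one file of the highest priority is attached")
    return top[0]


print(select_file([AttachedFile("P-100.pdf"), AttachedFile("P-100.dwg")]).name)
# P-100.dwg
```

Raising an error in the ambiguous case mirrors the documented behavior, where extraction fails because no file can be selected.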
- Click Documents > All Documents.
- To extract content from multiple documents, select two or more documents from the All Documents list, and click Actions > Extract Content.
What is the purpose of a default template group?
When you use Data Capture Content Discovery Task in the Desktop Client or Extract Content in the Web Client to extract content from multiple documents, the software automatically
applies the templates and rules defined for the default template group. The default
template group is considered only when a PDF file or a drawing file is attached to
the document. To successfully extract content, ensure that the templates and rules
are configured for the template group. If you have not set any template group as the
default, the software automatically uses the DefaultDrawingTemplateGroup template group to extract the content. This default template group is provided with the software.
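The fallback behavior can be summarized in a few lines. This is a minimal sketch under stated assumptions: `resolve_template_group` is a hypothetical helper, and the extension set is illustrative, because the text only says "a PDF file or a drawing file" without listing extensions.

```python
import os
from typing import Optional

FALLBACK_TEMPLATE_GROUP = "DefaultDrawingTemplateGroup"  # named in the text

# Hypothetical extension list for "a PDF file or a drawing file".
TEMPLATE_DRIVEN_EXTENSIONS = {".pdf", ".dwg"}


def resolve_template_group(file_name: str,
                           configured_default: Optional[str]) -> Optional[str]:
    """Return the template group to apply, or None when none is considered."""
    ext = os.path.splitext(file_name)[1].lower()
    if ext not in TEMPLATE_DRIVEN_EXTENSIONS:
        return None  # template groups apply only to PDF and drawing files
    # Use the configured default if one is set, else the shipped group.
    return configured_default or FALLBACK_TEMPLATE_GROUP


print(resolve_template_group("P-100.dwg", None))             # DefaultDrawingTemplateGroup
print(resolve_template_group("P-100.pdf", "SiteTemplates"))  # SiteTemplates
print(resolve_template_group("P-100.docx", "SiteTemplates")) # None
```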
In the Web Client, when you extract content from a single document, the software automatically
applies the templates and rules defined for the auto-selected default template group.
The default template group is considered only when a PDF file or a drawing file is
attached to the document. However, you can select and apply any other
template group instead of the default template group. For more information, see Extract data from a document.
For the auto-selected default template group, the Match Tag Patterns option is pre-selected.
- By default, the property Is Data Capture Rel is set to True on the document-to-tag relationships SPFNDocRevMasterTag, SPFNDocRevAliasTag, FDWDocRevTag, and SPFNFDWDocRevChildTag for Data Capture tags.
- To extract content from drawing and PDF files, the software applies the templates and rules from the template group that is set as the default. For more information, see Manage drawing reader pre-processor templates and template groups and Manage PDF reader pre-processor templates.
- When complete, click Notifications to view the status of content extraction.
- FDW tags are created without applying the ENS definition.
- To process documents to which 3D models are attached, you must ensure that the corresponding pre-processed content files are available. If pre-processed content files are not available, you must process one document at a time instead of multiple documents. For more information on how to extract content from a single document, see Extract data from a document.
- When processing multiple documents, content cannot be extracted from a document to which multiple files are attached. In such a scenario, process one document at a time instead of multiple documents, which allows you to select the file from which content is extracted. For more information, see Extract data from a document.
- Based on the attached file type, the default reader is automatically selected to process the file and extract content. The reader is assigned based on the default settings configured in the Data Capture Central Settings module in the Desktop Client. For more information on the file types and the supported readers, see Manage reader and application relationship and Manage file types and prioritize them for content extraction.
- To view the status of content extraction from a selected document:
For more information about the status of a document processed using Data Capture, see Data Capture Document Status.