External Content Extractor

This feature is currently in beta; using it in a production environment is not recommended.

The external content extractor detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making it useful for search engine indexing, content analysis, translation, and much more.

Why use Tika Server as Content Extractor?

Zextras uses a Tika library that shares the same Java Virtual Machine (JVM) as the mailbox. With Tika Server, you can run multiple Tika servers that index content separately from the mailbox. If a Tika server crashes, the mailbox JVM remains unaffected.

Switching to the Tika Server

You can run Tika Server as a Docker container on the same server as the mailbox, or on separate servers accessible by Zimbra.
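As a minimal sketch of the Docker option, the helper below only prints a docker run command so that it can be reviewed before use. The image name apache/tika (from Docker Hub), the container name tika-server, the container-side port 9998, and the function name tika_docker_cmd are all assumptions made for illustration; none of them come from the Zextras documentation.

```shell
#!/bin/sh
# Assumption: the apache/tika image from Docker Hub, listening on 9998 inside the container.
TIKA_IMAGE="apache/tika:latest"
TIKA_DEFAULT_PORT=9998

# Print (do not execute) a docker command that publishes Tika on the given host port.
# $1: host port to publish (optional, defaults to 9998)
tika_docker_cmd() {
    port="${1:-$TIKA_DEFAULT_PORT}"
    echo "docker run -d --name tika-server -p ${port}:9998 ${TIKA_IMAGE}"
}

# Usage (hypothetical): review the output, then run it on the chosen host.
#   tika_docker_cmd 9997
```

The host port chosen here is the one that later appears in the endpoint URL, for example http://test.example.com:9997/tika.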

Add a Tika Server

You can add a Tika server by running the following command on the Command Line Interface (CLI).

Format
zxsuite powerstore Indexing content-extraction-tool add {endpoint} [attr1 value1 [attr2 value2...]]
PARAMETER LIST
NAME           TYPE       EXPECTED VALUES
endpoint(M)    String
server(O)      String
global(O)      Boolean    true|false
(M) == mandatory parameter, (O) == optional parameter
Example
zxsuite powerstore Indexing content-extraction-tool add http://test.example.com:9997/tika
Explanation

Zextras adds an endpoint with the address http://test.example.com, listening on port 9997.

Add a Tika endpoint for this mailbox store

Run the following command as the zimbra user from the same server as the mailbox.

zxsuite powerstore Indexing content-extraction-tool add http://test.example.com:9998/tika

Add a Tika endpoint for mailbox store store1.example.com

Run the following command as the zimbra user from the same server as the mailbox.

zxsuite powerstore Indexing content-extraction-tool add http://test.example.com/tika server store1.example.com

Add a Tika endpoint for all mailbox stores (applies only to mailbox stores that don’t have any endpoint specified)

zxsuite powerstore Indexing content-extraction-tool add http://test.example.com:9998/tika global true
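The three scopes above differ only in the trailing attributes. The following sketch assembles the matching command line for each scope without running it; the function name tika_add_cmd and the scope keyword "local" are hypothetical helpers introduced here, while the zxsuite syntax is copied from the examples above.

```shell
#!/bin/sh
# Build (but do not run) the "add" command for one of the three scopes shown above.
# $1: endpoint URL
# $2: scope, one of "local" (default), "server", or "global"
# $3: mailbox store hostname, required only for the "server" scope
tika_add_cmd() {
    endpoint="$1"
    scope="${2:-local}"
    target="${3:-}"
    base="zxsuite powerstore Indexing content-extraction-tool add ${endpoint}"
    case "$scope" in
        local)
            echo "$base" ;;
        server)
            # A specific mailbox store must be named for this scope.
            [ -n "$target" ] || { echo "server scope needs a hostname" >&2; return 1; }
            echo "$base server ${target}" ;;
        global)
            echo "$base global true" ;;
        *)
            echo "unknown scope: $scope" >&2
            return 1 ;;
    esac
}

# Usage (hypothetical):
#   tika_add_cmd http://test.example.com/tika server store1.example.com
```

Printing the command rather than executing it keeps the sketch safe to try on a machine without zxsuite installed.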

List Tika Servers

You can list all Tika servers by running the following command on the Command Line Interface (CLI).

Command
zxsuite powerstore Indexing content-extraction-tool list
Sample Output
content-extraction-endpoints
                http://test.example.com:9998/tika
Explanation

Zextras lists the configured Tika servers with their addresses and the ports on which they are listening.
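For scripting, the endpoint URLs can be pulled out of the list output. This is a minimal sketch that assumes the output always looks like the sample above (a header line followed by one indented URL per line); the function name tika_list_endpoints is hypothetical.

```shell
#!/bin/sh
# Read "content-extraction-tool list" output on stdin and print only the endpoint URLs.
# Assumption: each endpoint appears on its own line, as in the sample output.
tika_list_endpoints() {
    awk '$1 ~ /^https?:\/\// { print $1 }'
}

# Usage (hypothetical):
#   zxsuite powerstore Indexing content-extraction-tool list | tika_list_endpoints
```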

Remove a Tika Server

You can remove a previously added Tika server by running the following command on the Command Line Interface (CLI).

Format
zxsuite powerstore Indexing content-extraction-tool remove {endpoint} [attr1 value1 [attr2 value2...]]
PARAMETER LIST
NAME           TYPE       EXPECTED VALUES
endpoint(M)    String
server(O)      String
global(O)      Boolean    true|false
(M) == mandatory parameter, (O) == optional parameter
Example
zxsuite powerstore Indexing content-extraction-tool remove http://test.example.com:9997/tika
Explanation

Zextras removes the server with the address http://test.example.com, listening on port 9997.

Is the Tika Server Running?

You can use the following methods to check if the Tika Server is running.

Graphical User Interface (GUI)
  1. Send an email with a new attachment.

  2. Search for the attachment.

Command Line Interface (CLI)
  1. Navigate to /opt/zimbra/log.

  2. View the contents of mailbox.log.

    • You can use tail -f to follow the log as new entries are written.

Sample Output
2021-07-07 15:24:25,444 INFO [qtp413601558-41832:https://mail.example.com/service/soap/SearchRequest] [name=user@mail.example.com;mid=136;oip=192.168.0.10;port=33008;ua=ZimbraWebClient - FF89 (Linux)/8.8.15_GA_4007;soapId=3084e510;] mailbox - Using http://test.example.com:9997/tika for content extraction
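
Instead of reading the log by eye, the relevant entries can be filtered. The sketch below assumes the log message keeps the wording shown in the sample output ("Using <endpoint> for content extraction"); the function name tika_endpoints_in_log is hypothetical.

```shell
#!/bin/sh
# Read mailbox.log lines on stdin and print each distinct Tika endpoint
# that the mailbox reports using for content extraction.
tika_endpoints_in_log() {
    sed -n 's/.* - Using \([^ ]*\) for content extraction.*/\1/p' | sort -u
}

# Usage (hypothetical):
#   tika_endpoints_in_log < /opt/zimbra/log/mailbox.log
```

An empty result after sending and searching for a new attachment suggests that no Tika endpoint was used for that extraction.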