This article gives an overview of OCR Server system requirements and instructions on using the OCR Server Sizing Estimator spreadsheet to determine how many OCR Servers you need for each detection server in your deployment.
The OCR Server Sizing Estimator spreadsheet is attached to this article.
The OCR Server has specific hardware, operating system, and server settings requirements that differ from those of the Data Loss Prevention Enforce Server and Data Loss Prevention detection servers.
Note: An OCR Server, whether it is a virtual or physical server, should not run other applications. It should be dedicated to OCR.
Processor: 3.0 GHz or more
Note: In a hyperthreaded environment, the number of logical cores is twice the number of physical cores. In a virtualized environment, the number of logical cores is the same as the number of vCPUs assigned to the VM that runs OCR Server.
Physical Memory: Total required memory is a function of the number of hardware threads/logical cores and consequently, the number of concurrent threads configured to run on that host.
For example, for a VM running with 8 logical cores or vCPUs, the total memory required is 3 GB + (200 MB x 8) = 4.6 GB.
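The memory arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes 1 GB = 1000 MB, which is the conversion the 4.6 GB figure implies.

```python
def ocr_server_memory_gb(logical_cores: int) -> float:
    """Estimate total memory (GB) for a host with the given logical core count."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    base_mb = 3000      # 3 GB base (assumes 1 GB = 1000 MB, as in the example)
    per_core_mb = 200   # 200 MB per logical core or vCPU
    return (base_mb + per_core_mb * logical_cores) / 1000


print(ocr_server_memory_gb(8))  # 4.6
```

For a VM with 16 vCPUs, the same calculation gives 3 GB + (200 MB x 16) = 6.2 GB.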
Disk Space: 32 GB
The Symantec Data Loss Prevention OCR Server can be installed on the following versions of Windows Server:
There are two OCR Server settings that must be configured in the OCR.properties file:
Set the value of num.ocr.workers to equal the number of logical cores.
Set the value of server.tomcat.max-threads to equal the value of num.ocr.workers + 1.
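As a rough sketch, the two values can be derived from the logical core count as follows. The helper name is hypothetical, and the key=value line format is an assumption based on the usual Java properties syntax; check the existing entries in your OCR.properties file before editing it.

```python
import os


def ocr_properties_lines(logical_cores=None):
    """Return the two OCR.properties entries for a given logical core count."""
    if logical_cores is None:
        # os.cpu_count() reports the logical CPUs visible to the operating system
        logical_cores = os.cpu_count() or 1
    workers = logical_cores       # num.ocr.workers = number of logical cores
    max_threads = workers + 1     # server.tomcat.max-threads = num.ocr.workers + 1
    return [
        f"num.ocr.workers={workers}",
        f"server.tomcat.max-threads={max_threads}",
    ]


for line in ocr_properties_lines(8):
    print(line)
```

For a host with 8 logical cores, this produces num.ocr.workers=8 and server.tomcat.max-threads=9.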
The OCR Server Sizing Estimator spreadsheet can help you to estimate the number of OCR Servers that you need in your Data Loss Prevention deployment. The spreadsheet makes the following assumptions:
Note: Not all images are sent to OCR by default. Extremely small images, photos that do not contain extractable text, and images in unsupported file formats are not sent to OCR. By default, the detection servers only extract the first 10 images (pages) from scanned multipage PDF or TIFF documents.
The ratio of OCR Servers to detection servers depends on the
A wide range of factors, including
can greatly affect recognition accuracy and performance. Generally, OCR performance and accuracy are best when processing high-contrast, high-DPI images that contain typewritten text in a single language and are free of image artifacts, rotations, or other types of transformations.
The OCR Server Sizing Estimator Spreadsheet (attached to this article) can help you estimate the number of OCR Servers per detection server. To come up with values for the spreadsheet, you must first analyze your message and file inspection volume and the number of images contained in each message or file.
Enter the values in the green cells to compute the number of OCR Servers per detection server. Note that because some cells are hidden for readability, the cell rows are not consecutively numbered.
When you change any of the values, the spreadsheet recalculates
This section gives more information on the values in the spreadsheet. Read this section for details on how to come up with the three values that you need to enter in the spreadsheet.
Number of message chains per detection server
Click a Message Chain tab at the bottom of the spreadsheet to choose the value for cell B:11. The tab selects the number of message chains for your detection server.
Note: The number of message chains is the number of messages that the Data Loss Prevention server can process in parallel. This value is typically set to 1x or 2x the number of CPU cores on your host system. You can change this value by editing the Messaging.NumChains advanced setting.
Percentage of messages containing images requiring OCR (OCR messages)
The value in cell B:10 is the percentage of message traffic that contains images that are sent to OCR.
For a Discover scan on a repository containing only scanned images, this is 100%. For email messages where approximately 1 message in every 20 contains an image file that is submitted to OCR, this is 5%.
Note that this value does not count the number of images within the message. For example, if one out of every ten messages contains a scanned document with 10 pages, insert 10% in this field. The focus is on the percentage of messages and not the number of images.
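If you have message counts from your own traffic analysis, the percentage can be computed as in this minimal sketch. The function and parameter names are hypothetical; the calculation counts messages, not images, as described above.

```python
def ocr_message_percentage(total_messages: int, ocr_messages: int) -> float:
    """Percentage of messages that contain at least one image sent to OCR."""
    if total_messages <= 0:
        raise ValueError("total_messages must be positive")
    if not 0 <= ocr_messages <= total_messages:
        raise ValueError("ocr_messages must be between 0 and total_messages")
    return 100.0 * ocr_messages / total_messages


print(ocr_message_percentage(20, 1))   # 5.0  (1 email in 20 has an OCR image)
print(ocr_message_percentage(10, 1))   # 10.0 (1 in 10, regardless of page count)
```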
Maximum acceptable rate of OCR message timeouts
Determine your threshold for the acceptable rate of OCR Server timeouts based on your deployment.
Estimated average number of images per OCR message
Determine this estimate based on the types of images that are processed in your deployment. For example, a 10-page scanned PDF file contains 10 images. A single screenshot saved in a JPG file only contains a single image.
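One way to arrive at this estimate is to sample some OCR messages and average their image counts. The sketch below assumes you have already collected such a sample; the function name and the sample data are illustrative only.

```python
def average_images_per_ocr_message(image_counts):
    """Average number of images across a sample of OCR messages.

    image_counts holds one entry per OCR message, for example 10 for a
    10-page scanned PDF and 1 for a single JPG screenshot.
    """
    counts = [c for c in image_counts if c > 0]  # ignore messages without OCR images
    if not counts:
        raise ValueError("sample contains no OCR messages")
    return sum(counts) / len(counts)


# Hypothetical sample: one 10-page scanned PDF and two single screenshots
print(average_images_per_ocr_message([10, 1, 1]))  # 4.0
```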
Number of concurrent OCR messages to handle
The maximum number of concurrent OCR messages that can be processed by the OCR Server. Any messages over this amount fail.
For an OCR message, an OCR request is generated for each individual image. For example, a message containing a 10-page scanned document would generate 10 separate OCR requests. Depending on your configuration, these requests may execute on one or more separate OCR Servers.
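Because each image generates its own OCR request, the number of requests in flight can be approximated from the number of concurrent OCR messages and the average images per message. The following is a back-of-the-envelope sketch with hypothetical names; it is not the formula that the spreadsheet uses.

```python
import math


def estimated_concurrent_ocr_requests(concurrent_ocr_messages: int,
                                      avg_images_per_message: float) -> int:
    """Approximate OCR requests generated by messages processed at the same time."""
    if concurrent_ocr_messages < 0 or avg_images_per_message < 0:
        raise ValueError("inputs must be non-negative")
    # One OCR request per image; round up so that a partial image counts as one
    return math.ceil(concurrent_ocr_messages * avg_images_per_message)


# Four concurrent messages, each a 10-page scanned document
print(estimated_concurrent_ocr_requests(4, 10))  # 40
```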
Number of OCR Servers per detection server
Given the values that you entered, this is the number of OCR Servers that are required to support each detection server.