When data is sent from the Outlook 2007/2010 client, Network Prevent for Email fails to detect it in the email body. If the same data is present in an attachment, DLP detects it successfully. This issue happens when the source data contains multilingual characters (CJK characters, including alphanumerics).
This issue occurs if the customer uses multilingual characters in the body of an email.
(A) The Outlook client chooses an encoding type based on a proprietary implementation of character set detection when it can't match a message body with the default one set in Outlook. For example, if Outlook detects Chinese characters with code points within the GBK character set while its default charset is ISO or Latin, Outlook sets the MIME Content-Type charset for the body's multipart/alternative headers to gb2312.
(B) It's important to note that gb2312 as implemented by Microsoft is a subset of GBK. It is also important to note that there is more than one implementation of GB2312: Microsoft uses one interpretation for .NET which is incompatible with other programming languages (e.g. Java, Python, ...) and frameworks.
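The subset relationship in (B) can be illustrated with a minimal sketch. It uses only Python's built-in gb2312 and gbk codecs, which are an assumption made for illustration and are not the implementation used by Outlook, .NET, or DLP. The helper name can_encode and the two sample characters are hypothetical choices.

```python
# Illustration using Python's built-in codecs only; Outlook's and .NET's
# GB2312 handling may differ from what is shown here.

def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

SIMPLIFIED = "\u4e2d"   # simplified character, present in both GB2312 and GBK
TRADITIONAL = "\u9ad4"  # traditional character, present in GBK but not in strict GB2312

for charset in ("gb2312", "gbk", "utf-8"):
    # Print which sample characters each charset can represent.
    print(charset, can_encode(SIMPLIFIED, charset), can_encode(TRADITIONAL, charset))
```

A character that both charsets share produces identical bytes under either codec, which is why a mislabelled body can decode partly correctly and still lose the characters that exist only in GBK.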
As per the article, for "Multilingual text (different scripts)" Outlook selects the encoding "Unicode (UTF-8)". If Outlook fails to identify the data source, it encodes with the default type "Western European (ISO)".
Engineering analysis has revealed that Microsoft uses an alternate form of GB2312 encoding in the current scenario. As an aside, GB2312 is a much older format than GBK and UTF-8, with minimal market share and multiple implementations and extensions. In general, it is recommended that international applications use the more common UTF-8 or Unicode formats.
When DLP receives the email body byte stream, it uses the encoding set in the MIME header (GB2312) to convert the bytes to characters. This causes incorrect decoding and hence missed detection.
DLP does not guess the encoding; it relies on the encoding information set in the MIME headers.
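The effect of trusting the MIME header can be sketched with Python's standard email package. This is an illustration under stated assumptions, not DLP's actual decoder: it assumes the body bytes are really GBK while the header declares gb2312, and the names KEYWORD, build_raw_message, and decode_body_by_header are hypothetical.

```python
import base64
from email import message_from_bytes

KEYWORD = "\u9ad4"  # hypothetical keyword a detection policy might look for

def build_raw_message(body: str, actual_charset: str, declared_charset: str) -> bytes:
    """Build a minimal single-part message whose header may mislabel the body."""
    payload = base64.b64encode(body.encode(actual_charset))
    headers = (
        f"Content-Type: text/plain; charset={declared_charset}\r\n"
        "Content-Transfer-Encoding: base64\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + payload + b"\r\n"

def decode_body_by_header(raw: bytes) -> str:
    """Decode the body using only the charset declared in the MIME header."""
    msg = message_from_bytes(raw)
    declared = msg.get_content_charset() or "us-ascii"
    body_bytes = msg.get_payload(decode=True)
    # No charset guessing: undecodable bytes become replacement characters.
    return body_bytes.decode(declared, errors="replace")

raw = build_raw_message("ID " + KEYWORD + " 42", actual_charset="gbk", declared_charset="gb2312")
decoded = decode_body_by_header(raw)
print(KEYWORD in decoded)  # whether the keyword survived header-based decoding
```

In this sketch the ASCII portion of the body survives, but the GBK-only character does not, so a match on that keyword is missed even though the message contained it.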
DLP functionality is working as designed.
Engineering strongly recommends setting the Outlook client's encoding to UTF-8.
This behaviour is the same for all DLP versions.
Forcing UTF-8, rather than letting Outlook choose the MIME charset, may help limit errors in systems that strictly enforce the GB2312 character set.
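As a rough check of why a UTF-8 label avoids the mismatch, the sketch below builds a message with Python's email package, declares utf-8, and decodes it again using only the header. It is an illustration, not Outlook's or DLP's behaviour, and the function name utf8_round_trip is hypothetical.

```python
from email import message_from_bytes
from email.message import EmailMessage

def utf8_round_trip(body: str) -> tuple:
    """Serialize body as UTF-8 and decode it using only the declared charset."""
    msg = EmailMessage()
    msg.set_content(body, charset="utf-8")
    parsed = message_from_bytes(msg.as_bytes())
    declared = parsed.get_content_charset()
    text = parsed.get_payload(decode=True).decode(declared)
    return declared, text

# Mixed simplified, traditional, and alphanumeric content.
declared, text = utf8_round_trip("ID \u4e2d\u9ad4 42")
print(declared, "\u9ad4" in text)
```

Because the declared charset and the actual byte encoding agree, a decoder that relies on the header alone recovers every character.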
The workaround is to change the Outlook client encoding to UTF-8.
Below are the steps to change the encoding:
Click the orange "File" tab in the top left corner of the Outlook 2010 window and click "Options" on the pull-down menu. A new window titled "Outlook Options" appears.
Click the "Advanced" heading on the left side of the window and then scroll to the "International Options" heading near the bottom.
Place a check in the box labeled "Automatically select encoding for outgoing messages" and click "Unicode (UTF-8)" on the drop-down menu.
Place a check in the box labeled "Automatically select encoding for outgoing vCards" and click "Unicode (UTF-8)" on the drop-down menu.
Place a check in the box labeled "Allow UTF-8 support for the mailto: protocol."
Outlook encoding default:
Change the Outlook encoding to UTF-8:
Alternatively, the encoding can be changed using Group Policy. The reference article is below: