Issue with the font encoding impacts standard REGEX scans - Essential Objects, Inc. Support Forum

Statefarm

Posted: Friday, April 15, 2016 11:30:06 AM

Rank: Newbie
Groups: Member

Joined: 4/23/2015
Posts: 1

We have encountered an issue with the font encoding used during conversion of Lotus Notes documents to PDF/A using the HTML Converter (EO.PDF for .NET). After conversion, we are running regular expressions (REGEX) against these PDFs.

Although spaces and hyphens in the converted PDF look correct visually, they are stored with different character codes, which our default REGEX scan code does not recognize. Spaces are stored as code 120 and hyphens as code 162.

After conversion, all fonts used within the document appear to be of type TrueType (CID) with Identity-H encoding. As a result, our regular expressions fail to identify the characters described above.

Are there any plans for the Essential Objects product to allow clients to change the encoding used in the conversion process?

We are running regular expressions to scan these documents for known string formats such as credit card numbers, SSNs, etc. To ensure our REGEX works in all cases, we are updating our code to account for these variations of spaces and hyphens. We are concerned there may be other characters for which we should implement this workaround.

Is anyone aware of any mapping document between commonly seen characters (hyphen, comma, whitespace etc.) and their Identity-H counterparts?
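
For readers following the thread, the workaround described in this post might look roughly like the sketch below. It assumes the extracted text surfaces the converted separators as the code points reported above (120 for space, 162 for hyphen); the SSN-style pattern and the names `char_class` and `find_ssn_like` are hypothetical and only illustrate the idea.

```python
import re

# Assumption: extracted text exposes the converted separators as these
# code points (values taken from the post; they may differ per font).
SPACE_CODES = (32, 120)    # regular space, converted space
HYPHEN_CODES = (45, 162)   # regular hyphen, converted hyphen


def char_class(codes):
    """Build a regex character class matching any of the given code points."""
    return "[" + "".join(re.escape(chr(c)) for c in codes) + "]"


# Accept either the normal or the converted form of each separator.
SEP = char_class(SPACE_CODES + HYPHEN_CODES)
SSN_PATTERN = re.compile(r"\d{3}" + SEP + r"\d{2}" + SEP + r"\d{4}")


def find_ssn_like(text):
    """Return every SSN-shaped substring, tolerating converted separators."""
    return [m.group(0) for m in SSN_PATTERN.finditer(text)]
```

Because code 120 is also the letter "x" in ordinary text, a pattern like this can produce false positives, so it is a stopgap rather than a substitute for decoding the text properly.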

eo_support

Posted: Friday, April 15, 2016 12:17:18 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,385

Hi,

We do not have any plans to support other encodings. The encoding is mostly determined by the font. Windows gives us TrueType font data, and TrueType fonts always use Identity-H, so no other encoding is appropriate. As such, changing the encoding on our end is not the proper way to address your issue. You will need to look into the TrueType font data in order to find the character value.
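
One way to read this advice is to invert the font's character map and use it to decode the stored codes before running any REGEX. The sketch below assumes the stored codes are glyph IDs and that a mapping from Unicode code point to glyph ID has already been read from the font (a font library such as fontTools can read the `cmap` table). The mapping shown is hypothetical: the IDs for space and hyphen come from the original post, and the digit IDs are invented for illustration.

```python
def invert_cmap(cmap):
    """Turn {code point: glyph ID} into {glyph ID: character}.

    If several code points share a glyph, the first one seen is kept.
    """
    reverse = {}
    for code_point, glyph_id in cmap.items():
        reverse.setdefault(glyph_id, chr(code_point))
    return reverse


def decode_glyph_ids(glyph_ids, reverse, unknown="\ufffd"):
    """Decode a sequence of glyph IDs, marking unmapped IDs with `unknown`."""
    return "".join(reverse.get(g, unknown) for g in glyph_ids)


# Hypothetical mapping for one embedded font (code point -> glyph ID).
EXAMPLE_CMAP = {0x20: 120, 0x2D: 162, 0x31: 20, 0x32: 21}

reverse = invert_cmap(EXAMPLE_CMAP)
decoded = decode_glyph_ids([20, 162, 21, 120, 20], reverse)
```

Each embedded font may have its own mapping, so a table built for one font should not be assumed to apply to another.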

Thanks!
