Rank: Newbie Groups: Member
Joined: 11/10/2011 Posts: 1
I'm testing the PDF .NET product for HTML to PDF conversion. In terms of layout, the resultant PDFs look fine, but I need to post-process the PDFs (text extraction for Postal Processing).
I'm concerned that the fonts are CID with Identity-H encoding.
Also, the "strings" are fragmented. If I select an address block, and copy/paste it into Notepad, for example, instead of discrete lines I get individual lines per word.
Can font type and encoding be specified / controlled? Can tolerances for whitespace, vertical / horizontal offsets, font sizes, etc. be adjusted to "keep strings together"?
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,239
Hi,
Sorry about the delay. We do use Identity-H encoding together with a ToUnicode map. The values in the content stream are CID values; you can use the ToUnicode map to translate each CID value to its Unicode value. You cannot control the font type or encoding.
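To illustrate the CID-to-Unicode lookup described above, here is a minimal Python sketch. It is not the product's code: `SAMPLE_CMAP`, `parse_tounicode` and `decode_hex_string` are hypothetical names made up for this example, and the CMap text is invented. It assumes 2-byte codes (as with Identity-H), handles only `bfchar` entries and the simple `<lo> <hi> <start>` form of `bfrange`, and ignores the array form of `bfrange`.

```python
import re

# Invented ToUnicode CMap fragment, for illustration only.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0044> <0046> <0061>
endbfrange
endcmap
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Build a {CID: unicode string} dict from bfchar and simple bfrange entries."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE encoded destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange (simple form): consecutive codes map to consecutive code points.
    # Assumption: the destination is a single BMP code point.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, cid in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[cid] = chr(start + offset)
    return mapping


def decode_hex_string(hex_string, mapping):
    """Decode a content-stream hex string of 2-byte CIDs into text."""
    raw = bytes.fromhex(hex_string)
    cids = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]
    # Unmapped CIDs become U+FFFD so gaps stay visible.
    return "".join(mapping.get(cid, "\ufffd") for cid in cids)


cmap = parse_tounicode(SAMPLE_CMAP)
text = decode_hex_string("00240003004400450046", cmap)
```

In practice a PDF text-extraction library performs this lookup for you; the sketch only shows why Identity-H fonts remain extractable as long as the ToUnicode map is present.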
As to the string fragmentation issue, we do not have a lot of control over it. Usually, if your text uses the same font within the same element, it will come out "together" from Adobe Reader. However, there is no firm rule on how Adobe Reader decides which piece of text comes after which. What we do is render the text at the right location; how to connect text segments at different locations into a single sentence is a totally different matter, and sometimes it is not possible. For example, if your HTML has three different words "this" "is" "great" arranged artistically in different fonts at different locations, as they do in print ads, then all we do is render three text segments precisely where they are supposed to be. There is no way for us or Adobe Reader to figure out that this is actually a single sentence.
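Since the fragments are rendered at the right locations, one way to rebuild address lines on the post-processing side is to group them by position yourself. The following Python sketch shows the idea under stated assumptions: the `Fragment` type, the `group_into_lines` function and the tolerance values are hypothetical, and it assumes your extraction tool can report each fragment's position and width in PDF units with the origin at the bottom-left.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float      # left edge of the fragment
    y: float      # baseline position (larger y is higher on the page)
    width: float  # rendered width of the fragment
    text: str


def group_into_lines(fragments, y_tol=2.0, gap_tol=1.0):
    """Group fragments into lines by baseline, then join them left to right.

    y_tol:   maximum baseline difference for two fragments to share a line.
    gap_tol: horizontal gap above which a space is inserted between fragments.
    """
    lines = []
    # Walk from the top of the page down, starting a new line whenever the
    # baseline moves further than y_tol from the current line's first fragment.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            parts.append((" " if gap > gap_tol else "") + cur.text)
        result.append("".join(parts))
    return result


# Invented sample data: one word per fragment, as in the copy/paste symptom.
fragments = [
    Fragment(100.0, 700.5, 30.0, "Smith"),
    Fragment(72.0, 700.0, 24.0, "John"),
    Fragment(72.0, 686.0, 18.0, "123"),
    Fragment(94.0, 686.0, 24.0, "Main"),
    Fragment(122.0, 686.2, 10.0, "St"),
]
lines = group_into_lines(fragments)
```

The tolerances would need tuning to the font sizes in your documents, and layouts with multiple columns or rotated text need more than this simple baseline grouping.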
Hope this helps. Please feel free to let us know if you have any more questions.
Thanks!