|
Rank: Newbie Groups: Member
Joined: 3/19/2018 Posts: 6
|
I'm converting HTML to PDF with Eo.Pdf, and subsequently need to extract the text from the generated PDF. I'm finding that certain pairs of characters (such as 'ft', 'ff', 'tt') are being converted to ligatures, as would be expected in a modern HTML renderer, but I'm then struggling to get those characters back from the PDF, even when using techniques which claim to be 'ligature-aware'.
The easiest way to solve this would probably be to prevent the PDF from containing ligatures in the first place. I'm guessing this would be possible by injecting a piece of CSS (setting the 'font-variant-ligatures' CSS property to 'none' would hopefully do this), but I'm unsure of the easiest way to implement that.
Even so, this probably wouldn't be a universal solution (since it's possible that CSS in the converted file would override the setting) so understanding why the ligatures in the PDF are not being written in a way that's accessible to PDF text extractors would also be useful. Alternatively a way to globally disable ligature-creation would be great.
Thanks in advance for any help you can provide!
|
|
Rank: Newbie Groups: Member
Joined: 3/19/2018 Posts: 6
|
A bit more information on this after inspecting the PDF: when looking at the glyph mappings in the PDF for a document containing the ligatures 'ff', 'ft' and 'tt', only the fl ligature has a ToUnicode mapping. Both of the other ligatures in the font are just mapped to U+0000.
This is a serious issue (which for example means that the PDF is not PDF/A compliant) because while the document displays correctly, it is impossible for any screen reader or other text extraction software to ever correctly retrieve the content from the PDF in this case.
(I don't believe I can add attachments here, so I'll email a sample document to support referencing this topic).
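The U+0000 mappings described above can be spotted after extraction with a small check over the extracted text. The following is a minimal JavaScript sketch, not part of EO.Pdf: `findUnmappedGlyphs` is a hypothetical helper name, and it assumes the extractor emits U+0000 or U+FFFD for glyphs it cannot map (some extractors silently drop them instead).

```javascript
// Hypothetical helper: reports where extracted text contains characters that
// usually signal a glyph without a usable Unicode mapping.
// Assumption: the extractor emits U+0000 or U+FFFD for such glyphs.
function findUnmappedGlyphs(text) {
  var hits = [];
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code === 0x0000 || code === 0xFFFD) {
      hits.push({
        index: i,
        codePoint: 'U+' + code.toString(16).toUpperCase().padStart(4, '0')
      });
    }
  }
  return hits;
}

// "after" extracted with the 'ft' ligature mapped to U+0000
console.log(findUnmappedGlyphs('a\u0000er'));
```

An empty result does not prove the text is complete, but any hit marks a place where the original characters cannot be recovered from the PDF alone.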
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,258
|
Hi, thanks for bringing this to our attention. This does appear to be a problem. It is normal for a ligature to have no corresponding Unicode mapping, because it is just a glyph used for display and printing, not a character. The CSS declaration font-variant-ligatures:none does turn off ligatures and produces the right result. You can apply this style without modifying your source HTML file with the following code:
Code: C#
//Use a HtmlToPdfSession object because we need to have access
//to the underlying EO.WebBrowser.WebView object through the
//HtmlToPdfSession object's RunWebViewCallback function
using (HtmlToPdfSession session = HtmlToPdfSession.Create())
{
    session.RunWebViewCallback((webView, obj) =>
    {
        //Load the Url to be converted
        webView.LoadUrlAndWait(url);

        //Use JavaScript to create a style node that is equivalent of
        //the following CSS block
        //<style>
        //  * { font-variant-ligatures:none; }
        //</style>
        webView.EvalScript(@"
            (function()
            {
                var style = document.createElement('style');
                style.appendChild(document.createTextNode('* { font-variant-ligatures:none; }'));
                document.head.appendChild(style);
            })();
        ");
        return null;
    }, null);

    //Now perform the conversion
    session.RenderAsPDF(result_pdf_file);
}
As you noticed, this can still be overridden by the end user. We could change our code to turn ligatures off permanently, but that could affect other users who do want them on, so we would rather leave it as is. Thanks!
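If being overridden by page CSS is a concern, one possible hardening of the injected rule is to mark it `!important` and to cover pseudo-elements as well. This is only a sketch under assumptions: `buildNoLigatureCss` and `injectNoLigatureStyle` are hypothetical helper names, a page rule that is itself `!important` with higher specificity can still win, and a page that enables ligatures through `font-feature-settings` is not affected by this rule.

```javascript
// Hypothetical helper: builds a CSS rule that asks the renderer to disable
// ligatures, marked !important so ordinary page rules do not override it.
function buildNoLigatureCss() {
  return '*, *::before, *::after { font-variant-ligatures: none !important; }';
}

// Hypothetical helper: appends the rule as a new <style> node at the end of
// <head>. `doc` is the page's document object.
function injectNoLigatureStyle(doc) {
  var style = doc.createElement('style');
  style.appendChild(doc.createTextNode(buildNoLigatureCss()));
  doc.head.appendChild(style);
  return style;
}

// In the loaded page this would be called as: injectNoLigatureStyle(document);
```

Single quotes are used throughout so the script could be placed inside a C# verbatim string unchanged. Note that icon fonts which rely on ligatures to select their glyphs (Material Icons is one example) would stop rendering icons under a blanket rule like this.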
|
|
Rank: Newbie Groups: Member
Joined: 3/19/2018 Posts: 6
|
That works well, thanks! I'd been working along similar lines by trying to set a UserStyleSheet in the BrowserOptions, which might be a cleaner workaround, but I'm guessing there's no way to get to the WebView or Engine prior to creation? This is good enough for now, but I do think the lack of a Unicode mapping for ligatures is a serious flaw that needs looking at. There are Unicode characters for all common ligatures, and the PDF/A standard does require them to be present (see https://www.pdfa.org/improved-pdfa-1b/ and search for 'ligatures'). It's strange that the mapping is already there for one of the ligatures but not the others, so this must already be partially implemented.
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,258
|
Hi,
UserStyleSheet is available on BrowserOptions, which can be passed in when the WebView is created, but the PDF engine reuses WebView objects across conversions, which makes it difficult to expose the underlying UserStyleSheet to the HTML to PDF interface.
A ligature can have a Unicode mapping, but that won't really resolve the copy and paste issue. For example, when "f" and "i" are combined into a ligature, even if the ligature gets its own Unicode value, the copied text won't be the same as "f" followed by "i".
Thanks!
|
|
Rank: Newbie Groups: Member
Joined: 3/19/2018 Posts: 6
|
Understood. To be clear, I was using copy and paste as a simple technique to see what's happening. In practice, we're using tools that are aware of Unicode ligatures and will break them back into their component characters, but if no mapping is provided this becomes impossible.
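To illustrate the kind of post-processing meant here: when a ligature glyph is mapped to one of the Unicode presentation forms (U+FB00 to U+FB06), compatibility decomposition recovers the plain letters. Below is a minimal JavaScript sketch using the standard `String.prototype.normalize`; `expandLigatures` is a hypothetical name, not a function from any of the tools discussed in this thread.

```javascript
// Hypothetical helper: expands the Latin ligatures in Unicode's
// "Alphabetic Presentation Forms" block (U+FB00..U+FB06) into letters.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, function (ligature) {
    // NFKD applies the compatibility decomposition, e.g. U+FB03 -> "ffi".
    return ligature.normalize('NFKD');
  });
}

console.log(expandLigatures('o\uFB03ce'));                  // "office"
console.log(expandLigatures('a\u0000er') === 'a\u0000er');  // true
```

The second call shows the problem case: a glyph mapped to U+0000 carries no information to expand, so the text comes back unchanged. The replacement is restricted to U+FB00..U+FB06 so that other compatibility characters in the text are left alone.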
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,258
|
This may not be possible. We rely on Chromium's rendering engine to render the PDF file, and if that does not write the Unicode value for a ligature, then it won't be there. Moreover, even if the code wants to write the Unicode value, it may not be able to do so, because that depends on the font. Specifically, inside the font there is a "glyph" to "Unicode" map. Typically this map contains all "normal" characters, but it may not contain values for ligatures. There are also cases where a font has its own fancy ligatures that are not widely recognized. In those cases you certainly won't have a Unicode value. So the safest way is probably to turn ligatures off.
|
|