
Finding text by inspecting the PDF no longer works in Blink/Chromium mode
Pierre
Posted: Sunday, June 19, 2016 11:20:35 AM
Rank: Newbie
Groups: Member

Joined: 8/28/2013
Posts: 3
Hi
I have to do the following:
- Generate a PDF from an HTML document
- Build a TOC with page numbers for certain chapters of the document
- Of course, I do not know the page numbers at the beginning

The strategy to achieve this is the following:
- First pass
  - Build a temporary TOC in the HTML with unique markers (like TOC_agsgej45) instead of the actual page numbers
  - Put unique markers (like PAG_agsgej45) in the HTML in place of the chapter titles
  - Convert the HTML to PDF with the method HtmlToPdf.ConvertUrl()
- Second pass
  - Analyze the generated PDF page by page to find the page on which each marker is located, using the Find methods (see below)
  - Replace the markers in the HTML (TOC page numbers and chapter text) with the appropriate information
  - Generate the PDF from the HTML again, now with the correct page numbers and text
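
The marker bookkeeping for the two passes above could be sketched roughly as follows. This is only a minimal illustration and does not call EO.Pdf at all: the class name TocMarkers, its members, and the two dictionaries are hypothetical, and the page numbers are assumed to come from whatever lookup works with the chosen engine. Keys are assumed to have a fixed length so that no marker is a prefix of another.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Net;

// Hypothetical helper for the two-pass TOC approach (not part of EO.Pdf).
public static class TocMarkers
{
    // Prefixes follow the TOC_/PAG_ convention used in the post.
    public const string TocPrefix = "TOC_";
    public const string PagePrefix = "PAG_";

    // First pass: create a fixed-length key shared by a TOC entry and its chapter heading.
    public static string NewKey()
    {
        return Guid.NewGuid().ToString("N").Substring(0, 8);
    }

    // Second pass: replace each TOC marker with its page number and
    // each chapter marker with the HTML-encoded chapter title.
    public static string Resolve(
        string html,
        IDictionary<string, int> pageByKey,
        IDictionary<string, string> titleByKey)
    {
        foreach (KeyValuePair<string, int> entry in pageByKey)
        {
            string page = entry.Value.ToString(CultureInfo.InvariantCulture);
            string title = WebUtility.HtmlEncode(titleByKey[entry.Key]);
            html = html.Replace(TocPrefix + entry.Key, page);
            html = html.Replace(PagePrefix + entry.Key, title);
        }
        return html;
    }
}
```

Calling TocMarkers.Resolve with the first-pass HTML and the collected page numbers would produce the HTML for the second conversion. Every key in pageByKey is expected to have a matching entry in titleByKey.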

Now here is my problem:
In older versions of EO.Pdf (and in Classic mode in the new version 16.0.91.0), my methods always find the searchText. Unfortunately, with the new Blink/Chromium engine this is no longer the case.
Do you have an explanation, and an alternative way to search for and find my searchText (= the markers) in the PDF?

Thanks for your answer

aDue IT GmbH
Pierre Honsberger
Zulligerstr. 48
CH-3063 Ittigen
www.adue-it.com - pgh@adue-it.com





The Find methods I developed in C#:

// Walks the pages in order and stops at the first page containing searchText.
private bool Find(PdfDocument doc, string searchText)
{
    int pageNumber = 0;
    bool found = false;
    foreach (PdfPage page in doc.Pages)
    {
        pageNumber++;
        found = Find(page.Contents, searchText);
        if (found)
        {
            break;
        }
    }

    // Do something with pageNumber (only meaningful when found is true)

    return found;
}


// Recursively searches a content collection for a text object equal to searchText.
private bool Find(PdfContentCollection coll, string searchText)
{
    foreach (var item in coll)
    {
        PdfTextContent ptc = item as PdfTextContent;
        if (ptc != null && ptc.Text == searchText)
        {
            return true;
        }
        // Descend into nested contents
        if (Find(item.Contents, searchText))
        {
            return true;
        }
    }
    return false;
}
eo_support
Posted: Sunday, June 19, 2016 11:39:37 PM
Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,229
Hi,

You cannot do it this way. The new engine is much faster and does not go through the PdfTextContent layer as the old version does. Instead, use the returned HtmlToPdfResult object to find the corresponding HtmlElement, then use the HtmlElement's properties to get the text and page number. It would look something like this:

Code: C#
//First pass conversion
HtmlToPdfResult result = HtmlToPdf.ConvertHtml(.....);

//Get your "marker" element
HtmlElement e = result.HtmlDocument.GetElementById("your_marker_element_id");

//Get the marker element's text and page number
string text = e.InnerText;
int pageNumber = e.Location.Page.Index;

There are other variants of the GetElementByXXXX function that you can use. Take a look at the reference for the HtmlDocument class to see what is available there.

Hope this helps.

Thanks!
Pierre
Posted: Monday, June 20, 2016 2:39:49 AM
Hi

Thanks a lot, I was able to change my method as you suggested. But does this mean that iterating through the PdfContentCollection is no longer possible with the new engine? If so, it would be nice to reflect this new behaviour in the documentation.

Kind Regards

eo_support
Posted: Monday, June 20, 2016 8:59:08 AM
Hi,

It is still possible to iterate through PdfContentCollection. However, you cannot rely on what you find in this collection. Depending on exactly how the PDF contents are created, sometimes we create a PdfTextContent object and sometimes a PdfRawContent object. Even in the old version we did not guarantee or document that you would always get PdfTextContent. The fact that you got a PdfTextContent in the old version is just observed behavior, not explicitly defined behavior. So in general you should use PdfTextContent only to write output to a PDF file, not to examine output created by other code or programs. The method we suggested in our previous reply is the officially supported method for your case.

Thanks!
Pierre
Posted: Monday, June 20, 2016 9:09:20 AM
Hi,
Thanks a lot. Now everything is perfectly clear to me.
Regards
Pierre
eo_support
Posted: Monday, June 20, 2016 9:17:55 AM
You are very welcome. Please feel free to let us know if there is anything else.

Thanks!

