How to get all the elements from a PDF page or From HtmlToPdfResult

dnayak

Posted: Friday, December 8, 2023 12:30:32 PM

Rank: Newbie
Groups: Member

Joined: 11/13/2023
Posts: 3

I have a table with n records. I am paginating the table HTML with Essential Objects by using HtmlToPdf.ConvertHtml(). Is there a way to find the number of records on each PDF page from the HtmlToPdfResult object?

eo_support

Posted: Saturday, December 9, 2023 4:37:16 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Hi,

The HTML to PDF converter does not understand your records. It only understands standard HTML elements/nodes. So you will first need to identify the HTML element for each record with some kind of mechanism. For example, if you render each record as a table row and assign a unique CSS class to the row (for example, "record_row"), then you can use HtmlToPdfResult.HtmlDocument.GetElementsByClassName to return all the tr HtmlElement objects for the records:

https://www.essentialobjects.com/doc/eo.pdf.htmldocument.getelementsbyclassname_overload_1.html

Once you have each record's HtmlElement, you can loop through its VisibleRects property:

https://www.essentialobjects.com/doc/eo.pdf.htmlelement.visiblerects.html

Then check each PdfPageRectangle's Page property:

https://www.essentialobjects.com/doc/eo.pdf.pdfpagelocation.page.html

This will help you establish the PdfPage(s) associated with each HtmlElement. Note that one element can be associated with multiple pages if the contents of one record span across multiple pages.

Thanks!

Niranjan Singh

Posted: Wednesday, January 10, 2024 11:36:47 AM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

Hi,

You suggested that we use the approach below.

Quote:

var result = HtmlToPdf.ConvertHtml("<div style=\"background-color:#266a9e\" class=\"general_titleHeader\">Corporate Actions - Pending Name and ID Changes</div>", new PdfDocument());
var body = result.HtmlDocument.Body;
You do not need to use HtmlAgilityPack to get node HTML either. For example, you can use:
var bodyInnerText = body.InnerText;

To get the body element's innerText. You can loop through body's ChildElements to get all table row/cells and their text.
You can also get any element's page index by:
var pageIndex = e.Location.Page.Index;

We used this approach and created the method below to turn the pages back into HTML, but we found that HtmlElement does not expose any attribute information, e.g., style. How can we retrieve the attributes from the HtmlElement or HtmlNode objects under HtmlToPdfResult.HtmlDocument.Body?

Code: C#

// Rebuilds HTML text from an EO.Pdf node tree (tag name and class only).
string ConvertHtmlNodeToHtmlText(EO.Pdf.HtmlNode node)
{
    StringBuilder htmlBuilder = new StringBuilder();
    HtmlElement htmlElement = node as HtmlElement;
    if (htmlElement != null)
    {
        // Opening tag; style is left empty because it could not be read from HtmlElement
        htmlBuilder.Append($"<{htmlElement.TagName} class=\"{htmlElement.ClassName}\"  style=\"\">");

        // Text nodes are emitted before child elements, so mixed content is not kept in document order
        foreach (EO.Pdf.HtmlNode childNode in htmlElement.ChildNodes)
        {
            if (childNode is EO.Pdf.HtmlTextNode textNode)
            {
                htmlBuilder.Append(textNode.Text);
            }
        }
        foreach (HtmlElement element in htmlElement.ChildElements)
        {
            htmlBuilder.Append(ConvertHtmlNodeToHtmlText(element));
        }

        // Closing tag
        htmlBuilder.Append($"</{htmlElement.TagName}>");
    }
    return htmlBuilder.ToString();
}

// Fetch the HTML of the entire page

Code: C#

string htmlText = ConvertHtmlNodeToHtmlText(body);

We want to reconstruct the pages, including the rows on a specific page, but there is a problem reading the entire HTML back.

Code: C#

var rows = element.ChildElements.FirstOrDefault(c => c.TagName == "TBODY")?.ChildElements;
rows?.GroupBy(r => r.Location.PageIndex).ToList().ForEach(g =>
{
    pageBytes.Add(GetPageHTMLBytes(elementBeforeTable, headerHTML, g.ToArray()));
});

Could you clarify how we can retrieve every HTML detail from the HtmlDocument?

eo_support

Posted: Wednesday, January 10, 2024 12:09:02 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Hi,

If HtmlElement/HtmlNode does not provide you with enough information, you can always use HtmlToPdfSession.ExecScript to run JavaScript code to get whatever you need:

https://www.essentialobjects.com/doc/eo.pdf.htmltopdfsession.execscript_overloads.html

The code will be something like this:

Code: C#

using (HtmlToPdfSession session = HtmlToPdfSession.Create())
{
    //Load the page
    session.LoadHtml(your_page_html);
    
    //Run JavaScript code with 5 seconds timeout
    string bodyHTML = (string)session.ExecScript("document.body.outerHTML", 5000); 
  
    //Render the page as PDF
    session.RenderAsPDF(pdf_file_name);
}

You will need to replace the JavaScript code passed to session.ExecScript with whatever JavaScript code collects the information you need. Make sure it returns a simple value such as a string. You cannot easily pass complex values back to C# from your JavaScript code.

Thanks

Niranjan Singh

Posted: Wednesday, January 10, 2024 12:18:54 PM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

If that's the case, we need a different strategy rather than proceeding with this slow approach. Fetching every HTML element's style or other custom attributes this way will slow the process down. Instead, we would have to use HtmlAgilityPack again to retrieve the HTML, looking elements up with the information obtained from HtmlToPdfResult.HtmlDocument.

eo_support

Posted: Wednesday, January 10, 2024 12:42:28 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Not necessarily. The key is to use a single ExecScript call to get everything you need. For example, you can collect everything you need in a single piece of JavaScript code and then return the result either as an array or as a single JSON string (and then decode it on the C# side). This should be more efficient than using a separate library to parse the HTML again, since it merely reads information that has already been parsed by the browser engine. The other drawback of using another library to parse the HTML is that it can't handle dynamic content created or modified by script. Only a full-blown browser engine can handle that.
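
As a rough illustration of the single-call idea above, the JavaScript passed to ExecScript could gather the attributes for all rows at once and hand back one JSON string. This is only a sketch: the function name collectElementInfo, the "tr.dataRow" selector, and the shape of the returned objects are assumptions made for illustration and are not part of EO.Pdf.

```javascript
// Sketch (hypothetical helper, not an EO.Pdf API): collect tag name, id,
// class and inline style for every element matching `selector` in one pass.
// `doc` is any object exposing querySelectorAll, normally the page's `document`.
function collectElementInfo(doc, selector) {
  const elements = Array.from(doc.querySelectorAll(selector));
  const info = elements.map((el, index) => ({
    index: index,                         // position among the matched elements
    tag: el.tagName,                      // e.g. "TR"
    id: el.id || "",                      // usable as a lookup key on the C# side
    className: el.className || "",
    style: el.getAttribute("style") || "" // raw inline style text
  }));
  // A single string is the kind of simple value that is easy to pass back.
  return JSON.stringify(info);
}
```

The script string would define this function and end with the expression collectElementInfo(document, "tr.dataRow"), assuming ExecScript returns the value of the final expression in the same way it does for document.body.outerHTML in the earlier example. The returned string can then be decoded with any JSON parser on the C# side and matched to the HtmlElement objects by id.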

Niranjan Singh

Posted: Thursday, February 1, 2024 5:29:22 AM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

Thanks for your response. I used HTMLAgilityPack to retrieve the HTML rows by their ID and then compared them with the PDFDocument results.

Code: C#

//Get Pages rows from the converted PDF Html Document
List<EO.Pdf.HtmlElement> paginatedRows = GetPaginatedRowsFromHtmlDocument(pdfResult.HtmlDocument);

Code: C#

//Logic to create pages by including the Page header and table header on each page
var rowsList = rows.ToList(); // Rows fetched using the HTMLAgilityPack
var rowsLookup = from row in rowsList
                 join pRow in paginatedRows on row.Id equals pRow.ID into gj
                 from pagedRow in gj.DefaultIfEmpty()
                 select new
                 {
                     HtmlRow = row,
                     // -1 marks rows that have no match in the converted document
                     PageIndex = pagedRow?.Location?.PageIndex ?? -1
                 };
elementBeforeTable = GetNodesHtml(childNodes);
rowsLookup.GroupBy(r => r.PageIndex).ToList()
    .ForEach(g =>
    {
        pages.Add(GetPageHTML(elementBeforeTable, tableStylecss, headerHTML, g.Select(r => r.HtmlRow).ToList()));
    });

The table's width follows its container, and the cells adjust to the content in each td. I have to build the page structure on the UI by adding the calculated rows. We processed the entire table with the EO library to create the pages, but when those rows are placed on the UI, they do not fit the page size.

Code: HTML/ASPX

<style>
.page {
    margin-bottom: 5px;
    border: 1px solid #DEDEDE;
}

.landscape {
    height: 8.2in;
    font-size: 10px;
    position: relative;
    page-break-after: always;
    /* page-break-inside: avoid; */
    font-family: Verdana;
}
table {
    width: 99%;
    margin: auto;
    padding-top: 8px;
    border-collapse: collapse;
}
</style>
<div id="Report_1_Page_1" data-report-id="5604" data-report-name="Report Name" data-report-pageno="1" class="page landscape" style="height:8.2in !important; width:11in !important;">
  <div style="background-color:#4b8dbc" class="general_titleHeader">Report Name</div>
  <table style="border-collapse:collapse;border:none" class="report-table">
    <thead>
      <tr class="headerRow">
        <th style="text-align:right" class="center-align-report-builder report-builder-date-startFormat-MM/dd/yyyy-endFormat">Orig Acq Date</th>
        <th style="text-align:left" class="left-align-report-builder">Security</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Quantity</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Unit Cost</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Price</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Total Cost</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Market Value</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Unrealized Gain/Loss</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Percent of Assets</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Annual Income</th>
        <th style="text-align:right" class="right-align-report-builder decimal">Yield</th>
      </tr>
    </thead>
    <tbody>
      <tr style="background-color:lightgray;-stp-background-color-even:lightgray;-stp-background-color-odd:white" class="dataRow">
        <td style="text-align:right" class="center-align-report-builder report-builder-date-startFormat-MM/dd/yyyy-endFormat"></td>
        <td style="text-align:left" class="left-align-report-builder">US DOLLARS</td>
        <td style="text-align:right" class="right-align-report-builder decimal">330</td>
        <td style="text-align:right" class="right-align-report-builder decimal">1.00</td>
        <td style="text-align:right" class="right-align-report-builder decimal">1.00</td>
        <td style="text-align:right" class="right-align-report-builder decimal">329.88</td>
        <td style="text-align:right" class="right-align-report-builder decimal">329.88</td>
        <td style="text-align:right" class="right-align-report-builder decimal">0.00</td>
        <td style="text-align:right" class="right-align-report-builder decimal">0.0%</td>
        <td style="text-align:right" class="right-align-report-builder decimal">0.00</td>
        <td style="text-align:right" class="right-align-report-builder decimal">0.0</td>
      </tr>
    </tbody>
  </table>
  <div style="bottom: 3px; position: absolute; width: 100%;">    <div style="float: left; text-align: center; width: 97%;"></div>        <div style="float: right; font-weight: bold; width: 2%;">1</div>    <div style="clear: both;"></div></div>
</div>

The table layout differs on each page because it depends on the content of that page. For example, if we have 1000 rows in the table and repeat the table header on each page, the browser view differs from the result we get from the HTML to PDF conversion.

Please help us find a way to process the PDF generation process for each page individually instead of the entire table.

eo_support

Posted: Thursday, February 1, 2024 10:08:00 AM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Niranjan Singh wrote:

Please help us find a way to process the PDF generation process for each page individually instead of the entire table.

I am not sure if I understand your question correctly. What do you mean by "process the PDF generation process for each page"? What are you trying to achieve? PDF generation is always processed on the entire document level. Splitting the entire document into multiple pages is just one step of the process.

Also can you PM us your order number? We need to verify your order number in order to continue to provide tech support to you.

Niranjan Singh

Posted: Thursday, February 1, 2024 1:44:11 PM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

Thank you for explaining the internal process of creating pages in the library. We are currently facing a problem with synchronizing the pagination on the user interface and the exported PDF. We have created the page size with respect to the orientation and are trying to fit the table rows so that they will not span on the next page. We have implemented our own pagination logic and created multiple pages. After processing these pages one by one, we add them to the PDF document using the EO library.

We have considered a new approach after suggestions from your side where we process the PDF document first and calculate how many rows will fit on each page. However, as you mentioned, the EO.PDF library's pagination logic renders the HTML on a large page and then cuts it off into multiple pages. This causes a problem because when rendering pages on different pages, the columns adjust according to the content, which can lead to inconsistencies with previous pages. Additionally, the EO library returns more rows on the pages because it uses the first page table header as a reference to freeze the column width, but in HTML, this is not true. To know how many rows will fit on a page, we are processing rows by adding them to the page one by one and then trying to find whether these rows fit on a single page or not. It is too slow a process to do pagination, but it is working great. The UI and PDF are not in sync.

Below is the logic we implemented to make it work, but we are trying to do it in one go to improve performance. It should give you a hint of what we are trying to achieve: we create page HTML that includes the table header and rows, then check whether it fits on a single page.

Code: C#

//Call
List<string> dataRowsHtml = GetDataRowsHtml(startRowIndex, endRowIndex);
dataRowsHtml = CalculateRows(width, height, dataRowsHtml);


private List<string> CalculateRows(float width, float height, List<string> dataRowsHtml)
{

    var pdfResult = RenderPageHtmlUsingPdfSession(width, height, string.Join(" ", dataRowsHtml));
    if (pdfResult.PdfDocument.Pages.Count == 1)
    {
        dataRowsHtml = CalculateRowsByAdding(width, height, dataRowsHtml);
    }
    else
    {
        dataRowsHtml = CalculateRowsByRemoving(width, height, dataRowsHtml, pdfResult);
    }
    return dataRowsHtml;
}
List<string> CalculateRowsByRemoving(float width, float height, List<string> dataRowsHtml, HtmlToPdfResult previousPageResult)
{
    if (startRowIndex <= endRowIndex)
    {
        int decrement = 1;
        var paginatedRows = GetPaginatedRowsFromHtmlDocument(previousPageResult.HtmlDocument);
        if (paginatedRows != null)
        {
            //Use the rows on the first page only
            paginatedRows = paginatedRows.Where(r => r.Location.PageIndex == 0).ToList();

            if ((dataRowsHtml.Count - paginatedRows.Count) > 3)
            {
                decrement = (dataRowsHtml.Count - paginatedRows.Count) + 2;
            }
        }

        for (int i = 0; i < decrement; i++)
        {
            dataRowsHtml.RemoveAt(dataRowsHtml.Count - 1);
        }
        endRowIndex -= decrement;
        var pdfResult = RenderPageHtmlUsingPdfSession(width, height, string.Join(" ", dataRowsHtml));
        if (pdfResult.PdfDocument.Pages.Count > 1)
        {
            dataRowsHtml = CalculateRowsByRemoving(width, height, dataRowsHtml, pdfResult);
        }
    }
    return dataRowsHtml;
}
List<string> CalculateRowsByAdding(float width, float height, List<string> dataRowsHtml)
{
    int pagecount = 1;
    while (pagecount == 1 && endRowIndex <= rows.Count)
    {
        if (endRowIndex == rows.Count)
        {
            break;
        }
        endRowIndex++;
        dataRowsHtml.Add(rows[endRowIndex - 1].OuterHtml);
        var pdfResult = RenderPageHtmlUsingPdfSession(width, height, string.Join(" ", dataRowsHtml));
        pagecount = pdfResult.PdfDocument.Pages.Count;
    }
    //TECH-67539 - decrease the endRowIndex as the last record caused the html to span multiple pages.
    if (pagecount > 1)
    {
        endRowIndex--;
        dataRowsHtml.RemoveAt(dataRowsHtml.Count - 1);
    }
    return dataRowsHtml;

}
private int GetPageCount(float width, float height)
{
    string finalHTML = BuildPageHTML(width, height);
    var url = ConfigurationManager.AppSettings["PDF_baseurl"]?.ToString();
    Runtime.AddLicense(ExportConstants.EssentialObjectsLicense);
    HtmlToPdf.MaxConcurrentTaskCount = EOEngineConstants.MaxConcurrentTasks;
    HtmlToPdf.Options.PageSize = new SizeF(width, height);

    //Set page layout arguments
    HtmlToPdf.Options.OutputArea = new RectangleF(0, 0, width, height);
    HtmlToPdf.Options.BaseUrl = url;
    int pageCount = GetPageCountUsingHtmlToPdfSession(finalHTML, HtmlToPdf.Options);
    return pageCount;
}
private HtmlToPdfResult RenderPageHtmlUsingPdfSession(float width, float height, string dataRowsHtml)
{
    string tableHtml = BuildTableHTML(tableStylecss, headerHTML, dataRowsHtml, tableInlineStyles);
    string finalHTML = BuildPageHTML(width, height, tableHtml, Orientationcss);
    var url = ConfigurationManager.AppSettings["PDF_baseurl"]?.ToString();
    Runtime.AddLicense(ExportConstants.EssentialObjectsLicense);
    HtmlToPdf.MaxConcurrentTaskCount = EOEngineConstants.MaxConcurrentTasks;
    HtmlToPdf.Options.PageSize = new SizeF(width, height);

    //Set page layout arguments
    HtmlToPdf.Options.OutputArea = new RectangleF(0, 0, width, height);
    HtmlToPdf.Options.BaseUrl = url;
    HtmlToPdfResult pdfResult = ConvertHtmlUsingPdfSession(finalHTML, HtmlToPdf.Options);
    return pdfResult;
}
private string BuildPageHTML(float width, float height, string tableHtml, string pageOrientationCss)
{
    tableInlineStyles = $"style=\"max-height:{height}in !important;max-width:{width}in !important\"";
    StringBuilder tablebuilder = new StringBuilder();
    tablebuilder.Append("<html><body>");
    tablebuilder.Append($"<div {tableInlineStyles} class=\"page {pageOrientationCss}\">");
    tablebuilder.Append($"{tableHtml}");
    tablebuilder.Append("</div>");
    tablebuilder.Append("</body></html>");
    var finalHTML = _htmlStyleHandler.LoadCssContentAsStyleTag(tablebuilder.ToString(), cssContent);
    finalHTML = _urlToFilePathHtmlHandler.SearchAndReplace(finalHTML);
    return finalHTML;
}

If the library already implements anything that could reduce our effort in creating pages that do not clip rows and that hold a consistent, correct number of rows, please let us know.

eo_support

Posted: Thursday, February 1, 2024 1:57:41 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

To Niranjan Singh:

Can you provide your order number to us through private message?

Niranjan Singh

Posted: Tuesday, February 6, 2024 2:08:31 AM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

I am sharing the HTML that we created to display the pages on the UI and process each page HTML individually to export PDF so that the PDF output matches the preview in the application.

https://jsfiddle.net/niranjankala/25uy0h16/27/

We have developed a method for creating pages with table rows one by one and verifying if they fit by checking the pdfResult.PdfDocument.Pages.Count. However, we now need to use the below approach to convert the HTML to PDF, which will help us determine the number of rows that can fit on a page.
https://www.essentialobjects.com/doc/pdf/htmltopdf/paging.html#custom

It is very slow when there are many pages created with our logic due to the iteration for each page. Could you please help identify a quicker way to do this logic with the EO.PDF library?

eo_support

Posted: Tuesday, February 6, 2024 10:05:24 AM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Hi,

The key to improving performance is to do everything in one run. You keep saying that you need to determine the number of rows that can fit on a page, but you don't need to do that. You just need to use the built-in paging mechanism to let the HTML to PDF converter to do the paging for you.

For example, if you do NOT wish to split a single table row across multiple pages, you can simply add the following CSS rule to your page:

Code: CSS

td
{
    page-break-inside: avoid;
}

Then the HTML to PDF converter will automatically try to avoid splitting a single table row across multiple pages. Why do you need to do this yourself with your own code?

Thanks!

Niranjan Singh

Posted: Tuesday, February 6, 2024 2:39:16 PM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

Hi,
I added the above CSS class to the td element, processed the entire table, and used the row page number to determine the current page. However, the rows are still being split across multiple pages.

Niranjan Singh

Posted: Tuesday, February 6, 2024 2:39:50 PM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

I have shared a sample with you to reproduce the behaviour. Please review that and let us know if any further information is required.

Niranjan Singh

Posted: Wednesday, February 7, 2024 10:31:43 AM

Rank: Newbie
Groups: Member

Joined: 10/27/2023
Posts: 8

Quote:

You just need to use the built-in paging mechanism to let the HTML to PDF converter to do the paging for you.

We are creating a PDF document by merging multiple reports. To ensure accuracy, we are processing each report page by page instead of the entire HTML at once. By converting each page to a PDF, it matches the preview on the user interface. If we process the entire HTML, the PDF export will keep the column width the same on each page if there is no width specified. However, if the content width changes, the column width will also change on each page.

eo_support

Posted: Wednesday, February 7, 2024 11:59:08 AM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Niranjan Singh wrote:

I have shared a sample with you to reproduce the behaviour. Please review that and let us know if any further information is required.

We ran your sample project. Where do we need to look for rows being split across multiple pages? We checked the files in the PDF_Output directory and they look fine to us.

We did see a problem in your SharedReport.css for the following CSS rule:

Code: CSS

table.tablesorter thead tr .header { background-image: url(/Content/images/tablesorter/bg.gif') ....}

Note the Url for the background image is missing a starting '.

eo_support

Posted: Wednesday, February 7, 2024 12:06:44 PM

Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,098

Niranjan Singh wrote:

You are unnecessarily complicating things. The result PDF file is controlled by your input HTML. So if you get your input HTML right, then you will get the PDF output right. This is true regardless of whether you run the converter multiple times with separate HTML documents or a single time with merged HTML. You just somehow have this page-by-page mechanism fixed in your head and then you try to justify it by pointing out the problems you might run into. Those problems may be real, but they can be fixed by adjusting your input HTML, because your input HTML/style directly controls your output.

It comes down to one rule: If you want performance, then run the converter once. If you don't care about performance, then you can do it as many times as you want. You can't have both. So if you want performance, your focus should be on how to modify your HTML to avoid whatever problems you run into with merged HTML. We've been trying to tell you this over and over and yet you keep pushing your own way.
