
Reducing the size of a PDF file generated with HtmlToPdf.ConvertHtml
aurelsa
Posted: Wednesday, April 29, 2020 4:59:09 PM
Rank: Member
Groups: Member

Joined: 2/21/2020
Posts: 10
Hi

I use EO.PDF in my company to convert web pages to PDF files.
The files are quite big: for example, a one-page file that contains only text (no images) takes 88KB.
I tried to set the PdfDocument.EmbedFont option to false, but I don't see any difference in the file size.

Maybe I'm using it wrong. Here's the code I tested:

Code: C#
var pdfDocument = new PdfDocument();
// Ask EO.PDF not to embed font data
pdfDocument.EmbedFont = false;
var result = HtmlToPdf.ConvertHtml(html, pdfDocument, baseOptions);
// Save the converted document to memory and return the raw bytes
using (var stream = new MemoryStream())
{
    pdfDocument.Save(stream);
    return stream.ToArray();
}


Another example: using HtmlToPdf.ConvertHtml(...) with only the HTML text "<b>test</b>" generates a 32KB PDF file!
How can I optimize this?
What is the best practice?

Best regards
eo_support
Posted: Thursday, April 30, 2020 9:45:54 AM
Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,217
Hi,

This is indeed font data. Whenever a new font is introduced, you will see a noticeable increase in file size. However, as you add more text in the same font, the size barely grows. For example, you won't see much of a difference between rendering 'test' and rendering a long paragraph in the same font.

EmbedFont has only a limited impact, and only on a very small set of fonts. The very reason that PDF is so popular is that it has font data embedded, so it can be rendered correctly on a target system even if the font used does not exist there. Adobe, however, defined a very small set of fonts, the so-called "standard 14", that can be omitted from the PDF file. A font can be omitted from the file only if it is one of those 14.

In practice, a font rarely qualifies as one of the standard 14, because:

1. The "standard 14" set was defined decades ago, and modern systems have introduced many fonts that it does not cover;
2. While it's called "standard 14", it really contains only a few font families, because each style of a font is counted as a different font. For example, "Arial", "Arial Italic", "Arial Bold" and "Arial Bold Italic" are counted as 4 fonts;
3. Even for fonts that normally should fall into these 14, the font file can report a different name. For example, for the "Arial" font on Windows 10, the font file reports the "PostScript Font Name" as "ArialMT". This causes a mismatch, so the font is excluded from the standard 14 as well.

In all of these situations the font is not omitted, even when EmbedFont is set to false.
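
To make the name mismatch in point 3 concrete, here is a minimal sketch of an exact-name check against the 14 base font names listed in the PDF specification. It does not use EO.PDF, and EO.PDF's real matching logic is not shown in this thread; the `Standard14` class and `CanOmit` helper are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

static class Standard14
{
    // The 14 base font names from the PDF specification:
    // three families in four styles each, plus Symbol and ZapfDingbats.
    static readonly HashSet<string> Names = new HashSet<string>(StringComparer.Ordinal)
    {
        "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
        "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
        "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
        "Symbol", "ZapfDingbats"
    };

    // Hypothetical helper: a font qualifies only if the name reported
    // by the font file matches one of the 14 names exactly.
    public static bool CanOmit(string postScriptName) => Names.Contains(postScriptName);
}

static class Program
{
    static void Main()
    {
        // "ArialMT" and "TimesNewRomanPSMT" are the names reported in this thread.
        foreach (var name in new[] { "Helvetica", "ArialMT", "TimesNewRomanPSMT" })
            Console.WriteLine($"{name}: {Standard14.CanOmit(name)}");
    }
}
```

With a strict check like this, only "Helvetica" qualifies. The two names reported by Windows 10 do not, which is why their font data would stay in the file.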

Hope this helps.

Thanks!
aurelsa
Posted: Thursday, April 30, 2020 10:16:01 AM
Rank: Member
Groups: Member

Joined: 2/21/2020
Posts: 10
Okay, I understand the problem.
So there aren't other fonts in "standard 14" that could be used to reduce the size of my PDF files?
Times New Roman or Helvetica for example?

For example, if I write something like this:

Code: C#
var pdfDocument = new PdfDocument();
pdfDocument.EmbedFont = false;
var result = HtmlToPdf.ConvertHtml("<span style='font-family:Times New Roman,Helvetica'>test</span>", pdfDocument, baseOptions);
pdfDocument.Save(stream);


Best regards
eo_support
Posted: Thursday, April 30, 2020 11:38:56 AM
Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,217
Hi,

Those are about it. Unfortunately, in the current version of Windows 10 they all report different font names. For example, "Times New Roman" is reported as "TimesNewRomanPSMT" on our test system. This prevents it from being omitted.

Thanks
aurelsa
Posted: Thursday, April 30, 2020 12:49:22 PM
Rank: Member
Groups: Member

Joined: 2/21/2020
Posts: 10
Is it the same problem if we use EO.PDF on a Windows Server machine (2016/2019)?

Best regards
eo_support
Posted: Thursday, April 30, 2020 1:28:48 PM
Rank: Administration
Groups: Administration

Joined: 5/27/2007
Posts: 24,217
Hi,

It may or may not be. Windows used to report "Arial" as "Arial" and "Times New Roman" as "Times New Roman". Somewhere along the way this changed. Given the large number of Windows versions and distributions, and the numerous updates and patches for each of them, it is not possible for us to track down exactly when it changed on each platform. We could hard-code our library to treat "Arial" and "ArialMT" as the same font, but we do not know when MS is going to change it again, and then we would be in this situation all over again. As such, we decided not to do anything about it.
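
For illustration only, the hard-coded approach described above could look roughly like the sketch below. The `FontAliases` class and its `Normalize` helper are hypothetical and are not part of EO.PDF; the two alias entries are the names mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;

static class FontAliases
{
    // Illustrative alias table: reported PostScript name -> family name.
    // Every new name the OS starts reporting would need another entry here.
    static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ArialMT", "Arial" },
            { "TimesNewRomanPSMT", "Times New Roman" },
        };

    // Returns the known family name, or the input unchanged if no alias exists.
    public static string Normalize(string reportedName) =>
        Map.TryGetValue(reportedName, out var canonical) ? canonical : reportedName;
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(FontAliases.Normalize("ArialMT"));
        Console.WriteLine(FontAliases.Normalize("Verdana"));
    }
}
```

The weakness is the one already stated: the table only covers names known at the time of writing, so a later change in what the operating system reports would silently bypass it.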

In my opinion you are probably chasing a ghost. The "standard 14" font set was introduced decades ago, when the PDF standard was first introduced. The actual concept is probably even older (probably because PostScript printers had these fonts built in to reduce memory usage). As time goes by it carries less and less significance. An increase of 30K in the PDF file size may look significant to you, but to many other users it is quite negligible, especially when the PDF file contains modern fonts and images or charts.

Thanks

