Rank: Newbie Groups: Member
Joined: 2/7/2012 Posts: 2
|
Hi, we have a commercial license and have been using EO for HTML to PDF conversion. Now we're trying to figure out how to read the text from a PDF. We can get the PdfDocument and its pages, but all we see is PdfRawContent, and there seems to be no way to extract the text from it. This PDF is not an image, and iTextSharp and other libraries can extract the text with no problem. We don't want to use two different libraries for this. Can you give us some advice on how this can be achieved?
Thank you
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,196
|
Hi,
Unfortunately, this is not supported in the current version. Text can be extracted from some PDF files but not from others, because the PDF format is much more about "drawing the output" than about information exchange. The format goes to great lengths to ensure output quality, but a file may contain only the information needed to "draw" each letter and lack any information about which character is being drawn. Furthermore, even a file that does contain character codes may not carry enough information to piece the different text blocks together. For example, if the words "This", "is", "a", "PDF", and "file" are artistically arranged on a page in different fonts at different locations, a human can instantly piece them together into a sentence, but a machine would not know which word comes first and which comes next. For these reasons we do not have such a feature, so you may still want to rely on iTextSharp for this purpose.
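To make the ordering problem concrete, here is a rough sketch in Python. It is not EO.Pdf or iTextSharp code: the TextBlock type, the naive_reading_order function, and all coordinates are made up for illustration. It assumes each text block is already known together with its position, then applies a common heuristic (top to bottom, then left to right).

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    # Hypothetical record for one piece of positioned text on a page.
    text: str
    x: float  # left edge, in points
    y: float  # baseline, measured from the top of the page


def naive_reading_order(blocks, line_tolerance=2.0):
    """Group blocks into lines by similar y, then read each line left to right."""
    lines = []
    for block in sorted(blocks, key=lambda b: b.y):
        # A block joins the current line if its baseline is close enough
        # to the baseline of the first block on that line.
        if lines and abs(lines[-1][0].y - block.y) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    return " ".join(
        b.text for line in lines for b in sorted(line, key=lambda b: b.x)
    )


# Words on a single baseline, stored in arbitrary order: the heuristic works.
straight = [
    TextBlock("is", 60, 100),
    TextBlock("This", 20, 100),
    TextBlock("a", 80, 100),
    TextBlock("PDF", 95, 100),
    TextBlock("file", 130, 100),
]

# The same words scattered "artistically" across the page: it does not.
scattered = [
    TextBlock("This", 300, 400),
    TextBlock("is", 50, 120),
    TextBlock("a", 200, 50),
    TextBlock("PDF", 120, 300),
    TextBlock("file", 20, 200),
]

print(naive_reading_order(straight))   # This is a PDF file
print(naive_reading_order(scattered))  # a is file PDF This
```

With a conventional layout the sentence comes back intact, but with the scattered layout the positions alone do not say which word the author meant to come first, so the heuristic returns the words in the wrong order.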
Thanks!
|