Text Extraction

Questions, comments and suggestions concerning VintaSoft PDF .NET Plug-in.

Moderator: Alex

kwaltman
Posts: 30
Joined: Fri Aug 07, 2009 9:48 pm

Text Extraction

Post by kwaltman »

Is there a way to extract the text using a text region, but supply a mask (rectangle) to ignore?

For example, if I were to extract all the text on a page but specify a region in the middle of the page to ignore, can that be done?

Basically, I want to grab all text in a region or subregion that does not fall inside the specified rectangle.
Alex
Site Admin
Posts: 2305
Joined: Thu Jul 10, 2008 2:21 pm

Re: Text Extraction

Post by Alex »

Hello Kevin,

To solve this task, you need to divide the page into subregions, extract the text from each subregion, and combine the extracted text.
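
A minimal sketch of the splitting step, written in Python for brevity. It is plain geometry and does not use the VintaSoft API; the `Rect` type and `subtract_rect` function are hypothetical names introduced here for illustration. It assumes a top-left origin with y growing downward, so coordinates may need flipping for PDF page space. Each returned rectangle would then be passed to whatever subregion extraction call you already use.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle: (left, top) inclusive, (right, bottom) exclusive."""
    left: float
    top: float
    right: float
    bottom: float


def subtract_rect(region: Rect, mask: Rect) -> List[Rect]:
    """Return up to four non-overlapping rectangles covering region minus mask."""
    # Clip the mask to the region so the pieces never extend outside it.
    ml = max(region.left, mask.left)
    mt = max(region.top, mask.top)
    mr = min(region.right, mask.right)
    mb = min(region.bottom, mask.bottom)
    if ml >= mr or mt >= mb:
        return [region]  # the mask does not overlap the region

    parts = []
    if region.top < mt:      # full-width band above the mask
        parts.append(Rect(region.left, region.top, region.right, mt))
    if region.left < ml:     # strip to the left of the mask
        parts.append(Rect(region.left, mt, ml, mb))
    if mr < region.right:    # strip to the right of the mask
        parts.append(Rect(mr, mt, region.right, mb))
    if mb < region.bottom:   # full-width band below the mask
        parts.append(Rect(region.left, mb, region.right, region.bottom))
    return parts


page = Rect(0, 0, 612, 792)        # example page size in points
mask = Rect(200, 300, 400, 500)    # area to ignore
subregions = subtract_rect(page, mask)
```

The pieces are ordered top band, left strip, right strip, bottom band, so concatenating the extracted text in that order roughly follows reading order, although lines that cross the mask are split between the left and right strips.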

Best regards, Alexander
kwaltman
Posts: 30
Joined: Fri Aug 07, 2009 9:48 pm

Re: Text Extraction

Post by kwaltman »

When getting the formatted text from a text subregion, is there a way to get it untrimmed, or fully padded to the specified rectangle of the subregion?

I need to calculate the line spacing and character spacing for the formatted text returned from the text subregion specified by a rectangle. The problem is that the returned formatted text has its trailing spaces and trailing blank lines trimmed. Because of that, I cannot calculate the average character width from the line length versus the subregion width. The same goes for calculating the average line height from the number of lines that fit in the text region.
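
To make the calculation concrete, here is a small Python sketch of the arithmetic described above. It assumes the formatted text lays characters out on a fixed grid (one column per character, one row per line); `estimate_cell_size` is a hypothetical helper, not a VintaSoft property. The example shows how trimming skews both estimates.

```python
def estimate_cell_size(region_width, region_height, lines):
    """Estimate (char_width, line_height), assuming the text fills the region."""
    cols = max((len(line) for line in lines), default=0)
    rows = len(lines)
    if cols == 0 or rows == 0:
        raise ValueError("no text to measure")
    return region_width / cols, region_height / rows


# Text padded to the full region: 14 columns by 3 rows.
padded = [s.ljust(14) for s in ["Name Qty", "Widget 3", ""]]

# The same text with trailing spaces and trailing blank lines trimmed.
trimmed = [line.rstrip() for line in padded]
while trimmed and not trimmed[-1]:
    trimmed.pop()

padded_size = estimate_cell_size(140, 36, padded)
trimmed_size = estimate_cell_size(140, 36, trimmed)
```

With the padded text, the estimate is 10 units per character and 12 units per line. With the trimmed text, the same region appears to hold only 8 columns and 2 rows, so both values come out too large.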
Alex
Site Admin
Posts: 2305
Joined: Thu Jul 10, 2008 2:21 pm

Re: Text Extraction

Post by Alex »

Hello Kevin,
"When getting the formatted text from a text subregion, is there a way to get it untrimmed, or fully padded to the specified rectangle of the subregion?"
Yes, you can get formatted text using the TextRegion.FormattedTextContent property.

Best regards, Alexander
kwaltman
Posts: 30
Joined: Fri Aug 07, 2009 9:48 pm

Re: Text Extraction

Post by kwaltman »

Alex,

I am using the formatted text property. The problem is that the formatted text property trims trailing spaces and trailing blank lines. This makes it almost impossible to calculate the total number of characters that can fit in the width of the selected subregion, or the total number of lines that can fit in its height.

Or if there were properties that could be used to get those calculations, then that would be helpful.


But ideally, what I am trying to do is mask out an area on the page before text extraction.

Cutting the page up into multiple regions and then putting it all back together doesn't work well, because adjoining regions can return the same data depending on where the boundaries lie across a line or character area. And once there are multiple masks on the same page, cutting up the regions and splicing them back together gets exponentially more difficult.

Is there a way to remove all the text data in a certain text subregion first, so that when a larger overlapping subregion attempts to get data, it won't find any text in the area that was removed?

Or if there was a way to specify a "Mask" or "Ignore" rectangle when getting a subregion, then that would be the easiest method.
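
To illustrate the kind of "ignore rectangle" behavior I mean, here is a rough Python sketch. It assumes per-character bounding boxes are available, which I have not confirmed for this library; `Box`, `Glyph`, and `unmasked` are hypothetical names, not VintaSoft types. Each character is tested by its center point, so it belongs to exactly one side of any boundary and cannot be returned twice.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned rectangle: (left, top) inclusive, (right, bottom) exclusive."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


class Glyph(NamedTuple):
    """One extracted character with its bounding box (assumed to be available)."""
    text: str
    box: Box


def unmasked(glyphs: Sequence[Glyph], region: Box, masks: Sequence[Box]) -> List[Glyph]:
    """Keep glyphs whose center lies inside region and outside every mask."""
    kept = []
    for g in glyphs:
        cx = (g.box.left + g.box.right) / 2
        cy = (g.box.top + g.box.bottom) / 2
        if region.contains(cx, cy) and not any(m.contains(cx, cy) for m in masks):
            kept.append(g)
    return kept


glyphs = [
    Glyph("A", Box(0, 0, 10, 10)),
    Glyph("B", Box(10, 0, 20, 10)),
    Glyph("C", Box(20, 0, 30, 10)),
]
region = Box(0, 0, 30, 10)
masks = [Box(8, 0, 22, 10)]  # partially overlaps "A" and "C", fully covers "B"
result = "".join(g.text for g in unmasked(glyphs, region, masks))
```

Here only "B" has its center inside the mask, so the result is "AC". Any number of masks can be passed without splitting the region.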
Alex
Site Admin
Posts: 2305
Joined: Thu Jul 10, 2008 2:21 pm

Re: Text Extraction

Post by Alex »

Hello Kevin,
"I am using the formatted text property. The problem is that the formatted text property trims trailing spaces and trailing blank lines."
Please send us (to support@vintasoft.com) a small working project that demonstrates your problem. We need to reproduce your problem.

"Is there a way to remove all the text data in a certain text subregion first, so that when a larger overlapping subregion attempts to get data, it won't find any text in the area that was removed?"
Version 8.3 will have functionality similar to the Redact tool in Adobe Acrobat, i.e. you will be able to delete text or images from a PDF page.

Best regards, Alexander
kwaltman
Posts: 30
Joined: Fri Aug 07, 2009 9:48 pm

Re: Text Extraction

Post by kwaltman »

Perfect! What is the expected time frame for the release of 8.3?
Alex
Site Admin
Posts: 2305
Joined: Thu Jul 10, 2008 2:21 pm

Re: Text Extraction

Post by Alex »

I hope version 8.3 will be available at the end of Summer.

Best regards, Alexander