Hi,
We are trying to extract a table from image-based PDF files using Pega-RPA.
We are able to convert an image-based PDF to a readable PDF using document OCR, but after that we cannot highlight or extract the table element (after the conversion, the whole PDF is treated as plain text).
I have attached a screenshot of a reference PDF.
Is there any way to extract tables from image-based PDFs?
@ManikandanT17003272 We explicitly don’t support gathering tables from OCR’d PDFs. Additionally, table detection relies on each cell being enclosed by intersecting lines, and neither of these tables has intersecting lines. You can, however, read them with the PDF connector by using pages, lines, words, and segments. You will first need to use the ProcessToPdfFile method on the original file to convert it into a PDF that the connector can read. You can then open that file in the PDF connector and iterate through each page and then each line on each page. You could even look at the words on each line and parse the table that way. This works because both tables have text before and after them that helps you identify when a particular line is inside the table.
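To illustrate the parsing idea, here is a rough sketch in plain Python. It is not Pega code and does not use the PDF connector; it assumes you have already collected the text of each line into a list of strings. The function name `extract_table_rows`, the marker strings, and the sample lines are all made up for the example, and splitting cells on runs of two or more spaces is an assumption that may not hold for your OCR output (in the connector you would more likely group the words on each line by position).

```python
import re


def extract_table_rows(lines, start_marker, end_marker):
    """Return the rows found between two marker lines as lists of cells.

    Hypothetical helper: 'lines' is a list of plain strings, one per PDF line.
    """
    rows = []
    inside = False
    for line in lines:
        text = line.strip()
        if not inside:
            # Skip everything until the text that precedes the table.
            if start_marker in text:
                inside = True
            continue
        if end_marker in text:
            # The text that follows the table ends the scan.
            break
        if text:
            # Assumption: cells are separated by two or more spaces.
            rows.append(re.split(r"\s{2,}", text))
    return rows


# Made-up sample data standing in for the lines read from the PDF.
sample_lines = [
    "Invoice 1001",
    "Order details",
    "Item        Qty   Price",
    "Widget A    2     10.00",
    "Widget B    1     4.50",
    "Total due: 24.50",
]

rows = extract_table_rows(sample_lines, "Order details", "Total due")
for row in rows:
    print(row)
```

The same start/stop logic should carry over to an automation that loops over pages and lines; only the way a line is split into cells would change.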
One trick to help you with this is the Developer Tools in the PDF Viewer. You can use them to highlight the various lines, words, and segments in the PDF to determine how you might navigate it. Here is a quick example with the item you provided. It works best with number 4, but 5 works as well, although for some reason I don’t see all of the first line. I am investigating that.
OCR_PDF_Table.zip (45.2 KB)