How to Extract Table from Image based PDF files

ManikandanT17003272 · January 3, 2024, 1:26pm

Hi,

We are trying to extract tables from image-based PDF files using Pega RPA.

We are able to convert an image-based PDF to a readable PDF using document OCR, but after that we cannot highlight or extract the table elements (once the image PDF is converted to a readable PDF, the whole PDF becomes plain text).

I have attached a screenshot of the reference PDF.

Is there any way to extract tables from image-based PDFs?

ThomasSasnett · January 3, 2024, 6:22pm

@ManikandanT17003272 We explicitly don’t support gathering tables from OCR’d PDFs. Additionally, table detection relies on each cell being contained within intersecting lines, and neither of these tables has intersecting lines. You can, however, read them with the PDF connector by using pages, lines, words, and segments. First use the ProcessToPdfFile method on the original file to convert it into a PDF that the connector can read. You can then open this file in the PDF connector and iterate through each page and then each line on each page. You could even look at the words on each line and parse the table that way. This works because both tables have text before and after them that helps you identify when a particular line is within the table.
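To illustrate only the parsing logic described above, here is a minimal Python sketch. It is not Pega RPA code and does not use the PDF connector. It assumes the text of each line has already been read into a list of strings. The function names, the marker strings, the sample lines, and the rule that columns are separated by two or more spaces are all hypothetical and would need adjusting to the real document.

```python
import re


def extract_table_rows(lines, start_marker, end_marker):
    """Return the non-blank lines found after the first line containing
    start_marker and before the next line containing end_marker."""
    rows = []
    inside = False
    for line in lines:
        if not inside:
            # Keep skipping until the text that precedes the table is seen.
            if start_marker in line:
                inside = True
            continue
        if end_marker in line:
            # The text that follows the table marks its end.
            break
        if line.strip():
            rows.append(line)
    return rows


def split_cells(row):
    # Assumption: columns are separated by runs of two or more spaces.
    return re.split(r"\s{2,}", row.strip())


# Hypothetical sample standing in for the lines read from one page.
lines = [
    "Invoice 1001",
    "Order details",
    "Item    Qty    Price",
    "Widget    2    9.99",
    "",
    "Gadget    1    24.50",
    "Total due: 44.48",
]

rows = extract_table_rows(lines, "Order details", "Total due")
table = [split_cells(row) for row in rows]
```

If the OCR output does not preserve wide gaps between columns, splitting on individual words and grouping them by position would be the alternative, which is what looking at the words on each line amounts to.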

One trick to help you with this is the Developer Tools in the PDF Viewer. You can use this to highlight the various lines, words, and segments in the PDF to determine how you might navigate it. Here is a quick example with the item you provided. It works best with number 4, but 5 works as well, although I don’t see all of the first line for some reason. I am investigating that though.

OCR_PDF_Table.zip (45.2 KB)
