How to extract a table from a PDF when the table spans multiple pages

Hi,

We are attempting to extract a table that spans multiple pages from a readable PDF file.

The whole PDF is highlighted when we try to select the table region (please refer to the attachment).

Is there any way to extract a table that spans multiple pages from a readable PDF?

@ManikandanT17003272 Is it possible to attach an example of that PDF (without any real data of course)? I believe it is possible to ignore the headers and footers, but it is not something I do regularly, so having the PDF to test with would be helpful.

@ThomasSasnett

Continuing from Mani’s post: please find the test PDF attached.

MultiPageInvoice.pdf (78.9 KB)

@AbhishekR17024116 I believe there is something odd with this specific PDF. I have opened a support request to get an explanation as to why this table is being misread. The INC is INC-B434.

Normally, you can simply elect to have the table span pages. In this case, however, the table seems to include the entire page. While it is possible to read and work with this, it is not ideal. If you had to work with this PDF now, you would get an extra column that essentially splits the Amount column into two parts; you could join them together in your automation to get the full value. The table would also contain most of the other values on each page, so you would need to exclude the information you do not want. I believe there is an explanation for this PDF, though, and I will update once I hear back from support.
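
For anyone who needs that workaround now, the join-and-filter step could look roughly like the sketch below. It is plain Python rather than an actual automation, and it is only an illustration: the sample rows, the position of the Amount column, and the rule "a real table row starts with an item number" are all assumptions made for the example, not something taken from MultiPageInvoice.pdf.

```python
import re

# Hypothetical rows, shaped like a full-page table read might return them:
# the Amount value is split across the last two cells, and page text
# (headers, footers) appears as extra rows.
raw_rows = [
    ["Invoice No: 1001", "", "", "", ""],
    ["Item", "Description", "Qty", "Amount", ""],
    ["1", "Widget", "2", "1,2", "00.00"],
    ["2", "Gadget", "1", "35", "0.50"],
    ["Page 1 of 2", "", "", "", ""],
    ["3", "Cable", "5", "12", ".75"],
]

# Assumption: genuine line items have a purely numeric first cell.
ITEM_NUMBER = re.compile(r"^\d+$")


def clean_rows(rows, amount_index=3):
    """Keep only line-item rows and merge the two halves of the Amount cell."""
    cleaned = []
    for row in rows:
        # Skip empty rows and anything that is not a line item.
        if not row or not ITEM_NUMBER.match(row[0].strip()):
            continue
        # Join the split Amount column back into one value.
        amount = row[amount_index].strip() + row[amount_index + 1].strip()
        cleaned.append(row[:amount_index] + [amount])
    return cleaned


for row in clean_rows(raw_rows):
    print(row)
```

The filter rule is the part most likely to need changing for a real document; any test that reliably separates table rows from the rest of the page text would do.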

@ThomasSasnett The customer has opened INC-B1427 on this issue and the one I opened has been closed.

@AbhishekR17024116 @ManikandanT17003272 BUG-849540 and INC-B434 (Issue with PDF Table in 22.1.24) were closed with the conclusion that the PDF file is not supported:

The issue investigated was why the vertical and horizontal lines are not counted correctly after reduction.

The PDF was found to have non-visible table lines: lines that are not displayed but are listed in the structure of the PDF. This causes the table recognition code to interpret the table as extending from the top to the bottom of the PDF page.

As a result, the PDF connector's table recognition code identifies the table as the full page. This structure is atypical of an ordinary PDF table.

A feature enhancement would be required for the PDF connector to distinguish visible from non-visible table lines in such a PDF structure.

If support for such a PDF is needed, please request assistance from our GCS team in placing a feature enhancement request through the ticketing system.
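
To illustrate why non-visible lines make the table look like the full page, here is a small conceptual sketch in Python. It is not how the PDF connector works internally; the `Line` record, the `stroke_width` field, and the rule "width 0 means not drawn" are hypothetical and exist only to show the effect of counting or ignoring such lines when estimating the table region.

```python
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical record for one ruling line found in a page's structure.
    x0: float
    y0: float
    x1: float
    y1: float
    stroke_width: float  # assumption: 0 means the line is never drawn


def bounding_box(lines):
    """Smallest rectangle (left, bottom, right, top) enclosing all lines."""
    xs = [v for line in lines for v in (line.x0, line.x1)]
    ys = [v for line in lines for v in (line.y0, line.y1)]
    return (min(xs), min(ys), max(xs), max(ys))


def visible_only(lines):
    """Drop lines that exist in the structure but are not displayed."""
    return [line for line in lines if line.stroke_width > 0]


page_lines = [
    # Non-visible lines running along the edges of a 612 x 792 page.
    Line(0, 0, 612, 0, 0.0),
    Line(0, 792, 612, 792, 0.0),
    Line(0, 0, 0, 792, 0.0),
    Line(612, 0, 612, 792, 0.0),
    # Visible border of the actual table.
    Line(50, 200, 560, 200, 1.0),
    Line(50, 600, 560, 600, 1.0),
    Line(50, 200, 50, 600, 1.0),
    Line(560, 200, 560, 600, 1.0),
]

print(bounding_box(page_lines))                # region covers the whole page
print(bounding_box(visible_only(page_lines)))  # region covers only the table
```

With every line counted, the estimated region is the whole page, which matches the behavior described above; once the non-visible lines are filtered out, the region shrinks to the drawn table.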

cc @ThomasSasnett cc @Mitchell

@ThomasSasnett Here is a link to the documentation on working with PDFs.

https://docs.pega.com/bundle/robotic-automation-221/page/robotic-automation/pdf-connector/usepdfconnector-component.html

@ThomasSasnett INC-B1427 does not relate to PDF but to Pega.ServerDeploy overrides.

INC-B434 Pega Support ticket (Issue with PDF Table in 22.1.24) is still open!

GCS will contact you today with regard to testing out the PDF connector.

Update: BUG-849540 has been logged and the team is investigating further.