
From Scanned PDFs to Financial Excel sheets: Leveraging GPT-4o and LlamaParse for Data Extraction

By Dilbagh Dhindsa, Innovation Head, AI and Data Analytics

Posted on Jul 31, 2024


We have often struggled to extract information from low-quality or distorted text in scanned PDF documents, and the problem is especially acute for tabular data such as financial statements.

Traditional OCR solutions struggled to accurately extract tabular financial statements from scanned PDFs because of issues with layout recognition and formatting. Complex table structures, such as nested tables and merged cells, complicated the process further. These tools were also prone to errors when recognizing and converting dates, amounts, and currency symbols.

A lot of progress has been made, and advanced OCR solutions are now available for extracting tabular data. We recently used one such modern solution, LlamaParse, an offering from LlamaIndex, to extract tabular data.

LlamaParse is a modern solution for table recognition and extraction that leverages large language models to identify and extract tables from scanned PDFs, including financial statements.

We need an API key for LlamaParse, which can be obtained by logging in at https://cloud.llamaindex.ai/login. The free tier covers up to 1000 pages of data extraction. Optionally, an OpenAI GPT-4o key can be passed for better results.

LlamaParse provides a simple interface to configure the parser and extract data from documents. We need to pass the Llama Cloud API key, the result type (markdown or text), the GPT-4o mode flag, and an OpenAI key to improve results with GPT-4o. We then load the scanned PDF document.
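The configuration described above can be sketched roughly as follows. This is a minimal sketch, not the exact code we ran: it assumes the `llama-parse` Python package and its `LlamaParse` class with the `api_key`, `result_type`, `gpt4o_mode`, and `gpt4o_api_key` arguments, and it assumes the keys are stored in the `LLAMA_CLOUD_API_KEY` and `OPENAI_API_KEY` environment variables. The helper names `build_parser_options` and `parse_scanned_pdf` are hypothetical.

```python
import os
from typing import Any, Optional


def build_parser_options(
    llama_cloud_key: str,
    openai_key: Optional[str] = None,
    result_type: str = "markdown",
) -> dict[str, Any]:
    """Collect the keyword arguments to pass to LlamaParse (hypothetical helper)."""
    if result_type not in ("markdown", "text"):
        raise ValueError("result_type must be 'markdown' or 'text'")
    options: dict[str, Any] = {
        "api_key": llama_cloud_key,
        "result_type": result_type,
    }
    if openai_key:
        # GPT-4o mode is optional, so enable it only when a key is supplied.
        options["gpt4o_mode"] = True
        options["gpt4o_api_key"] = openai_key
    return options


def parse_scanned_pdf(pdf_path: str) -> list:
    """Parse one scanned PDF and return the documents LlamaParse produces."""
    # Imported lazily so build_parser_options works without the package installed.
    from llama_parse import LlamaParse

    options = build_parser_options(
        llama_cloud_key=os.environ["LLAMA_CLOUD_API_KEY"],
        openai_key=os.environ.get("OPENAI_API_KEY"),
    )
    parser = LlamaParse(**options)
    return parser.load_data(pdf_path)
```

With both keys set, calling `parse_scanned_pdf("financial_statement.pdf")` (a placeholder file name) should return a list of documents whose text holds the parsed markdown. Keeping the options in a separate helper makes it easy to switch between markdown and text output without touching the parsing call.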

Extracted tables are stored in CSV files. Depending on the size of the scanned PDF and the number of tables it contains, we had to adjust similarity_top_k (the number of top-matching chunks returned at retrieval time) to extract the required tables.
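One way to get from parsed markdown to CSV files is sketched below. It is an illustration under stated assumptions rather than our actual pipeline: it assumes the parser returned markdown in which tables appear as pipe-delimited rows, it skips the retrieval step where similarity_top_k applies, and the function names `extract_tables` and `write_tables_to_csv` are hypothetical. It uses only the Python standard library and does not handle escaped pipes inside cells.

```python
import csv
import re
from pathlib import Path

# Matches one cell of a markdown header separator row, e.g. ---, :--, --:
_SEPARATOR_CELL = re.compile(r":?-+:?")


def _split_row(line: str) -> list[str]:
    """Split one pipe-delimited markdown row into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def _is_separator(cells: list[str]) -> bool:
    """Return True when every cell looks like a header separator."""
    return all(_SEPARATOR_CELL.fullmatch(cell) for cell in cells)


def extract_tables(markdown: str) -> list[list[list[str]]]:
    """Return every pipe table found in the markdown as a list of rows."""
    tables: list[list[list[str]]] = []
    current: list[list[str]] = []
    for line in markdown.splitlines():
        if line.strip().startswith("|"):
            cells = _split_row(line)
            if not _is_separator(cells):
                current.append(cells)
        elif current:
            # A non-table line closes the table collected so far.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def write_tables_to_csv(markdown: str, out_dir: str, stem: str = "table") -> list[Path]:
    """Write each extracted table to <out_dir>/<stem>_<n>.csv and return the paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, rows in enumerate(extract_tables(markdown), start=1):
        path = target / f"{stem}_{number}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths
```

Because the `csv` module quotes cells that contain commas, amounts such as 1,200 stay in a single column when the file is opened in a spreadsheet.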

LlamaParse is part of Llama Cloud and helps parse complex documents with embedded objects such as tables and figures. In our experience, it produced high-quality, accurate results when parsing scanned financial statements.


About Dilbagh Dhindsa

Innovation Head AI and Data Analytics

Dilbagh is a hands-on leader in Generative AI, AI/ML engineering, Data Science and software development, with over 20 years of international experience. He has developed groundbreaking AI and Generative AI solutions for global customers that helped solve complex business problems and optimize processes.

He has developed GenAI Accelerators for generating sections of a SoW (Statement of Work) using innovative metadata-driven dynamic chunk mapping. A US patent has been filed for the solution. Other GenAI solutions include Secure Private GPT, an email processor for license information, and a recruitment tool for matching JDs with resumes and chat.