Forgent AI team, May 2025
Executive Summary
We built a module for LLM-based text extraction of German tenders. These are 30–50-page documents (sometimes 500+ pages) written in bureaucratic/legal German and containing hundreds of requirements. We were surprised by how poorly many out-of-the-box solutions performed on this task, since text extraction is such a common problem. Through proper model choice and prompt engineering, we were able to drastically increase the reliability of the read-out — specifically, the percentage of correctly extracted requirements (measured mainly by recall) rose from about 70% to more than 95%.
During the implementation, we had many learnings and benchmarked a range of models and out-of-the-box solutions. Our key learnings are summarized below:
- Deeply Understand Human Processes & Consensus: Before automating, know how humans do it, why they do it that way, and where they agree.
- Modularize, Modularize, Modularize: Break tasks into the smallest testable units.
- Build Your Evaluation Infrastructure and UIs: Building UIs that let you quickly inspect data and results — and spot patterns that lead to better prompts or reveal errors — enables much faster iteration.
- Don’t Trust General Benchmarks — Do Your Own: There is too much hype in the space; don’t believe anything you haven’t tested on your own data or problem.
- Iterate Rapidly with a Small Evaluation Set: Start with a small, high-quality evaluation set, optimize directionally, and iterate quickly.
- Know Your Metrics (and Use Them Wisely): Use a combination of precision, recall, F1, latency, and cost to get a balanced view, and, where possible, factor in cost and other constraints. Cost, API rate limits, and speed are all crucial for a product.
- Build for the Future: Model capabilities advance incredibly fast. Stick to the latest frontier models, since performance jumps are still large enough to enable results that were previously impossible.
Background
At Forgent AI, we are developing an AI product for public procurement. Here, text-heavy workflows are common due to the bureaucratic burden in European states. We are trying to change this by pushing the boundaries of what’s possible with artificial intelligence. Recently, we embarked on a journey to build a key module in our MVP that turned out to be a particularly thorny challenge: high-fidelity extraction of requirements (‘Anforderungen’) from German tender documents. The goal was to transform dense, complex PDFs into clean, actionable JSON objects using structured output.
This process was, as any R&D effort is, full of many setbacks, breakthroughs, and great lessons. Today, we want to share some of our key learnings on what works and what doesn’t work for high-fidelity text extraction and specifically our latest experiences with the latest generation of Large Language Models (LLMs) and end-to-end solutions for text parsing and extraction like Reducto, LlamaIndex, or Docling. We also share a high-level outcome of our benchmarking results for high-fidelity extraction from German tender documents.
This is not a general guide to building with LLMs, and there are many other great blogs and resources for this, e.g. here or here for building LLM products, here for agents, or here for evals. We focus rather on our learnings and results for reliable text extraction — something a lot of people need — but which is surprisingly still more challenging than expected (as even the LlamaIndex CEO recently stated after months of claims that it is solved). For more on this topic, we also highly recommend the blog posts by Sergey.
We began our process by creating a ground truth data set — a.k.a. evaluation (eval) data. We sourced two real German tender PDF documents consisting of 20–50 A4 pages and asked human experts to manually extract the requirements. Our initial aim was for our models to replicate the experts’ output.
The goal was to use this data to benchmark all models and solutions in parallel to understand how we can achieve maximum performance on text extraction, as in tender applications a single missed requirement can result in disqualification.
We started the benchmark and two days in we stumbled upon the first issue: inconsistency in human extraction. Different experts, despite their domain knowledge, weren’t identifying the same set of requirements. Some, relying on latent knowledge, would skip requirements they deemed “obvious” or commonly known. Others had slightly different interpretations of what constituted a “requirement”. This resulted in the first lesson — something in hindsight perhaps obvious but easier to overlook than expected.
Learning #1: Check Your Eval Data for Consistency. The first rule of any applied ML or AI benchmark is to perform a preliminary data analysis. This can be a manual inspection of the data (e.g. looking at what’s in there and whether it makes sense), but it should also examine the data distribution and similar properties. In the AI world, preliminary data analysis must also include a check for consistency or bias — e.g. how much your human reviewers agree in their ratings. We realized our initial dataset was inherently biased by individual expert preferences and unstated knowledge. This ‘latent’ knowledge prevents any LLM (and in fact any human) from performing the extraction task exactly as a given expert does.
This became one of our key learnings: understand exactly how the human performs the task and why, understand how much consensus there is between different human experts (and abandon the task if there is none), and then define a consistent process that can actually be solved.
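As a rough illustration, a consistency check over two experts’ requirement lists could look like the sketch below (the fuzzy-matching helper and the 0.7 threshold are illustrative assumptions, not our actual tooling):

```python
# A minimal sketch of an inter-annotator consistency check: two experts' requirement
# lists are fuzzily matched and the overlap is reported as a Jaccard-style score.
from difflib import SequenceMatcher


def is_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Treat two requirement strings as the same if they are sufficiently similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def agreement(expert_a: list[str], expert_b: list[str]) -> float:
    """Jaccard-style agreement between two experts' extracted requirement sets."""
    remaining_b = list(expert_b)
    matched = 0
    for req in expert_a:
        hit = next((other for other in remaining_b if is_match(req, other)), None)
        if hit is not None:
            matched += 1
            remaining_b.remove(hit)  # each of expert B's requirements can match only once
    union = len(expert_a) + len(expert_b) - matched
    return matched / union if union else 1.0


expert_1 = ["Bieter muss ISO 27001 zertifiziert sein", "Angebot in deutscher Sprache"]
expert_2 = ["Angebot ist in deutscher Sprache einzureichen"]
print(f"Agreement: {agreement(expert_1, expert_2):.2f}")  # 0.50: the experts only partially agree
```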
Instead of trying to mimic an ill-defined, preference-laden human process, we therefore redefined the task:
- First, extract all potential requirements comprehensively (which turned out to be hundreds of requirements per document).
- Then, develop a separate mechanism to filter these requirements based on user-defined preferences or criteria — specific to a single user or subset of consistent users.
This shift also underscored another important lesson: Modularize your tasks as much as possible. Breaking down the problem allowed for more focused testing and iteration at each step and hence much faster iteration and better guardrails and testing.
Learning #2: Build a good eval infrastructure and UI. When running tens to hundreds of experiments, it’s very hard to keep track of what you have done. Initially, we ran our evaluations in Jupyter notebooks as it was easier to implement and iterate on these. We traced all our experiments in Langfuse and checked the results there. However, we quickly realized that Langfuse just didn’t allow us to review our data and results, or filter and query them to understand where the biggest improvements came from. We also weren’t able to let people without the technical know-how (e.g. tender experts) optimize the prompts.
This led to our second big learning: investing in good eval infrastructure and UIs to view results is critical for detecting patterns and iterating quickly. This aligns with other people’s observations, for example Hamel’s blog post here, which we highly recommend. Building these UIs was a game changer for us and enabled much faster iterations toward the desired outcomes. Building useful UIs has become so simple with modern (AI) coding tools that there is simply no excuse not to do it.
In our case, we built everything on the same structure as our product, so building the UIs took barely any effort and reused most of our production code.
Below are examples from our evaluation UI. Here any user (technical or non-technical) can select the model(s), prompts, extraction schema, and other parameters and then perform runs to test and optimize the system. It also lets the user run on either the training or the hold-out dataset (the latter only for final runs after optimization, to test generalization).
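For a sense of what parameters such a run involves, a hypothetical run configuration could look roughly like this (field names are illustrative, not our actual schema):

```python
# A hypothetical run configuration behind the eval UI (illustrative field names only).
from dataclasses import dataclass


@dataclass
class EvalRunConfig:
    model: str = "gemini-2.5-flash"            # model under test
    prompt_id: str = "extract_all_v12"         # versioned prompt template
    extraction_schema: str = "requirement_v3"  # structured-output schema
    dataset_split: str = "train"               # "train" for iteration, "holdout" for final runs
    chunk_pages: int = 5                       # pages per chunk
    chunk_overlap: int = 1                     # overlapping pages between chunks
    n_repeats: int = 5                         # repeated runs to estimate variance
    temperature: float = 0.0


final_run = EvalRunConfig(model="gemini-2.5-pro", dataset_split="holdout", n_repeats=10)
```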
Once a run is completed, the user can see all metrics conveniently listed for the experiment. Below we see an example of a dashboard for a Reducto run. These are averages across multiple repeated runs, and we therefore also display the standard deviation per metric.
Finally, if a user wants to investigate a specific run, they can examine the particular behaviour of the run in detail. This is shown below. The good thing about this view is that it enables the user to visually inspect the output as well (bottom of the page).
The input document is shown to the user (left) and the extracted requirements are highlighted in the text (yellow). At the same time the list of extracted requirements is displayed on the right. This enables the user to manually review what exactly has been extracted in the run and visually inspect the results for any unusual behaviour.
Learning #3: Build small, high-quality eval datasets, optimize directionally, and iterate rapidly. Our friend Ali at DeepMind explained to us recently that the best approach to optimizing a product is building small eval data sets and using them to iterate rapidly. Similarly, a colleague at OpenAI mentioned that they have observed that customers or partners often build large eval data sets — only to find these are no longer relevant a few weeks later (e.g. because the problem, the metric, or something else has changed). During the process of optimizing the product, and also developing the benchmark, we continuously learned new things about (a) the data, (b) the metrics that we were chasing, and (c) how models behaved. If we had built a massive eval data set first and then started the optimization, we likely would have ended up doing it all for nothing, since there are so many learnings along the way — and bottlenecks (and hence data) often shift once you understand the problem better. Working in this manner also prevents what Shankar et al. (2024) dubbed ‘criteria drift’. To cite from their paper:
“We observed a “catch-22” situation: to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs. Even when participants graded first, we observed that they still refined their criteria upon further grading, even going back to change previous grades. Thus, our findings suggest that users need evaluation assistants to support rapid iteration over criteria and implementations simultaneously.”
An additional piece of advice here: for us, trust in our evals was essential. We therefore also implement unit tests for every eval. For example, for the requirements we generated a mock data set where we can easily calculate the expected results (even by hand). Concretely, we generated artificial model outputs (synthetic model data) with known errors (e.g. one false positive or one false negative). These are then scored by the eval, and the results can be confirmed by hand. This is also useful whenever evals change, as it makes it easy to validate that they still function correctly. We usually tested four different scenarios: no errors, one false positive, one false negative, and one false positive plus one false negative. Note that true negatives are ill-defined in this setting, so accuracy and related metrics are not measured. One can of course extend these to large numbers for testing purposes. We highly recommend using LLMs to generate synthetic data here.
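A minimal sketch of such an eval unit test, using a simple exact-match stand-in for the real eval, could look like this:

```python
# Eval unit tests: synthetic model outputs with known errors are scored by the eval,
# and the expected metrics are confirmed by hand. `evaluate` is a simplified
# exact-match stand-in for the actual eval.


def evaluate(predicted: list[str], ground_truth: list[str]) -> dict[str, float]:
    tp = len(set(predicted) & set(ground_truth))
    fp = len(set(predicted) - set(ground_truth))
    fn = len(set(ground_truth) - set(predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


ground_truth = ["req A", "req B", "req C", "req D"]

# Scenario 1: no errors -> precision = recall = 1.0
assert evaluate(ground_truth, ground_truth) == {"precision": 1.0, "recall": 1.0}

# Scenario 2: one false positive -> precision = 4/5, recall = 1.0
assert evaluate(ground_truth + ["hallucinated req"], ground_truth) == {"precision": 0.8, "recall": 1.0}

# Scenario 3: one false negative -> precision = 1.0, recall = 3/4
assert evaluate(ground_truth[:-1], ground_truth) == {"precision": 1.0, "recall": 0.75}

# Scenario 4: one false positive and one false negative -> precision = recall = 3/4
assert evaluate(ground_truth[:-1] + ["hallucinated req"], ground_truth) == {"precision": 0.75, "recall": 0.75}
```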
Next, we turn to our benchmarking results and the lessons learned from these.
Extracting text from PDFs, especially complex tender documents, isn’t just about getting words onto a page. It’s about understanding structure, context, and nuance to pull out specific, well-defined entities with high precision and recall. This is what we mean by “high-fidelity text extraction.” Simply put, we don’t just want a lot of text; we want the right text, correctly formatted. For our application specifically, we care most about being complete — i.e. maximizing recall is more important than precision. It is still advisable and best practice to always look at multiple relevant metrics at once to fully understand model behavior in practice. Note that this might differ for your use case.
Text extraction is a hot topic as a vast range of LLM workflows rely on text in one way or another. There are many benchmarks and systems out there that claim to solve the problem.
From Mistral to LlamaIndex to Reducto, many companies have claimed their models are state of the art.
For example, LlamaIndex CEO Jerry Liu posted in March about Mistral’s OCR models and shared an internal benchmark where he claimed that LlamaParse as well as other standard models (Gemini 2.0, OpenAI models, Sonnet 3.5/3.7, etc.) beat Mistral’s OCR (results below).
Another provider, OmniAI, claims that their model outperformed Gemini 2.0 Flash among other providers like Unstructured on their own benchmarks — albeit at substantially higher cost (Source).
Reducto offers a similar service, and has been considered the gold standard in this space for a while. They published their own benchmark results (RD-Table Bench) and reported superior performance compared to GPT-4o and Sonnet 3.5.
Although their results now seem outdated, others previously reported strong results using Reducto even compared to Gemini 2.0 Flash — albeit at significantly increased cost, similar to OmniAI (below image from Sergey’s benchmarks).
One of the more recent additions is Docling, an open-source project that has received a good amount of attention. In some benchmarks Docling also outperformed other commercial solutions on certain metrics (e.g. below benchmark from Procycons).
Next, we summarize the results of our own benchmarks for some of the most common models and text parsing/extraction providers. Everything was evaluated based on n=5 repeats (i.e. repeated extractions with the same model) across our data set of ~500 requirements (entities) from 2 tender documents (30–50 pages each). Each tender document contained a few hundred requirements that needed to be reliably extracted. We assessed the extraction by comparing, using set logic, the set of requirements output by the model with the set of ground-truth requirements. Specifically, we embedded each requirement and ran a cosine similarity search; if there was a match (using a similarity threshold of 0.95, calibrated on our data for the specific embedding model), we removed it from the set. Based on the number of matches we then obtained a confusion matrix and used it to calculate precision, recall, and F1 scores, each averaged over the 5 runs. We later performed larger runs (n=10, n=50) to confirm selected results, and these were generally consistent. Where we report concrete numbers below, we refer to the averages, and we only report these if the variance was sufficiently low for a practical application. Where a model showed high variance, this is noted. Please note that we report a more qualitative rather than quantitative summary of our results. Note: all runs were performed at zero temperature (where possible).
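Sketched in code, the matching and scoring for a single run could look roughly like this (the embedding model shown is an arbitrary multilingual choice rather than necessarily the one we used; 0.95 is the threshold we calibrated for our data and embedding model):

```python
# Set-based matching of predicted vs. ground-truth requirements via embeddings and
# cosine similarity; the resulting confusion counts give precision, recall, and F1.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice


def score_extraction(predicted: list[str], ground_truth: list[str], threshold: float = 0.95) -> dict[str, float]:
    pred_emb = encoder.encode(predicted, normalize_embeddings=True)
    gt_emb = encoder.encode(ground_truth, normalize_embeddings=True)

    unmatched_gt = list(range(len(ground_truth)))
    tp = 0
    for p in pred_emb:
        # cosine similarity against all still-unmatched ground-truth requirements
        sims = {g: float(np.dot(p, gt_emb[g])) for g in unmatched_gt}
        if sims and max(sims.values()) >= threshold:
            tp += 1
            unmatched_gt.remove(max(sims, key=sims.get))  # each ground-truth item matches at most once

    fp = len(predicted) - tp  # predictions without a ground-truth match
    fn = len(unmatched_gt)    # ground-truth requirements that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

The per-run scores from this function are then averaged over the repeated runs for each model.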
Disclaimer: Model performance can vary hugely based on prompt engineering, document language, and many other factors. The below is hence not necessarily a representation of their general performance but just highlights our own experience with these models on our own data and benchmarks. These results hence might not be indicative for other applications and only represent a snapshot in time.
We note that we did not use more autonomous approaches (e.g. ‘agents’) in this benchmark as we wanted to establish a clear baseline on the models. More complex systems likely would have made the evaluation and comparison even harder.
Gemini 2.5 Pro & Flash: The New Powerhouses
Google’s Gemini models were at the heart of our experimentation because of their performance and cost:
- Performance: Gemini 2.5 Pro showed exceptional capability with typically around 98% recall. Our best results consistently came from combining chunking with Gemini 2.5 (Pro or Flash). Gemini 2.5 Pro was also the only model which gave top performance on full documents, i.e. direct extraction of hundreds of requirements without chunking. This is notable, since even o3 failed to deliver here and returned plenty of hallucinations or ignored entire pages. While we tested putting all instructions in the system prompt (with only the content in the user prompt), we found that keeping the system prompt short and placing most instructions in the actual prompt generally performed best (see the extraction sketch after this list).
- Few-Shot Prompting: Including 2–5 examples in our prompts boosted recall by an average of 5%. Interestingly, providing too many examples (>8) sometimes decreased performance — a reminder that more isn’t always better, a nuance also seen in prompt engineering best practices where the quality and relevance of examples often trump quantity. Selecting specific examples for performance was also critical, an insight that is also consistent with the literature.
- Prompt Language: For Gemini 2.5 Flash, we obtained best performance with a German prompt. However, with an English prompt, 2.5 Pro still outperformed Flash slightly in overall performance (97% vs. 95.6% on our internal recall metrics for the “extract all” phase).
- Pro vs. Flash Trade-offs: Our experience was that Pro offered marginally better peak performance (with chunking), while Flash was more reliable for completions, roughly 2x faster for us, and much cheaper. However, Pro’s extraction was unparalleled on entire documents, as mentioned earlier. This aligns with observations from others; for example, DocsBot.ai’s comparison (May 2025) of Gemini 2.5 Flash and 2.5 Pro notes significant cost differences (2.5 Pro being nearly 10x more expensive for input/output tokens than 2.5 Flash), reinforcing the practical need for models like Flash. Perhaps the best strategy here is to do 90% of experiments on Flash, then the final 10% on Pro for the hard cases.
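As referenced in the list above, here is a minimal sketch of the kind of chunk-wise, structured-output extraction call we run against Gemini (based on the google-genai Python SDK; the prompt, schema, and few-shot examples are simplified illustrations, not our production versions):

```python
# Chunk-wise structured extraction with Gemini: short system prompt, German user prompt
# with a few examples, JSON output constrained to a Pydantic schema, temperature 0.
from google import genai
from google.genai import types
from pydantic import BaseModel


class Requirement(BaseModel):
    text: str  # the requirement, verbatim or lightly normalized
    page: int  # page the requirement was found on


FEW_SHOT = (
    'Beispiel: "Der Bieter muss eine ISO-27001-Zertifizierung nachweisen." -> Anforderung\n'
    'Beispiel: "Die Vergabestelle behält sich Rückfragen vor." -> keine Anforderung'
)

client = genai.Client()  # expects a Gemini API key in the environment


def extract_requirements(chunk_text: str, model: str = "gemini-2.5-flash") -> list[Requirement]:
    prompt = (
        "Extrahiere alle Anforderungen aus dem folgenden Ausschreibungsauszug.\n\n"
        f"{FEW_SHOT}\n\nDokument:\n{chunk_text}"
    )
    response = client.models.generate_content(
        model=model,
        contents=prompt,
        config=types.GenerateContentConfig(
            system_instruction="Du extrahierst Anforderungen aus deutschen Ausschreibungen.",  # kept short
            temperature=0.0,
            response_mime_type="application/json",
            response_schema=list[Requirement],
        ),
    )
    return response.parsed or []
```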
Reducto — past performance seems to fade against new models:
Reducto was earlier this year still on par with Gemini 2.0 — and indeed outperformed it on multiple benchmarks (even if so by a small margin only) — see e.g. the one by Sergey. With the 2.5 model series this advantage seems to be gone entirely and Reducto, the previous gold standard, was not competitive despite several iterations of prompt engineering and changing the process. Indeed, we struggled to consistently achieve a recall above 80% versus 95%+ with the Gemini 2.5 models. In particular, given the high cost of Reducto (versus e.g. Flash 2.5) we decided against using it for this process. We also noted that Reducto had a much higher variance compared to other providers which we were not able to reduce. We note that we used Reducto’s own chunkers and even with their chunking we were not able to achieve a higher performance. For other methods (e.g. Gemini) which didn’t have native chunking we implemented a page or multi-page based chunking (see below).
Interestingly, in a curious bit of prompt engineering, “threatening” the model (e.g. ‘if you don’t deliver I will lose my job’) led to a ~10% increase in recall (with a drop in precision). This highlights the sometimes non-intuitive, almost psychological aspects of interacting with current LLMs, a phenomenon less documented in formal benchmarks but known anecdotally in the prompt engineering community. How much this would translate to more recent reasoning models, however, is hard to know. Perhaps this is also an indication that Reducto relies under the hood on older models that seem to be less competitive for our task (e.g. 4o).
OpenAI o3 and 4o
Surprisingly, the most recent OpenAI reasoning model, o3, failed when tested on entire documents. For example, o3’s performance for entire-document ingestion (after prompt optimization) was about 60% recall, surprisingly far behind Gemini 2.5. Indeed, o3 generally seemed to hallucinate a lot more or ignore entire pages. On the other hand, chunking the input documents (e.g. 5 pages at a time with 1 page of overlap) resulted in performance nearly on par with all other models.
Docling + LLMs: A Funnel Approach
We also explored Docling (see here for their technical report) for parsing combined with Gemini for extraction. This strategy often yielded near-perfect recall (98% or higher). While these were strong results, we decided against moving forward with it for two reasons: First, we believe that foundational model performance will further increase and hence outperform other tools. Second, Docling comes with overhead (setting up, running, and maintaining), and since we are still in the MVP phase we decided that the performance doesn’t justify the effort.
Anthropic (Claude Sonnet and Opus)
We also tested Anthropic’s models — specifically Claude 3.7 Sonnet. The model achieved overall decent recall (an average of 83%). However, this came at a much higher cost than e.g. Gemini Flash and still at reduced performance. We further encountered repeated issues with the returned JSON formats. Overall, we hence found Claude less suitable for this application. Again, chunking significantly improved the results, but Claude was still by far the most expensive individual model in our experiments. Claude Opus was released the day after we completed our benchmarks, so we only ran a reduced set of experiments on it. Our overall impression is that it performs similarly to Gemini 2.5 Pro (Opus) / Flash (Sonnet), while the cost is substantially (about 20x) higher. We also observed occasional formatting errors and ran into issues due to the output token limits, which are half the size of Gemini Pro’s (Gemini 2.5 Pro: 64,000; Claude Opus 4: 32,000). Notably, on full documents (without chunking), we repeatedly ran into output token limits with Opus, while Sonnet underperformed notably with a recall of 44% (±2.8% std) compared to the Gemini models.
LlamaIndex:
LlamaIndex generally performed surprisingly badly in our benchmarks. Without implementing our own chunking method, performance was far below average (about 50% recall) compared to other methods. However, chunking — similar to Docling or Reducto — fixed the performance issues (five-page chunks with one-page overlap worked best for us). The underlying model is Gemini 2.0 Flash, so it was not a surprise that it performed worse than 2.5 Flash. After implementing chunking and a deduplication step (using Gemini 2.5 Pro), however, we achieved very high performance (near-perfect recall). This comes, however, at the cost of long runtimes (5–8 minutes per 100 requirements). Given the cost and time, we hence also decided not to progress with LlamaIndex.
Chunking: A Reliable Ally
Regardless of the model, chunking documents proved crucial for reliability and accuracy. This insight is not surprising, and aligns indeed with many reported results and known behavior (e.g. context fading, hallucinations, etc.). We found that 5-page chunks with a 1-page overlap worked best for our specific extraction task. This aligns with general advice — although exact chunking sizes will depend hugely on the application. While our extraction is somewhat local to sections, the overall document structure influences requirement definition, making chunk size and overlap critical in this application as well.
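For reference, a sketch of the page-based chunking we describe (5-page windows with a 1-page overlap), here assuming pypdf for the raw text extraction:

```python
# Page-based chunking: 5-page windows with a 1-page overlap (i.e. a stride of 4 pages).
from pypdf import PdfReader


def chunk_pdf(path: str, pages_per_chunk: int = 5, overlap: int = 1) -> list[str]:
    reader = PdfReader(path)
    page_texts = [page.extract_text() or "" for page in reader.pages]

    stride = pages_per_chunk - overlap
    chunks = []
    for start in range(0, len(page_texts), stride):
        chunks.append("\n\n".join(page_texts[start:start + pages_per_chunk]))
        if start + pages_per_chunk >= len(page_texts):
            break  # the last window already covers the final page
    return chunks


chunks = chunk_pdf("tender.pdf")  # e.g. a 30-page document yields 8 overlapping chunks
```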
Deduplication
Extracting all requirements using chunking often leads to duplicates. Using Gemini 2.5 Pro for deduplication worked best when an LLM was necessary, but it was slow and occasionally erratic. Gemini 2.5 Flash showed much higher deviations here, rendering it less useful. We ultimately decided to implement a clustering algorithm and then use 2.5 Pro to remove duplicates within each cluster. This resulted in the overall highest performance at minimal cost.
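A rough sketch of this two-stage deduplication, with illustrative clustering parameters and a placeholder standing in for the LLM call:

```python
# Two-stage deduplication: cluster requirements by embedding similarity, then let an
# LLM (Gemini 2.5 Pro in our setup) resolve duplicates only within each cluster.
# Clustering parameters and `dedup_with_llm` are illustrative stand-ins.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering  # requires scikit-learn >= 1.2


def dedup_with_llm(cluster: list[str]) -> list[str]:
    """Placeholder for the LLM call that merges near-duplicates within a cluster."""
    return list(dict.fromkeys(cluster))  # stand-in: exact-duplicate removal only


def deduplicate(requirements: list[str], distance_threshold: float = 0.2) -> list[str]:
    if len(requirements) < 2:
        return requirements

    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(requirements, normalize_embeddings=True)

    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,  # calibrate on your own data
        metric="cosine",
        linkage="average",
    ).fit_predict(embeddings)

    deduplicated: list[str] = []
    for label in set(labels):
        cluster = [req for req, l in zip(requirements, labels) if l == label]
        deduplicated.extend(cluster if len(cluster) == 1 else dedup_with_llm(cluster))
    return deduplicated
```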
Overall, running these benchmarks internally resulted in two additional insights.
Learning #4: Don’t trust general benchmarks — do your own. There are many public benchmarks from different model or solution providers, and plenty is written on LinkedIn and X about the performance of different solutions. Most of these didn’t quite hold up in our application. While it is widely claimed that text extraction is solved, many approaches just didn’t deliver the required performance. Therefore, it is best to quickly test and run benchmarks on your own problems and not blindly trust other people’s results. They might simply not apply to your own use case.
Learning #5: Use a balanced set of metrics — and look at the outputs. While this is nothing new, it’s been insightful during our benchmarks to look at multiple metrics at once. This goes beyond the simple performance metrics (accuracy, precision, recall, F1 score) to also include cost or whatever else matters most for your product. Further, it is critical to visually inspect the outputs, since LLM-generated content can sometimes differ in surprising ways. Going deep on the outputs and manually inspecting them makes a big difference and really enabled us to find where models (or prompts) failed.
Learning #6: Models advance quickly. Build for the future, not the past. The overall quality of foundational models improves dramatically and quickly with each generation. Simply jumping from Gemini 2.0 to 2.5 meant that many other solutions were no longer competitive. The latest generation of models is likely just an early indication of where average model capabilities will be six months from today, so it’s usually best to build with it. However, there are exceptions, e.g. when reliability is key; then standard best practices such as chunking still make a difference.
The above learnings interestingly very much resemble the insights shared in the paper “AI Agents That Matter” from last year. The authors specifically advise the following:
- Couple performance evaluations to cost and optimize both in parallel (for maximal real-world impact).
- Don’t take shortcuts with your evaluation (e.g. overfitting to benchmarks is common), as such shortcuts break in real-world environments — e.g. keep hold-out data sets and test on those.
- Don’t trust public benchmarks, as many are overinflated and not indicative of real-world performance.
The AI landscape is evolving at an astonishing pace. With LLM capabilities potentially doubling every few months and cost reducing about 10x every 12 months, we should build for the capabilities models will have in 3–6 months, not just their current state. Staying on the cutting edge and iterating fast is key.
Our MVP journey for requirement extraction has been incredibly enlightening. The challenge of high-fidelity text extraction is significant, but with careful problem definition, modular design, robust evaluation, and leveraging the rapidly advancing capabilities of models like Gemini, we’re making exciting progress.
Stay tuned as we continue to build with AI at Forgent!
P.S. Want to work with us on testing and implementing state-of-the-art AI / agentic workflows? We are always looking for exceptional talent to join us in Berlin. More details here.
Disclaimer: Gemini 2.5 Pro helped us write and edit this text based on our notes and instructions.