How Noxx Uses Validation to Parse Complex Resumes with AI


By masaishi

  • Why traditional OCR methods fail when parsing resumes
  • How LLMs can extract information from OCR text and its spatial coordinates
  • How we built evaluation metrics to compare OCR approaches objectively
  • Why evaluation is essential when building production-ready LLM applications

As a Machine Learning Researcher at Noxx, I developed a method to extract structured data from complex resume PDFs. Noxx builds AI-driven Applicant Tracking Systems that help companies hire overseas talent within budget and time zone constraints by using LLMs that understand resume meaning, not just keywords.

Resumes, with their multi-column layouts and varied formatting styles, present significant challenges for standard text extraction methods. Our approach preserves both text content and spatial information to maintain logical connections between resume elements.

“The key to success, as with any LLM features, is measuring performance and iterating on implementations.” — Anthropic Engineering Blog, Building Effective Agents

The evaluation framework described here has helped us create more reliable AI systems and can be applied to many LLM-based applications beyond resume parsing. Whether you’re working with documents, images, or structured data, the techniques for combining OCR with LLMs outlined in this article will provide valuable insights for your own projects.

Many resumes come in complex layouts that break sequential text flow. Let’s look at two common examples:

Note: Sample resumes generated by Claude 3.7 Sonnet
  1. Side-by-Side Layout (left) — Related information appears in left and right columns (like company names on left, job details on right)
  2. Split Layout (right) — Different information types appear in separate columns (like skills on left, experience on right)

Standard OCR reads text sequentially (top-to-bottom, left-to-right), causing problems:

  • Job titles get attached to the wrong companies
  • Skills get mixed with experience descriptions
  • The chronological narrative becomes fragmented

Instead of extracting just text, we capture where text appears on the page using bounding box (BBox) coordinates. This maintains the spatial relationships between resume elements.

For example, we transform:

"Senior Software Engineer"

Into this spatially-aware format:

{
  "text": "Senior Software Engineer",
  "bbox": { "x0": 120, "y0": 310, "x1": 320, "y1": 330 }
}
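To see why this helps downstream, consider that with coordinates attached, code (or the LLM itself) can tell which fragments sit on the same visual row. Here is a minimal illustrative helper, not Noxx's actual pipeline, that linearizes items in reading order across columns:

interface OcrItem {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// Treat items whose vertical centers fall within a tolerance as the same
// visual row, so side-by-side columns stay logically connected.
function sortByVisualRows(items: OcrItem[], tolerance = 10): OcrItem[] {
  return [...items].sort((a, b) => {
    const yA = (a.bbox.y0 + a.bbox.y1) / 2;
    const yB = (b.bbox.y0 + b.bbox.y1) / 2;
    if (Math.abs(yA - yB) > tolerance) return yA - yB; // different rows: top first
    return a.bbox.x0 - b.bbox.x0; // same row: left to right
  });
}

Sorting by row first and column second keeps a company name and the job details beside it adjacent in the output stream.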

When processing large volumes of resumes, efficiency becomes critical. We compared several formats to find the most token-efficient approach:

Our testing revealed significant differences:

  • CSV used the fewest tokens (634 tokens)
  • ltwh notation came in second (758 tokens)
  • JSON and XML were the most verbose (1,300+ tokens)
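This comparison is easy to reproduce: serialize the same OCR data in each format and count tokens. A minimal sketch, assuming the js-tiktoken library and the cl100k_base encoding (the actual benchmark lives in the repository linked below):

import { getEncoding } from "js-tiktoken";

// Count tokens for the same OCR line serialized in different formats.
const enc = getEncoding("cl100k_base"); // assumption: GPT-4-class tokenizer

const formats: Record<string, string> = {
  json: `{"text":"Senior Software Engineer","bbox":{"x0":120,"y0":310,"x1":320,"y1":330}}`,
  csv: `Senior Software Engineer,120,310,320,330`,
  ltwh: `Senior Software Engineer, ltwh 120 310 200 20`,
};

for (const [name, payload] of Object.entries(formats)) {
  console.log(name, enc.encode(payload).length, "tokens");
}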

While CSV used the fewest tokens, we discovered that as the resume text increased, its accuracy gradually declined. My hypothesis:

CSV lacks enough meta-information to help the model retain context. When inputs get long, the model may forget what each number represents, leading to confusion.

Therefore, we ultimately adopted “ltwh” notation (left, top, width, height) as it provides the best balance between accuracy and token reduction.

Senior Software Engineer, ltwh 120 310 200 20
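Converting from corner coordinates to ltwh is a one-line transform. A small sketch (the function name is our own, not from the Noxx codebase):

interface BBox { x0: number; y0: number; x1: number; y1: number; }

// Convert corner coordinates (x0, y0, x1, y1) to compact ltwh notation.
function toLtwh(text: string, { x0, y0, x1, y1 }: BBox): string {
  return `${text}, ltwh ${x0} ${y0} ${x1 - x0} ${y1 - y0}`;
}

toLtwh("Senior Software Engineer", { x0: 120, y0: 310, x1: 320, y1: 330 });
// => "Senior Software Engineer, ltwh 120 310 200 20"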

To determine the most effective OCR approach for resume parsing, I evaluated multiple OCR methods including standard text extraction and those that preserve spatial information through bounding boxes. I built a comprehensive CLI tool (available at https://github.com/knot-inc/blog-ocr-exps) that benchmarks different OCR pipelines against ground-truth data.

Our evaluation included the following processors:

  • mistralOcr.ts: Custom OCR implementation using Mistral API to process image inputs
  • processImage.ts: Direct image processing approach
  • tesseractOcr.ts: Standard text extraction using Tesseract OCR
  • tesseractOcrWithBbox.ts: Tesseract OCR with bounding box coordinates
  • tesseractOcrWithImage.ts: Image-based approach with Tesseract
  • textractOcr.ts: AWS Textract for standard text extraction
  • textractOcrWithBbox.ts: AWS Textract with bounding box information
  • textractOcrWithImage.ts: Image-based approach with AWS Textract

Accuracy Comparison

To calculate match rates, I created a ground-truth schema aligned with our extraction goals. The schema follows this Zod structure:

workExperiences: z.array(
  z.object({
    title: z.string().optional(),
    company: z.string().optional().describe("Company name only"),
    description: z.string().optional(),
    startDate: z.string().optional(),
    endDate: z.string().optional(),
  }),
),
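The article doesn't show the extraction call itself, but a Zod schema like this can drive structured LLM output directly. A rough sketch using the Vercel AI SDK's generateObject, which is our assumption and not necessarily what Noxx uses in production:

import { z } from "zod";
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical wrapper schema; the article only shows the workExperiences fragment.
const resumeSchema = z.object({
  workExperiences: z.array(
    z.object({
      title: z.string().optional(),
      company: z.string().optional().describe("Company name only"),
      description: z.string().optional(),
      startDate: z.string().optional(),
      endDate: z.string().optional(),
    }),
  ),
});

// Sketch: feed spatially-aware OCR text to the model and parse against the schema.
export async function extractResume(ocrText: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"), // assumption: any structured-output-capable model works
    schema: resumeSchema,
    prompt: `Extract work experiences from this OCR output (ltwh coordinates included):\n${ocrText}`,
  });
  return object;
}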

For each extracted field, I computed Jaccard Similarity between the ground-truth and extracted strings to measure the accuracy of information extraction. This provides a quantitative measure of how well each method captures the essential resume data according to our schema requirements.
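As a rough illustration of the metric, here is a minimal word-level implementation; the tokenization details are our assumption, and the repository linked above contains the actual evaluation code:

// Jaccard similarity over word tokens: |A ∩ B| / |A ∪ B|.
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) =>
    new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter((t) => setB.has(t)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Example: comparing an extracted title against ground truth.
console.log(jaccardSimilarity("Senior Software Engineer", "Software Engineer")); // ≈ 0.67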

Results of information extraction accuracy:

Performance Comparison

Execution times for processing 10 resumes (sorted by speed):

After careful consideration, we selected the textractOcrWithBbox solution for our production environment.

We chose AWS Textract with bounding box information for several key reasons:

  1. Balanced performance: It offers an excellent compromise between accuracy (97.5% match rate) and processing speed (80.45 seconds for 10 resumes), requiring only about 5 seconds longer per resume compared to standard textractOcr.
  2. Lower implementation costs: Our serverless architecture runs on AWS Lambda, making AWS Textract integration significantly more cost-effective to implement than building custom OCR pipelines from scratch.

The textractOcrWithBbox approach provides sufficient accuracy for downstream processing while maintaining reasonable resource utilization, making it our preferred solution for production deployment.
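For reference, pulling line-level text and bounding boxes from Textract looks roughly like this with the AWS SDK v3. Textract reports coordinates as 0-1 ratios of page dimensions; this is a sketch, not Noxx's production code:

import { TextractClient, DetectDocumentTextCommand } from "@aws-sdk/client-textract";
import { readFileSync } from "node:fs";

const client = new TextractClient({ region: "us-east-1" }); // assumption: region

async function ocrWithBbox(path: string): Promise<string[]> {
  const { Blocks } = await client.send(
    new DetectDocumentTextCommand({ Document: { Bytes: readFileSync(path) } }),
  );
  // Keep LINE blocks; Left/Top/Width/Height are 0-1 page ratios.
  return (Blocks ?? [])
    .filter((b) => b.BlockType === "LINE" && b.Text && b.Geometry?.BoundingBox)
    .map((b) => {
      const { Left, Top, Width, Height } = b.Geometry!.BoundingBox!;
      return `${b.Text}, ltwh ${Left!.toFixed(3)} ${Top!.toFixed(3)} ${Width!.toFixed(3)} ${Height!.toFixed(3)}`;
    });
}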

Here’s an example of our system in action. On the left, my resume with complex formatting. On the right, our JSON output capturing the structured data:

Note: This resume is my own (the author’s)

The system captures key information like:

  • Job titles (“Machine Learning Research Intern”)
  • Companies (“Noxx”)
  • Detailed descriptions
  • Date ranges (“2024-07-01” to “Present”)

In this way, we extract the necessary information, which then flows into downstream AI workflows that ultimately recommend candidates.

To wrap up, here’s a quick summary of what I’ve covered:

  • 🧭 Preserving layout matters — We attach bounding box (bbox) information to extract content accurately from complex resume layouts.
  • 🪶 Efficiency matters too — To reduce token cost and improve prompt clarity, we format output using a compact ltwh (left, top, width, height) representation.
  • 🧪 We built a CLI tool — This lets us systematically test and compare multiple OCR pipelines with consistent metrics.
  • ✅ textractOcrWithBbox performs well — While some variations in spelling or formatting do occur, it still extracts the essential information with high accuracy.

Evaluating pipeline performance with actual scores — rather than relying on intuition or visual inspection — is essential when building scalable, AI-driven systems. While our sample size of 10 resumes is admittedly small, having some empirical data is vastly better than none at all.

Of course, comprehensive evaluation with larger datasets would be ideal. However, balancing evaluation depth with development time is a practical reality. The key is to assess how critical accuracy is for your specific product and allocate evaluation resources accordingly. As more developers build LLM-powered systems, I believe that robust evaluation strategies will become as fundamental to AI development as test suites are to traditional software engineering — defining the difference between experimental prototypes and production-ready AI applications.

If you’re building LLM workflows, I hope the experiences shared in this article can help you design a more reliable AI system.

Bonus: Raw vs. Clean OCR Output

While Tesseract-based pipelines (like tesseractOcr, tesseractOcrWithBbox, and tesseractOcrWithImage) produced slightly noisier output, they achieved the highest average match rates (up to 97.7%). This suggests that:

If you’re planning to pass the output to an LLM like ChatGPT, it’s often beneficial to retain raw information — even if it contains noise. LLMs are surprisingly good at interpreting loosely structured data, and more raw content often results in better overall understanding.

On the other hand:

If you’re using OCR output directly without further AI-based cleanup, engines like MistralOCR may be more suitable because they produce cleaner text out of the box — though their match rates were slightly lower (~93.6%).

Note: In the figure above, the OCR engines are arranged from original (raw) data on the left to increasingly cleaned output on the right.
