Automating PDF data extraction with AI and Azure

The lessons we learned transforming 8,000+ legacy PDFs into structured data using Azure Document Intelligence.

By Alex Welding, Lead Developer @ Steer73

In the world of digital transformation, not every project starts with shiny new datasets and perfectly structured APIs. Sometimes, you’re handed a stack of 8,000+ PDFs and told: “This is our current data. We need it in the new system.”

This was the case for us recently, when we were creating a new system for a client where the legacy data existed entirely as individual PDFs, each representing a user profile. The goal was to import this data efficiently, securely, and with minimal human intervention.

The experience forced us to rethink how we approach data migration in challenging scenarios, and along the way, we picked up some valuable lessons.

1. Tools are only part of the answer

Our first instinct was to find an off-the-shelf PDF reader that could parse the documents. But we quickly realised these tools often rely on the PDFs being consistent in format. A slight variation, such as an extra field, a changed font, or a swapped layout, would throw off the entire extraction. And coding defensively for every possible variation? That’s not scalable.

Lesson: When the input data is unpredictable, you need intelligence, not rigid logic.
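
To illustrate the brittleness, here is a minimal sketch in Python of a rigid, layout-dependent parser. The field names and sample text are made up for the example and are not the client's actual format; the point is only that a pattern written for one layout rejects small variations of it.

```python
import re
from typing import Optional

# Hypothetical layout: exactly two lines, "Name" then "Email".
# The real profile fields are not described in this article.
RIGID_PATTERN = re.compile(r"^Name: (?P<name>.+)\nEmail: (?P<email>.+)$")


def rigid_parse(text: str) -> Optional[dict]:
    """Return the fields only if the text matches the one expected layout."""
    match = RIGID_PATTERN.match(text)
    return match.groupdict() if match else None


expected = "Name: Ada Lovelace\nEmail: ada@example.com"
extra_field = "Name: Ada Lovelace\nTitle: Analyst\nEmail: ada@example.com"
swapped = "Email: ada@example.com\nName: Ada Lovelace"

for label, text in [
    ("expected", expected),
    ("extra field", extra_field),
    ("swapped", swapped),
]:
    # Only the first layout parses; the other two return None.
    print(label, "->", rigid_parse(text))
```

Each new variation would need its own pattern or special case, which is the "coding defensively" trap described above.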

2. Manual isn't a strategy

We briefly considered manual data entry. At over 8,000 PDFs, that was a non-starter. It would have taken weeks, cost significantly more, and been prone to human error. For us, automation wasn’t just a nice-to-have; it was the only option.

3. Document Intelligence gave us the flexibility we needed

After ruling out manual options and rigid PDF parsing, we explored Azure’s Document Intelligence offering. It gave us a promising path forward: a way to train a model on sample PDFs and extract structured data in a repeatable way. Our initial prototype showed good results, giving us a clear baseline to iterate on.

This decision wasn’t just about functionality; it was also about data security and architecture fit. Since we were already building the platform in .NET Blazor and hosting on Azure, we could ensure all sensitive data stayed within the client’s environment, even during upload and processing. Not even our developers had access to the raw data.
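
Document Intelligence reports a confidence score alongside each extracted field. The article's pipeline is not shown here, so the following Python sketch is an assumption about one way to use those scores: accept high-confidence fields automatically and flag the rest for a human check. The `ExtractedField` type, the field names, and the 0.80 threshold are all hypothetical, and the input is a plain dictionary rather than a real SDK response.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the extraction model


# Assumed cut-off; in practice it would be tuned against a hand-checked sample.
REVIEW_THRESHOLD = 0.80


def split_for_review(fields, threshold=REVIEW_THRESHOLD):
    """Split fields into auto-accepted values and names needing manual review."""
    accepted = {}
    needs_review = []
    for name, field in fields.items():
        if field.confidence >= threshold:
            accepted[name] = field.value
        else:
            needs_review.append(name)
    return accepted, needs_review


# Hypothetical result for one profile PDF.
sample = {
    "full_name": ExtractedField("Ada Lovelace", 0.97),
    "email": ExtractedField("ada@example.com", 0.62),
}
accepted, needs_review = split_for_review(sample)
print(accepted)      # fields safe to import directly
print(needs_review)  # fields to queue for a human check
```

A split like this keeps human intervention limited to the fields the model is unsure about, rather than to whole documents.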

4. Good training data is everything

One of the early challenges we encountered was around training quality. Some sample PDFs performed well; others completely broke the model. It quickly became clear that the quality and consistency of the training data were the single biggest factor in model accuracy.

Lesson: The AI is only as smart as the examples you give it. Invest early in curating diverse, high-quality training data.

5. It's never just one tool

While Document Intelligence handled the text extraction well, it struggled with embedded images, which were critical in this case because each user profile included a photo. We had to supplement our pipeline with an image extraction library to process and attach those separately.

It was a reminder that no one tool does everything. A well-architected solution often involves stitching together multiple technologies, each playing to its strengths.
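
As a rough illustration of that stitching step, the Python sketch below joins two independent outputs, text fields and extracted photos, on the source PDF's filename. The `Profile` type, the filenames, and the dictionary inputs are hypothetical stand-ins; the actual extraction calls and the image library used are not shown in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    source_pdf: str
    fields: dict
    photo: Optional[bytes]  # None when no image was extracted


def build_profiles(fields_by_pdf, photos_by_pdf):
    """Attach each extracted photo to the text fields from the same PDF."""
    profiles = []
    for source_pdf, fields in sorted(fields_by_pdf.items()):
        # A missing photo is recorded as None instead of failing the import.
        photo = photos_by_pdf.get(source_pdf)
        profiles.append(Profile(source_pdf, fields, photo))
    return profiles


# Hypothetical outputs of the two extraction steps.
fields_by_pdf = {
    "profile_0001.pdf": {"full_name": "Ada Lovelace"},
    "profile_0002.pdf": {"full_name": "Alan Turing"},
}
photos_by_pdf = {"profile_0001.pdf": b"\x89PNG..."}

profiles = build_profiles(fields_by_pdf, photos_by_pdf)
missing_photos = [p.source_pdf for p in profiles if p.photo is None]
print(missing_photos)
```

Keeping the two steps separate and joining on a shared key means either tool can be swapped out without touching the other.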

6. Speed, scale, and cost matter

Once the system was in place, we ran the full migration: over 8,000 PDFs processed, parsed, and imported into the new database in under two hours. The total compute cost? Somewhere between €200 and €300.

Considering the alternative, manual data entry over several weeks, this was a win for all involved.
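
The figures above translate into a throughput floor and a per-document cost ceiling. The short Python calculation below uses the rounded numbers from this section (8,000 PDFs, two hours, €200 to €300), so the results are bounds rather than exact measurements: more PDFs or less time only raises the throughput and lowers the cost per PDF.

```python
PDF_COUNT = 8_000            # "over 8,000", so a lower bound
MAX_SECONDS = 2 * 60 * 60    # "under two hours", so an upper bound
COST_RANGE_EUR = (200, 300)  # total compute cost range

# Minimum sustained throughput needed to finish within two hours.
min_throughput = PDF_COUNT / MAX_SECONDS

# Cost per PDF at each end of the quoted cost range.
cost_per_pdf = tuple(cost / PDF_COUNT for cost in COST_RANGE_EUR)

print(f"at least {min_throughput:.2f} PDFs per second")
print(f"about EUR {cost_per_pdf[0]:.4f} to EUR {cost_per_pdf[1]:.4f} per PDF")
# at least 1.11 PDFs per second
# about EUR 0.0250 to EUR 0.0375 per PDF
```

At a few cents per document, the compute bill is small next to weeks of manual data entry.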

Final thoughts

This wasn’t just a technical exercise. It was a case study in problem-solving under constraint. We had to balance security, scalability, flexibility, and cost, all while dealing with less-than-ideal legacy data.

The real takeaway? Modern AI-powered tools like Document Intelligence can be game-changers, but only when combined with thoughtful engineering, clear architecture decisions, and a willingness to iterate.

For agencies or teams facing similar challenges, our experience shows that with the right mindset and tooling, you can turn even the messiest data into meaningful results.
