A global risk engineering firm was sitting on 25 years of institutional knowledge locked inside 2,500+ PDF inspection reports. Each report contained structured risk recommendations that needed to be tracked, compared, and queried across sites and years. Using Knowi’s Document AI, the firm now extracts those recommendations into a searchable analytics database. In a 12-report test, extraction reached 99.5% accuracy and the full batch finished in under an hour.
Quick Summary (TL;DR)
- A global risk engineering firm had 2,500+ PDF inspection reports with structured recommendation data that could not be queried or analyzed at scale.
- Knowi’s Document AI extracted 9 structured fields for every recommendation: 4 report-level fields (company name, site name, survey date, and report author) and 5 recommendation-level fields (reference number, title, description, aim, and remediation status).
- The system handled variable document structures, multi-year status tracking, and long-form 200-page PDFs through intelligent chunking.
- Accuracy reached 99.5% across 12 test documents, verified by manual review.
- Processing time was 4 minutes per document, with a full batch of 12 reports completing in under 50 minutes.
- The extracted data feeds directly into a Knowi dashboard, replacing a manual Excel process that previously required three people working full time.
- The firm connects its OneDrive file share directly to Knowi, with no custom integration or ETL pipeline required.
Table of Contents
- The Use Case: Unlocking Risk Intelligence Trapped in PDFs
- What Knowi Extracted: 9 Fields Per Recommendation, Across Every Document
- How Knowi Handles Document Complexity
- Connecting to the Document Source: OneDrive and SharePoint
- The Results
- What the Recommendation Database Enables
- Is This Use Case Right for Your Team?
- Frequently Asked Questions
The Use Case: Unlocking Risk Intelligence Trapped in PDFs
Risk engineering firms produce detailed site inspection reports for industrial clients in mining, energy, and manufacturing. A single report can run 200+ pages and covers everything from site overview and insured values to detailed risk recommendations. The recommendations section is the most operationally valuable part: each recommendation has a reference number, a title, a description of the risk, an aim, and a status that evolves across annual visits.
The challenge is that this data lives in unstructured PDFs. A firm with a global portfolio might revisit the same site six times over a decade, producing six separate reports where the same recommendation appears with different statuses: new, in progress, completed. Without a way to extract and link that data, the only way to understand remediation progress across a site portfolio is to open each PDF manually.
This firm had accumulated over 2,500 reports spanning 25 years. The backlog since 2021 alone was 800 reports, with 200 new ones added every year. Previously, a team of three people read the reports manually and entered the data into an Excel spreadsheet. That process did not scale.
What Knowi Extracted: 9 Fields Per Recommendation, Across Every Document
Knowi’s Document AI was configured to extract a defined set of fields from every report in the firm’s library. The extraction covers both the report-level metadata and the recommendation-level detail, structured as rows in a queryable dataset.
Report-Level Fields
- Company name: the client organization that owns the inspected site
- Site name: the specific facility or location
- Date of survey: when the site visit occurred
- Report prepared by: the lead engineer on the inspection
Recommendation-Level Fields
- Recommendation number: the unique reference ID within that report
- Recommendation title: a short, action-oriented summary (e.g., “Install fire suppression system in substation 3B”)
- Description: the full body text explaining the risk and the required action
- Aim: the intended outcome of the recommendation
- Status: current remediation state, typically new, in progress, or completed
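To make the row structure concrete, here is a minimal sketch of the nine fields as a Python dataclass. The class name, field names, and sample values are hypothetical and chosen for illustration; they are not Knowi’s actual schema.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class RecommendationRecord:
    """One row per recommendation per report (hypothetical schema)."""

    # Report-level fields, repeated on every row that comes from the same report
    company_name: str
    site_name: str
    survey_date: date
    prepared_by: str
    # Recommendation-level fields
    rec_number: str
    title: str
    description: str
    aim: str
    status: str


# Invented sample values, for illustration only
example = RecommendationRecord(
    company_name="Example Mining Co",
    site_name="Example Site",
    survey_date=date(2023, 5, 1),
    prepared_by="Lead Engineer",
    rec_number="R-04",
    title="Install fire suppression system in substation 3B",
    description="Full text describing the risk and the required action.",
    aim="Intended outcome of the recommendation.",
    status="completed",
)

field_names = [f.name for f in fields(RecommendationRecord)]
print(field_names)
```

Repeating the report-level fields on every recommendation row keeps the dataset flat, so one table can be filtered by site, engineer, or year without joins.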
How Knowi Handles Document Complexity
PDF extraction sounds straightforward until the documents are 200 pages long, inconsistently formatted, and span two decades of different report templates. These reports had several structural challenges that Knowi’s Document AI was built to handle.
Variable Section Numbering
The recommendations section does not always appear at the same section number. Depending on the report version, it might be section 8, 9, or 10. The section title, however, is always identical: “Recommendations for Risk Reduction.” Knowi locates the section by title rather than position, which means formatting variations across report versions do not affect extraction accuracy.
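The idea of anchoring on the title instead of the section number can be sketched with a regular expression. This is an illustration only, not Knowi’s implementation: the function name, the heading patterns, and the sample report text are all assumptions.

```python
import re
from typing import Optional

SECTION_TITLE = "Recommendations for Risk Reduction"

# Matches a heading line such as "8 Recommendations for Risk Reduction"
# or "10. Recommendations for Risk Reduction", whatever the number is.
HEADING = re.compile(
    r"^[ \t]*\d+\.?[ \t]+" + re.escape(SECTION_TITLE) + r"[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed shape of any other top-level heading: a number, then a capitalized title.
NEXT_HEADING = re.compile(r"^[ \t]*\d+\.?[ \t]+[A-Z][^\n]*$", re.MULTILINE)


def find_recommendations_section(text: str) -> Optional[str]:
    """Return the body of the recommendations section, or None if it is absent."""
    match = HEADING.search(text)
    if match is None:
        return None
    start = match.end()
    following = NEXT_HEADING.search(text, start)
    end = following.start() if following else len(text)
    return text[start:end].strip()


# Two invented report layouts with the section at different numbers
older_layout = (
    "7 Loss History\nNo losses.\n"
    "8 Recommendations for Risk Reduction\nR-01 Install sprinklers.\n"
    "9 Appendices\nSite plan.\n"
)
newer_layout = (
    "9. Loss History\nNo losses.\n"
    "10. Recommendations for Risk Reduction\nR-01 Install sprinklers.\n"
    "11. Appendices\n"
)

print(find_recommendations_section(older_layout))
print(find_recommendations_section(newer_layout))
```

A real report would need more care than this sketch takes. A table of contents entry can match the heading pattern first, and recommendation lines that begin with a bare number would be mistaken for the next heading.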
Long-Form PDF Chunking
At 200+ pages, these reports exceed what a standard AI model can process in a single pass. Knowi chunks long documents intelligently, processing each section independently and then assembling the results. This approach eliminates timeouts and ensures that recommendations near the end of a long report are extracted with the same accuracy as those at the front.
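The chunk, process, and assemble pattern described above can be sketched as follows. Knowi’s actual chunking logic is not documented here, so this is a simplified stand-in: fixed-size page chunks, a hypothetical `extract_chunk` callback in place of the model call, and invented page content.

```python
from typing import Callable, Iterator, List


def chunk_pages(pages: List[str], pages_per_chunk: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most pages_per_chunk pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    for start in range(0, len(pages), pages_per_chunk):
        yield pages[start:start + pages_per_chunk]


def extract_document(
    pages: List[str],
    extract_chunk: Callable[[str], List[str]],
    pages_per_chunk: int = 20,
) -> List[str]:
    """Run the extractor on each chunk independently, then assemble the results in order."""
    results: List[str] = []
    for chunk in chunk_pages(pages, pages_per_chunk):
        results.extend(extract_chunk("\n".join(chunk)))
    return results


def fake_extractor(text: str) -> List[str]:
    """Stand-in for the model call: returns lines tagged as recommendations."""
    return [line for line in text.splitlines() if line.startswith("REC:")]


# A synthetic 200-page document with one recommendation near each end
pages = [f"page {n}" for n in range(1, 201)]
pages[4] += "\nREC: early recommendation"
pages[198] += "\nREC: late recommendation"

found = extract_document(pages, fake_extractor, pages_per_chunk=20)
print(found)
```

Fixed-size chunks can split a recommendation across a boundary. A production system would likely cut at section boundaries or overlap adjacent chunks, and the sketch leaves both out.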
Multi-Year Status Tracking
The same recommendation can appear in six consecutive annual reports with evolving statuses. A recommendation made in 2019 might show as “no progress” in 2020, “in progress” in 2021 and 2022, and “completed” in 2023. Knowi extracts each instance with its corresponding status and report date, so analysts can query the full remediation history of any recommendation across all site visits in their database.
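Because each yearly occurrence is stored as its own row, a remediation history can be rebuilt by grouping and sorting. The sketch below assumes the recommendation number stays the same from one annual report to the next, which the source does not confirm. The sample rows are invented to mirror the 2019 to 2023 example.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

# Invented rows: (site name, recommendation number, survey date, status)
rows = [
    ("Example Site", "R-04", date(2021, 6, 1), "in progress"),
    ("Example Site", "R-04", date(2019, 6, 1), "new"),
    ("Example Site", "R-04", date(2023, 6, 1), "completed"),
    ("Example Site", "R-04", date(2020, 6, 1), "no progress"),
    ("Example Site", "R-04", date(2022, 6, 1), "in progress"),
    ("Example Site", "R-07", date(2021, 6, 1), "new"),
]

History = Dict[Tuple[str, str], List[Tuple[date, str]]]


def build_histories(records) -> History:
    """Group rows by (site, recommendation number) and order each group by survey date."""
    histories = defaultdict(list)
    for site, rec_number, survey_date, status in records:
        histories[(site, rec_number)].append((survey_date, status))
    for timeline in histories.values():
        timeline.sort()  # tuples sort by survey date first
    return dict(histories)


histories = build_histories(rows)
statuses = [status for _, status in histories[("Example Site", "R-04")]]
print(statuses)
```

If reference numbers change between report versions, the grouping key would need a different linking rule, such as matching on site plus recommendation title.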
Connecting to the Document Source: OneDrive and SharePoint
The firm stores all completed inspection reports in a central OneDrive directory. Knowi connects directly to OneDrive and SharePoint as native data sources, so there is no manual export, no email attachment workflow, and no intermediate storage layer required. When a new report is uploaded to the firm’s file share, it can be ingested into Knowi without moving it to a separate system.
For the initial backlog of 800 reports, the firm processes documents in batches. Each batch of 10 to 12 reports runs in under an hour. New reports can be added on a rolling basis as they are completed, with each one processed and appended to the central recommendation database automatically.
The Results
After configuring the extraction prompts and testing against a sample set of 12 documents covering three sites across multiple years, the firm achieved 99.5% extraction accuracy across all 9 fields. Of several hundred extracted values, fewer than 10 required correction. The Knowi team validated accuracy by comparing the extracted output against the source PDFs line by line.
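A field-level accuracy figure like this one is the share of extracted values that match a manually verified reference. The sketch below shows that calculation on a tiny invented sample. The function name and the data are hypothetical and do not reproduce the firm’s actual validation set.

```python
from typing import Mapping, Sequence


def field_accuracy(
    extracted: Sequence[Mapping[str, str]],
    reference: Sequence[Mapping[str, str]],
) -> float:
    """Fraction of reference field values that the extraction reproduced exactly."""
    if len(extracted) != len(reference):
        raise ValueError("extracted and reference must contain the same number of rows")
    total = 0
    correct = 0
    for got, want in zip(extracted, reference):
        for field, expected in want.items():
            total += 1
            if got.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0


# Invented sample: 2 rows x 3 fields, with one mismatched status
reference_rows = [
    {"rec_number": "R-01", "title": "Install sprinklers", "status": "new"},
    {"rec_number": "R-02", "title": "Replace cabling", "status": "in progress"},
]
extracted_rows = [
    {"rec_number": "R-01", "title": "Install sprinklers", "status": "new"},
    {"rec_number": "R-02", "title": "Replace cabling", "status": "completed"},
]

accuracy = field_accuracy(extracted_rows, reference_rows)
print(f"{accuracy:.1%}")
```

Exact string comparison is the strictest choice. A real review might first normalize whitespace or letter case so that cosmetic differences are not counted as errors.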
Performance at Scale
- 4 minutes per document for full extraction of all 9 fields
- Under 50 minutes for 12 documents processed in a single batch
- 800-report backlog processable in batches over a short period without manual intervention
- 200 new reports per year can be ingested on a rolling basis
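These figures can be turned into a rough capacity estimate. The arithmetic below assumes documents are processed one after another at a constant 4 minutes each, which is a simplification. Only the four input numbers come from the case study.

```python
import math

MINUTES_PER_DOCUMENT = 4   # reported per-document processing time
BATCH_SIZE = 12            # reports per batch
BACKLOG = 800              # reports accumulated since 2021
NEW_PER_YEAR = 200         # new reports added annually

batch_minutes = MINUTES_PER_DOCUMENT * BATCH_SIZE        # 48 minutes per batch
backlog_batches = math.ceil(BACKLOG / BATCH_SIZE)        # 67 batches
backlog_hours = BACKLOG * MINUTES_PER_DOCUMENT / 60      # about 53.3 hours
yearly_hours = NEW_PER_YEAR * MINUTES_PER_DOCUMENT / 60  # about 13.3 hours

print(f"One batch: {batch_minutes} minutes")
print(f"Backlog: {backlog_batches} batches, about {backlog_hours:.1f} hours of processing")
print(f"Ongoing: about {yearly_hours:.1f} hours of processing per year")
```

Under these assumptions, the backlog needs a little over two days of cumulative machine time, and each year’s new reports add just over 13 hours.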
Before and After
| Workflow Step | Before Knowi | With Knowi Document AI |
| --- | --- | --- |
| Data extraction method | Three staff members reading PDFs and typing into Excel | Automated extraction via Document AI prompt |
| Fields captured per recommendation | Inconsistent, dependent on the person doing data entry | 9 structured fields extracted consistently from every report |
| Processing time per report | Hours of manual reading and entry | 4 minutes per document, unattended |
| Multi-year status tracking | Manual cross-referencing of separate Excel files | All years and statuses in one queryable dataset |
| Accuracy | Human-dependent, no systematic QA | 99.5% verified accuracy across 12 test documents |
| Document access | Individual PDFs opened manually on file share | Native OneDrive/SharePoint connector, no file movement needed |
| Querying the data | Not possible without opening each PDF | Full SQL-level querying and dashboard analytics in Knowi |
What the Recommendation Database Enables
Once the extraction is complete, the recommendation data lives in Knowi as a standard dataset. Analysts can query it the same way they would query any structured data source: filter by site, filter by status, group by engineer, sort by year, or search across all 2,500+ reports for every instance of a specific risk type. Reports that previously required a human to open and read are now queryable in seconds.
The firm can now answer questions that were previously impossible to answer without days of manual research:
- Which sites have the highest number of open recommendations older than three years?
- Which engineers have the highest remediation follow-through rate across their site portfolio?
- What is the average time from a fire safety recommendation to completion, by industry sector?
- Which clients are making the least progress on recommendations over time?
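The first question above can be expressed as an ordinary SQL query once the data is tabular. The sketch below uses Python’s built-in `sqlite3` module and an invented `recommendations` table purely as a stand-in. It is not Knowi’s query syntax, and it assumes recommendation numbers are stable across a site’s reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recommendations ("
    "site_name TEXT, rec_number TEXT, survey_date TEXT, status TEXT)"
)
# Invented sample rows; dates are ISO strings so they compare correctly as text
conn.executemany(
    "INSERT INTO recommendations VALUES (?, ?, ?, ?)",
    [
        ("Site A", "R1", "2019-06-01", "new"),
        ("Site A", "R1", "2023-06-01", "in progress"),
        ("Site A", "R2", "2020-06-01", "new"),
        ("Site A", "R2", "2023-06-01", "in progress"),
        ("Site A", "R3", "2019-06-01", "new"),
        ("Site A", "R3", "2022-06-01", "completed"),
        ("Site B", "R1", "2020-06-01", "new"),
        ("Site B", "R1", "2023-06-01", "no progress"),
        ("Site B", "R2", "2023-06-01", "new"),
    ],
)

# Open = latest status is not 'completed'; old = first seen on or before the cutoff
QUERY = """
WITH span AS (
    SELECT site_name, rec_number,
           MIN(survey_date) AS first_seen,
           MAX(survey_date) AS last_seen
    FROM recommendations
    GROUP BY site_name, rec_number
)
SELECT s.site_name, COUNT(*) AS open_count
FROM span AS s
JOIN recommendations AS r
  ON r.site_name = s.site_name
 AND r.rec_number = s.rec_number
 AND r.survey_date = s.last_seen
WHERE r.status <> 'completed'
  AND s.first_seen <= :cutoff
GROUP BY s.site_name
ORDER BY open_count DESC, s.site_name
"""

# Cutoff is three years before an assumed reporting date of 2024-01-01
open_by_site = conn.execute(QUERY, {"cutoff": "2021-01-01"}).fetchall()
print(open_by_site)
```

The query keeps one row per recommendation by joining back to its most recent survey date, so a recommendation that was open in earlier reports but later completed is not counted.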
Is This Use Case Right for Your Team?
This approach works for any organization where operational data lives in consistently structured documents rather than databases. The pattern is the same across industries: the documents have a predictable format, the data fields are defined, and the volume makes manual extraction impractical. Common fits include compliance audit reports, engineering inspection logs, medical assessments, legal review summaries, and regulatory filings.
The key requirement is that the documents have a consistent structure. When the same section always appears under the same heading, and the same fields always appear in the same positions within that section, Knowi’s Document AI can be configured to extract them reliably at scale.
Frequently Asked Questions
How accurate is Knowi’s Document AI for extracting data from PDFs?
In this implementation, Knowi achieved 99.5% extraction accuracy across all nine fields, verified through manual line-by-line comparison against the source PDFs. Out of several hundred extracted values across 12 test documents, fewer than 10 required correction.
Can Knowi handle PDF reports that are 200+ pages long?
Yes. Knowi processes long documents through intelligent chunking, splitting each report into sections, processing them independently, and assembling the results. This eliminates timeout errors and ensures recommendations near the end of a report are extracted with the same accuracy as those at the beginning.
What happens if inspection reports use different section numbering across versions?
Knowi locates sections by title rather than by position. If the recommendations section appears as Section 8 in one report and Section 10 in another, Knowi still identifies it correctly. Formatting and numbering changes across report versions do not affect extraction accuracy.
Does Knowi require a custom ETL pipeline to connect to OneDrive or SharePoint?
No. Knowi connects to OneDrive and SharePoint as native data sources. There is no manual export process, no email attachment workflow, and no intermediate storage layer. When a new report is uploaded to the firm’s file repository, Knowi can automatically detect, process, and extract the required information.
How does Knowi track the same recommendation across multiple annual reports?
Each extraction captures the recommendation along with its report date and status. Because the same recommendation can appear across multiple annual visits with evolving statuses (such as new, in progress, or completed), Knowi stores each occurrence as a separate record. Analysts can then query the complete remediation history of any recommendation across all site visits within a single dataset.
What other document types work with Knowi’s Document AI?
Any document set with a consistent structure is a strong fit, including:
- Compliance audit reports
- Engineering inspection logs
- Medical assessments
- Legal review summaries
- Regulatory filings
The key requirement is that the same fields appear under consistent headings across documents. When that pattern exists, Knowi can be configured to extract information reliably and at scale.