Parameter Extraction from Engineering Documents

Our latest work on AI-enhanced interfaces in the Virtual Sea Trials project is a PDF information extraction tool, published as an open-source Python package (Novia-RDI-Seafaring/MERI: Modality-Aware Extraction and Retrieval of Information (github.com)). Given a PDF document such as a datasheet, the tool uses AI models to extract a predefined set of parameters of interest.

Use Case

PDF documents are widely used for preserving and sharing information, especially for manufactured components, whose specifications are commonly presented in datasheets. These documents are highly unstructured, and their layouts and styles vary with the domain and manufacturer, which makes them difficult to parse with traditional rule-based methods. Automating parameter extraction from such documents using newly available AI models is therefore a challenging but impactful task. Our work has focused on extracting the parameters needed to initialize simulation models from the datasheets of the components used. Currently, engineers go through the datasheets by hand to find the needed parameters, so our tool could save valuable time by helping them extract the required parameters faster.

Method and Technologies

We propose MERI (Modality-Aware Extraction and Retrieval of Information), a modular and flexible method for parameter extraction from complex documents. The approach consists of three steps: (1) detection of layout elements, such as tables and figures; (2) conversion of the detected elements into a machine-readable intermediate text format; and (3) parameter extraction from the intermediate text format using LLMs.
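The three steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not MERI's actual API: the `LayoutElement` class, all function names, the example datasheet content, and the stand-in `fake_llm` are assumptions made for this sketch. Step 1 (layout detection) is represented by a hand-written list of elements, since a real detector needs a trained model.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class LayoutElement:
    """A detected page region (hypothetical structure, not MERI's own)."""
    kind: str        # "paragraph", "table", or "figure"
    content: object  # text for paragraphs/figures, list of rows for tables


def table_to_markdown(rows: list[list[str]]) -> str:
    """Render a table whose first row is the header as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def element_to_markdown(element: LayoutElement) -> str:
    """Step 2: pick a converter based on the element's modality."""
    if element.kind == "table":
        return table_to_markdown(element.content)
    if element.kind == "figure":
        return f"*Figure: {element.content}*"
    return str(element.content)


def build_intermediate(elements: list[LayoutElement]) -> str:
    """Join all converted elements into one Markdown document."""
    return "\n\n".join(element_to_markdown(e) for e in elements)


def extract_parameters(markdown: str, schema: dict,
                       llm: Callable[[str], str]) -> dict:
    """Step 3: prompt an LLM with schema and Markdown, parse its JSON reply."""
    prompt = (
        "Extract the parameters described by this JSON schema:\n"
        + json.dumps(schema)
        + "\n\nDocument:\n"
        + markdown
        + "\n\nAnswer with a single JSON object."
    )
    return json.loads(llm(prompt))


# Step 1 stand-in: elements a layout detector might return (made-up data).
elements = [
    LayoutElement("paragraph", "Example pump datasheet"),
    LayoutElement("table", [["Parameter", "Value"], ["Rated power", "5 kW"]]),
]


def fake_llm(prompt: str) -> str:
    """Placeholder for a real model call; always returns a fixed answer."""
    return '{"rated_power_kw": 5}'


markdown = build_intermediate(elements)
parameters = extract_parameters(
    markdown, {"properties": {"rated_power_kw": {"type": "number"}}}, fake_llm
)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider, in the same modular spirit as the method described above.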

The proposed method requires two inputs: (1) the PDF document and (2) a JSON schema describing the parameters to extract. The layout detection modules search the PDF for elements such as paragraphs, tables, and figures, and call dedicated information extraction tools that convert each element into an intermediate Markdown format. Finally, an LLM uses the intermediate representation of the whole datasheet to try to generate a JSON document containing the required parameters, as specified in the JSON schema.
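To make the second input concrete, the sketch below shows what a small JSON schema for a datasheet might look like, together with a minimal check of an LLM reply against it. The parameter names, the example reply, and the `check_against_schema` helper are assumptions for illustration and are not taken from the MERI package. The helper covers only required keys and basic types for a flat schema; a full JSON Schema validator would do considerably more.

```python
import json

# Hypothetical schema: which parameters to extract and their expected types.
PARAMETER_SCHEMA = {
    "type": "object",
    "properties": {
        "rated_power_kw": {"type": "number", "description": "Rated power in kW"},
        "rated_speed_rpm": {"type": "number", "description": "Rated speed in rpm"},
        "manufacturer": {"type": "string"},
    },
    "required": ["rated_power_kw", "rated_speed_rpm"],
}

# Mapping from JSON Schema type names to Python types (flat schemas only).
_PYTHON_TYPES = {
    "number": (int, float),
    "integer": int,
    "string": str,
    "boolean": bool,
}


def check_against_schema(document: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in document:
            problems.append(f"missing required parameter: {name}")
    for name, value in document.items():
        spec = properties.get(name)
        if spec is None:
            problems.append(f"unexpected parameter: {name}")
            continue
        type_name = spec.get("type")
        expected = _PYTHON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric types.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if expected and (is_stray_bool or not isinstance(value, expected)):
            problems.append(f"wrong type for {name}: expected {type_name}")
    return problems


# A made-up LLM reply that omits one required value and mistypes another.
llm_reply = '{"rated_power_kw": 5.5, "manufacturer": 42}'
problems = check_against_schema(json.loads(llm_reply), PARAMETER_SCHEMA)
```

Because the LLM only tries to produce a conforming document, a check of this kind is one way to catch missing or mistyped parameters before they reach a simulation model.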

Demo

Authors
Christian Möller, Johan Westö