Main Challenges Encountered in the Project

The project faced several challenges, including:

1. Data Collection and Preprocessing from Test Benches

The test data comes from various sources and formats, often raw or incomplete. Cleaning, structuring, and processing it to ensure reliability required careful and time-consuming work.

2. Complex Document Structure

The documents are not standardized: they mix free text, embedded images, multi-column tables, and custom formatting.

Extracting clean, structured data from such unstructured formats is technically demanding.

3. Accurate Table Extraction

Tables often span multiple pages or have nested structures.

Preserving the logical relationships (headers, merged cells, footnotes) during extraction is very difficult.
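One part of this problem, a table that continues across pages, can be sketched in a few lines. The example below is only an illustration under stated assumptions: each page has already been turned into a list of rows, the first row of the first fragment is the header, and continuation pages may repeat that header verbatim. The function name and the sample values are hypothetical, and merged cells or footnotes are not handled.

```python
from typing import List

Row = List[str]
Table = List[Row]


def merge_page_fragments(fragments: List[Table]) -> Table:
    """Stitch per-page table fragments into a single table.

    Assumption: the first row of the first fragment is the header, and
    later fragments may repeat that header row verbatim.
    """
    if not fragments or not fragments[0]:
        return []
    header = fragments[0][0]
    merged: Table = [header]
    for fragment in fragments:
        for row in fragment:
            if row == header:
                continue  # skip the header and its repeats on later pages
            if len(row) != len(header):
                # A width mismatch usually means the extraction went wrong;
                # fail loudly instead of silently shifting columns.
                raise ValueError(f"row {row!r} does not match header width")
            merged.append(row)
    return merged


# Made-up sample data, for illustration only.
page_1 = [["Signal", "Min", "Max"], ["TPS", "0.5", "4.5"]]
page_2 = [["Signal", "Min", "Max"], ["RPM", "0", "7000"]]
merged_table = merge_page_fragments([page_1, page_2])
```

Raising on a width mismatch is a deliberate choice in this sketch: a misaligned row is easier to fix at extraction time than after it has been merged into a larger table.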

4. Handling Images and Diagrams

Diagrams or screenshots may contain crucial information (e.g., component configurations, test results).

Interpreting them may require OCR (Optical Character Recognition) or image classification.

5. Decoding Abbreviations and Technical Jargon

Documents are full of domain-specific abbreviations (e.g., "TPS", "CAN", "ECU") which aren’t always explained.

Building a dictionary or model to expand and understand these abbreviations is essential for correct data interpretation.
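The dictionary approach can be sketched as a simple glossary lookup. This is a minimal illustration, not the project's implementation: the glossary contents are assumptions (an abbreviation such as "TPS" can mean different things depending on the supplier), and the function name is hypothetical. Each abbreviation is expanded on its first occurrence only, and matching is case-sensitive so that the word "can" is not confused with "CAN".

```python
import re
from typing import Dict

# Hypothetical glossary: the expansions below are assumptions and would
# need to be confirmed by domain experts for each document family.
GLOSSARY: Dict[str, str] = {
    "CAN": "Controller Area Network",
    "ECU": "Electronic Control Unit",
    "TPS": "Throttle Position Sensor",
}


def expand_abbreviations(text: str, glossary: Dict[str, str]) -> str:
    """Append the expansion after the first occurrence of each abbreviation."""
    if not glossary:
        return text
    # Longest keys first so that a longer abbreviation wins over a prefix.
    keys = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    seen = set()

    def replace(match: "re.Match[str]") -> str:
        abbreviation = match.group(1)
        if abbreviation in seen:
            return abbreviation  # already expanded earlier in the text
        seen.add(abbreviation)
        return f"{abbreviation} ({glossary[abbreviation]})"

    return pattern.sub(replace, text)


expanded = expand_abbreviations("The ECU reads the TPS. The ECU logs it.", GLOSSARY)
```

A plain lookup cannot resolve abbreviations whose meaning depends on context; that case would need the model-based approach mentioned above.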

6. Requirement Extraction

Requirements are written in natural language, often buried within large paragraphs.

Extracting and tagging them automatically (e.g., with NLP) is a challenge due to variability in wording.
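A rule-based baseline gives a feel for the task and for why wording variability is a problem. The sketch below assumes that requirement sentences contain one of the modal verbs "shall", "must", or "should"; that convention is an assumption, not something every document follows, and the sentence splitter is deliberately naive. All names and sample sentences are hypothetical.

```python
import re
from typing import List

# Assumption: requirements are signalled by these modal verbs.
MODAL_PATTERN = re.compile(r"\b(shall|must|should)\b", re.IGNORECASE)
# Naive splitter: break after ., ! or ? when followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_requirements(paragraph: str) -> List[str]:
    """Return the sentences of a paragraph that look like requirements."""
    sentences = [s.strip() for s in SENTENCE_SPLIT.split(paragraph) if s.strip()]
    return [s for s in sentences if MODAL_PATTERN.search(s)]


sample = (
    "The bench was installed in 2020. "
    "The ECU shall report faults within 100 ms. "
    "Operators should log each run."
)
requirements = extract_requirements(sample)
```

A requirement phrased without a modal verb ("The sensor is required to ...") slips through this filter, which is exactly the variability that makes a learned NLP model attractive.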

7. Semantic Understanding

Extracting data is not enough: the context must also be understood (e.g., is a requirement mandatory? which component does it refer to?).

This involves deeper NLP techniques and potentially named entity recognition (NER).
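The two example questions (is it mandatory, and which component is concerned) can be approximated with simple rules before a real NER model is introduced. The sketch below is a stand-in, not NER: it maps modal verbs to obligation levels using a common drafting convention (an assumption here) and looks components up in a fixed, hypothetical list.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed convention: "shall"/"must" are binding, "should" is a
# recommendation, "may" is a permission.
OBLIGATION_BY_MODAL = {
    "shall": "mandatory",
    "must": "mandatory",
    "should": "recommended",
    "may": "optional",
}
# Hypothetical component list standing in for a trained NER model.
KNOWN_COMPONENTS = ["ECU", "TPS", "CAN"]


@dataclass
class TaggedRequirement:
    text: str
    obligation: Optional[str]  # None when no modal verb is found
    components: List[str]


def tag_requirement(sentence: str) -> TaggedRequirement:
    """Attach an obligation level and the mentioned components to a sentence."""
    modal = re.search(r"\b(shall|must|should|may)\b", sentence, re.IGNORECASE)
    obligation = OBLIGATION_BY_MODAL[modal.group(1).lower()] if modal else None
    components = [
        name for name in KNOWN_COMPONENTS
        if re.search(rf"\b{re.escape(name)}\b", sentence)
    ]
    return TaggedRequirement(sentence, obligation, components)


tagged = tag_requirement("The ECU shall sample the TPS every 10 ms.")
```

A fixed list only finds components it already knows; recognizing unseen component names is where a trained NER model would take over.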

8. Document Variability

Each document may follow a different structure or layout depending on the supplier, test phase, or department.

This makes it hard to build a universal extraction logic.

9. Tool Limitations

PDF parsing libraries (like pdfplumber, pdfminer, PyMuPDF, etc.) often fail on complex layouts or miss relationships between elements.

Combining several tools, or using AI-based extraction (such as layout-aware transformers), may be necessary.
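One way to combine tools is a fallback chain: try each extractor in turn and keep the first result that passes a plausibility check. The sketch below is library-agnostic on purpose. The extractors are passed in as plain callables, and the three stub extractors at the bottom are hypothetical placeholders for real wrappers; the plausibility rule (at least two rows, all of the same width) is an assumption as well.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Table = List[List[str]]
Extractor = Callable[[str], List[Table]]


def is_plausible(table: Table) -> bool:
    """Accept a table with at least two rows that all have the same width."""
    return len(table) >= 2 and len({len(row) for row in table}) == 1


def extract_with_fallback(
    path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[Table]]:
    """Return (extractor name, tables) for the first extractor that works."""
    for name, extractor in extractors:
        try:
            tables = extractor(path)
        except Exception:
            continue  # this tool failed on the file; try the next one
        plausible = [table for table in tables if is_plausible(table)]
        if plausible:
            return name, plausible
    return None, []


# Hypothetical stub extractors standing in for real library wrappers.
def failing_extractor(path: str) -> List[Table]:
    raise RuntimeError("could not parse layout")


def ragged_extractor(path: str) -> List[Table]:
    return [[["Signal", "Max"], ["TPS"]]]  # second row lost a cell


def working_extractor(path: str) -> List[Table]:
    return [[["Signal", "Max"], ["TPS", "4.5"]]]


winner, tables = extract_with_fallback(
    "report.pdf",
    [("first", failing_extractor), ("second", ragged_extractor), ("third", working_extractor)],
)
```

In practice each callable would wrap one library, for example a function that opens the file with pdfplumber and collects `page.extract_tables()` for every page.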

10. Multi-format Output

Extracted tables and data must be exported to several formats: .md, .csv, .xlsx, .docx, etc.

Keeping content and formatting consistent across these formats is a real implementation challenge.
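A common way to keep the outputs consistent is to hold each table in one intermediate representation and derive every format from it. The sketch below assumes a table is a list of rows of strings with the header first, and covers only .csv and .md using the standard library; the function names are hypothetical, and .xlsx or .docx would need third-party libraries.

```python
import csv
import io
from typing import List

Table = List[List[str]]


def to_csv(table: Table) -> str:
    """Serialize the table as CSV; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


def to_markdown(table: Table) -> str:
    """Serialize the table as a pipe table, treating the first row as header."""
    def format_row(row: List[str]) -> str:
        # Escape pipes so a cell cannot break the column structure.
        cells = [cell.replace("|", "\\|") for cell in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = table
    lines = [format_row(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(format_row(row) for row in body)
    return "\n".join(lines) + "\n"


# Made-up sample table, for illustration only.
sample_table = [["Signal", "Unit"], ["TPS", "V"]]
csv_text = to_csv(sample_table)
markdown_text = to_markdown(sample_table)
```

Because both writers read the same in-memory table, a fix to the extracted data shows up in every output format at once.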

11. Evaluation of Extraction Quality

Metrics and benchmarks are needed to measure extraction quality (e.g., table accuracy, requirement completeness).
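As one possible definition of table accuracy, extracted cells can be compared with a hand-checked reference table, position by position. The sketch below computes cell-level precision, recall, and F1 under that assumption; it is one candidate metric among many, the names are hypothetical, and the sample values are made up.

```python
from typing import Dict, List, Set, Tuple

Table = List[List[str]]


def cell_set(table: Table) -> Set[Tuple[int, int, str]]:
    """Represent a table as a set of (row, column, text) triples."""
    return {
        (r, c, cell.strip())
        for r, row in enumerate(table)
        for c, cell in enumerate(row)
    }


def table_scores(extracted: Table, reference: Table) -> Dict[str, float]:
    """Cell-level precision, recall and F1 against a reference table."""
    got, want = cell_set(extracted), cell_set(reference)
    matched = len(got & want)
    precision = matched / len(got) if got else 0.0
    recall = matched / len(want) if want else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: one cell was misread ("4,5" instead of "4.5").
reference_table = [["Signal", "Max"], ["TPS", "4.5"]]
extracted_table = [["Signal", "Max"], ["TPS", "4,5"]]
scores = table_scores(extracted_table, reference_table)
```

This metric is strict about position: a table shifted by one column scores close to zero, which matches how harmful such a shift is downstream.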

Data Labeling and Preparation

Challenge:

The available data (PDF reports, Excel spreadsheets, industrial standards, etc.) must be properly labeled and structured before large language models (LLMs) can use it effectively.

For example, tables extracted via OCR are sometimes misinterpreted (misaligned columns, missing information), which can distort downstream results.

Impact:

Poorly labeled data leads to inaccurate model outputs and undermines trust in the platform. A misaligned table, for example, can attach a value to the wrong column and still be read as a valid measurement.

The consequences are wasted time and resources and, in the automotive domain, potential safety risks.

Solution:

Use tools better suited to structural extraction, such as Docling for tables, and frameworks such as Hugging Face Transformers to improve embedding quality.

The data labeling process involves several steps:

1. Data Collection: Gather all relevant data sources, including PDF reports, Excel spreadsheets, and industrial standards.

2. Data Preprocessing: Clean and preprocess the data to ensure it is in a usable format.

3. Data Labeling: Use a combination of manual and automated techniques to label the data accurately.

4. Data Validation: Validate the labeled data to ensure its accuracy and reliability.

5. Data Storage: Store the labeled data in a structured format for easy access and retrieval.
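The validation step (step 4) can start with cheap structural checks that catch the OCR problems described earlier, namely misaligned columns and missing information. The sketch below assumes a table is a list of rows with the header first; the function name, the issue codes, and the sample values are hypothetical.

```python
from typing import List, Optional, Tuple

Table = List[List[Optional[str]]]


def find_table_issues(table: Table) -> List[Tuple[int, str]]:
    """Return (row index, issue code) pairs for suspicious rows.

    Issue codes: "width" for a row whose cell count differs from the
    header, "empty" for a row that contains a missing or blank cell.
    """
    if not table:
        return []
    expected_width = len(table[0])
    issues: List[Tuple[int, str]] = []
    for index, row in enumerate(table):
        if len(row) != expected_width:
            issues.append((index, "width"))
        if any(cell is None or not str(cell).strip() for cell in row):
            issues.append((index, "empty"))
    return issues


# Made-up sample with one misaligned row and one missing value.
ocr_table = [
    ["Signal", "Min", "Max"],
    ["TPS", "0.5", "4.5"],
    ["RPM", "7000"],
    ["MAP", "", "250"],
]
issues = find_table_issues(ocr_table)
```

Rows flagged this way can be routed to manual review, which fits the mix of manual and automated labeling described in step 3.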

Model Performance and Comparison

Challenge:

Comparing, evaluating, and selecting the most suitable LLM (GPT-4, Claude, Llama-2/3, etc.) is challenging, as each has strengths and weaknesses:

GPT-4: Excellent in language understanding but expensive.

Claude: Good for conversational tasks but less effective in technical domains.

Llama-2/3: Open-source and customizable but may lack the robustness of commercial models.

Docling: Not an LLM but a document-conversion tool evaluated alongside them; reliable for table extraction but limited for broader tasks.

Impact:

Selecting the wrong model could limit the platform’s capabilities or result in inaccurate outputs.

Solution:

Implement rigorous benchmarking based on multiple metrics: accuracy, processing time, and perplexity where the model exposes token probabilities.

Test with tools such as Hugging Face Evaluate or OpenAI Evals to compare model performance on the project's specific use case.
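Before adopting a full evaluation framework, the comparison can be prototyped with a small harness. The sketch below assumes each model is wrapped in a callable that takes a prompt and returns a string, and it scores exact-match accuracy and average processing time; the stub models and the test cases are hypothetical placeholders, not real results.

```python
import time
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]


def benchmark(
    models: Dict[str, Model], cases: List[Tuple[str, str]]
) -> Dict[str, Dict[str, float]]:
    """Score each model on exact-match accuracy and seconds per case."""
    if not cases:
        raise ValueError("at least one test case is required")
    results: Dict[str, Dict[str, float]] = {}
    for name, ask in models.items():
        correct = 0
        start = time.perf_counter()
        for prompt, expected in cases:
            answer = ask(prompt)
            # Case-insensitive exact match; real tasks may need a softer metric.
            if answer.strip().lower() == expected.strip().lower():
                correct += 1
        elapsed = time.perf_counter() - start
        results[name] = {
            "accuracy": correct / len(cases),
            "seconds_per_case": elapsed / len(cases),
        }
    return results


# Hypothetical stand-ins for real model clients.
ANSWERS = {"Expand ECU.": "Electronic Control Unit"}


def lookup_model(prompt: str) -> str:
    return ANSWERS.get(prompt, "unknown")


def always_unknown_model(prompt: str) -> str:
    return "unknown"


report = benchmark(
    {"lookup": lookup_model, "baseline": always_unknown_model},
    [("Expand ECU.", "electronic control unit")],
)
```

Running every candidate on the same cases, drawn from the project's own documents, keeps the comparison fair; cost per request could be added as a further column.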

Project Summary - Impact and Solutions

Challenges Summary: Impact and Key Solutions

Challenge	Impact	Key Solutions
Data Collection and Preprocessing	Raw or incomplete data can lead to unreliable results.	Careful cleaning, structuring, and processing of data.
Complex Document Structure	Difficulty extracting clean, structured data from unstandardized formats.	Use advanced document layout analysis techniques.
Accurate Table Extraction	High likelihood of misinterpretation of nested tables and relationships.	Implement robust algorithms to preserve logical relationships during extraction.
Handling Images and Diagrams	Loss of crucial information found in visual content.	Apply OCR and image classification techniques to extract data from visuals.
Decoding Abbreviations and Technical Jargon	Inaccurate data interpretation due to unrecognized abbreviations.	Build a dictionary/model to interpret domain-specific abbreviations.
Requirement Extraction	Natural language variability complicates automated extraction.	Use NLP techniques to enhance extraction and tagging accuracy.
Semantic Understanding	Lack of context understanding can lead to incorrect application of requirements.	Employ deeper NLP techniques and named entity recognition (NER).
Document Variability	Difficulty in establishing a universal extraction logic due to diverse layouts.	Develop adaptable extraction frameworks to accommodate various formats.
Tool Limitations	Existing libraries may fail on complex document layouts.	Combine multiple tools and explore AI-based extraction methods.
Multi-format Output	Inconsistent formatting across output formats may cause data integrity issues.	Ensure standardized formatting protocols for various output types.
Evaluation of Extraction Quality	Inability to measure success accurately could lead to undetected issues.	Design robust metrics and benchmarks for effective evaluation.