
📄 PDF Data Extraction and Structuring Tool (Custom Scripts)

🧩 Overview

This Python tool automates content extraction from complex PDFs so the output can feed:

- Large language models (LLMs) for summarization and question answering (QA)

- RAG pipelines for document search, semantic retrieval, and knowledge base ingestion

🌍 Primary Use Case

Objective: Extract structured data for NLP/AI systems:

- LLMs (ChatGPT, LLaMA, Mistral)

- Vector stores (FAISS, ChromaDB)

- RAG pipelines (LangChain, LlamaIndex)

🔁 Workflow Summary

[Large PDF]
   ↓
[Split into Pages]
   ↓
[Extract Text, Tables, Images]
   ↓
[Save as .txt + .md]
   ↓
[LLMs/Vector Stores/RAG]

📦 Key Features

| Feature | Description |
| --- | --- |
| PDF Splitting | Breaks the document into single-page PDFs |
| Text Extraction | Extracts clean text using pdfplumber |
| Table Extraction | Pulls and summarizes tables |
| Image Extraction | Crops images and suggests captions using NLP |
| Markdown + Text + Excel Generation | Writes structured `.md`, `.txt`, `.csv`, and `.xlsx` files |

⚙️ Prerequisites

`pip install PyPDF2 pdfplumber spacy matplotlib pillow pandas`

`python -m spacy download fr_core_news_sm`
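Before running the pipeline, it can help to confirm that the packages are importable. The snippet below is a minimal sketch using only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names, not part of the script. Note that the import names differ from the pip names in one case (pillow is imported as `PIL`), and pandas is listed because the script imports it.

```python
import importlib.util

# Import names, not pip names: pillow is imported as PIL.
REQUIRED_MODULES = ["PyPDF2", "pdfplumber", "spacy", "matplotlib", "PIL", "pandas"]


def missing_modules(names):
    """Return the module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only checks the Python packages; the spaCy model `fr_core_news_sm` still has to be downloaded separately.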

🔍 Function-by-Function Explanation

Key Libraries Used

PyPDF2: Splits PDFs and writes the new single-page files.

pdfplumber: Extracts text, tables, and images.

spaCy: Segments the text found under an image into sentences for its caption.

Pillow (PIL): Handles image cropping and file storage.

Matplotlib: Optional image visualization.

import os
import shutil
import pdfplumber
from PyPDF2 import PdfReader, PdfWriter
import spacy
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd



# Create directories for storing extracted data
OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
os.makedirs(OUTPUT_DIR, exist_ok=True)
IMAGES_DIR = "extracted_images"
os.makedirs(IMAGES_DIR, exist_ok=True)
TEXT_DIR = "/teamspace/studios/this_studio/extracted_text"
os.makedirs(TEXT_DIR, exist_ok=True)
README_DIR = "/teamspace/studios/this_studio/extracted_readme"
os.makedirs(README_DIR, exist_ok=True)

split_pdf(input_pdf_path)

def split_pdf(input_pdf_path):
    pdf_reader = PdfReader(input_pdf_path)
    base = os.path.splitext(os.path.basename(input_pdf_path))[0]
    output_files = []
    for i, page in enumerate(pdf_reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        file_path = os.path.join(OUTPUT_DIR, f"{base}_page_{i+1}.pdf")
        with open(file_path, "wb") as f_out:
            writer.write(f_out)
        output_files.append(file_path)
    return output_files

🔧 Purpose: Splits a multi-page PDF into individual one-page files.

Input: Path to input PDF

Output: List of file paths (1 PDF per page)

Use Case: Enables page-by-page extraction, which keeps each step small and manageable in a pipeline
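The list returned by `split_pdf` is already in page order, but if the page files are later listed from disk and sorted as plain strings, `_page_10` sorts before `_page_2`. The sketch below shows one way to sort by the page number embedded in the `<base>_page_<n>.pdf` naming scheme; `page_number` and `sort_by_page` are hypothetical helpers, not part of the script.

```python
import os
import re

# Matches the "<base>_page_<n>.pdf" names produced by split_pdf.
PAGE_PATTERN = re.compile(r"_page_(\d+)\.pdf$")


def page_number(path):
    """Return the page number encoded in a '<base>_page_<n>.pdf' filename."""
    match = PAGE_PATTERN.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"Not a split-page filename: {path}")
    return int(match.group(1))


def sort_by_page(paths):
    """Sort split-page paths numerically, so page 2 comes before page 10."""
    return sorted(paths, key=page_number)


files = ["report_page_10.pdf", "report_page_2.pdf", "report_page_1.pdf"]
print(sort_by_page(files))
# ['report_page_1.pdf', 'report_page_2.pdf', 'report_page_10.pdf']
```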

summarize_table(table)

def summarize_table(table):
    """Simple summary: number of rows, number of columns, header row."""
    if not table or not any(table):
        return "Tableau vide"  # "Empty table"
    num_rows = len(table)
    num_cols = len(table[0])
    head = table[0]
    # "Table with {num_rows} rows and {num_cols} columns. Header: {head}."
    description = f"Tableau de {num_rows} lignes et {num_cols} colonnes. En-tête : {head}."
    return description

🔧 Purpose: Generates a summary for each table.

Input: One table (a list of rows) as returned by pdfplumber's extract_tables()

Output: A short human-readable description of the table (rows, columns, headers)

Use Case: Allows LLMs to understand the structure of data tables (useful for summarization or QA)
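Alongside the one-line summary, the table itself can be written to the `.md` output. The following is a minimal sketch of rendering a list-of-rows table as a Markdown pipe table; `table_to_markdown` is a hypothetical helper and not the notebook's implementation. It assumes the first row is the header and that empty cells may appear as `None`.

```python
def table_to_markdown(table):
    """Render a list-of-rows table as a Markdown pipe table.

    The first non-empty row is treated as the header. None cells become
    empty strings, and short rows are padded to the widest row.
    """
    rows = [row for row in (table or []) if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Collapse newlines and escape pipes so each cell stays on one line.
        text = "" if cell is None else str(cell)
        return " ".join(text.split()).replace("|", "\\|")

    def render(row):
        cells = [clean(cell) for cell in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(render(row) for row in rows[1:])
    return "\n".join(lines)


sample = [["Region", "Q3 sales"], ["North", "120"], ["South", None]]
print(table_to_markdown(sample))
```

Rendering the full table keeps the cell values available to an LLM, while the short summary from `summarize_table` gives it the overall shape.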

extract_image_caption(page, bbox, nlp)

def extract_image_caption(page, bbox, nlp):
    # Crop a rectangle below the image to look for accompanying text.
    # (Heuristic: a box directly under the image, 30 points high.)
    cap_top = bbox[3]
    cap_bot = cap_top + 30
    try:
        crop = page.within_bbox((bbox[0], cap_top, bbox[2], cap_bot))
        text = crop.extract_text() if crop else ""
        if not text:
            # "Extracted image (no description detected)"
            return "Image extraite (pas de description détectée)"
        doc = nlp(text)
        relevant = " ".join([sent.text for sent in doc.sents])
        return relevant or "Image extraite"  # "Extracted image"
    except Exception:
        return "Image extraite"

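🔧 Purpose: Looks for a caption in the text directly below an image (a 30-point-high band under its bounding box) and segments it into sentences with spaCy.

Input: pdfplumber page object, image bounding box, spaCy NLP model

Output: A caption string, or a fallback string when no text is found or extraction fails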

Use Case: Helps LLMs or RAG pipelines understand visual content (e.g., graphs, charts)
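One edge case in the caption heuristic is an image close to the bottom of the page: the 30-point band may extend past the page, and depending on the pdfplumber version the crop may then be rejected, which the function turns into the generic fallback. The sketch below clamps the band to the page first. `caption_band` is a hypothetical helper; it assumes pdfplumber's `(x0, top, x1, bottom)` convention with the origin at the top-left, and takes the page size as plain numbers.

```python
def caption_band(bbox, page_width, page_height, band_height=30):
    """Return the (x0, top, x1, bottom) box just below an image, or None.

    Coordinates follow the (x0, top, x1, bottom) convention with the
    origin at the top-left of the page and `bottom` increasing downward.
    """
    x0, _top, x1, bottom = bbox
    # Clamp every edge to the page so the box never extends past it.
    left = max(0, x0)
    right = min(page_width, x1)
    band_top = max(0, bottom)
    band_bottom = min(page_height, bottom + band_height)
    if right <= left or band_bottom <= band_top:
        return None  # No room below the image (it touches the page edge).
    return (left, band_top, right, band_bottom)


print(caption_band((50, 100, 300, 400), page_width=595, page_height=842))
# (50, 400, 300, 430)
print(caption_band((50, 600, 300, 830), page_width=595, page_height=842))
# (50, 830, 300, 842)
print(caption_band((50, 600, 300, 842), page_width=595, page_height=842))
# None
```

The returned tuple could be passed to `page.within_bbox(...)`, with `page.width` and `page.height` supplying the page size.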

extract_content(pdf_path, page_num, nlp)

def extract_content(pdf_path, page_num, nlp):
    result = {
        "text": "",
        "tables": [],
        "tables_desc": [],
        "images": [],
        "images_desc": []
    }
    # ... (body elided)
    # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.
    return result

🔧 Purpose: Core function that extracts text, tables, and images, plus their descriptions, from a single-page PDF.

Input: Path to a one-page PDF, its page number, and the loaded spaCy model

Output: Dictionary with:

  • text: Extracted page text

  • tables: Raw table data

  • tables_desc: Descriptions of tables

  • images: Saved image paths

  • images_desc: Captions

Use Case:

Converts unstructured PDFs into structured data

Ideal for embedding into vector DBs or prompting LLMs

save_txt_and_md(base_name, page_number, content)

def save_txt_and_md(base_name, page_number, content):
    """Write the .txt and .md versions to separate folders."""

    txt_filename = os.path.join(TEXT_DIR, f"{base_name}_page_{page_number}.txt")
    md_filename = os.path.join(README_DIR, f"{base_name}_page_{page_number}.md")
    # ... (body elided)
    # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.

🔧 Purpose: Saves extracted content into:

.txt → for simple indexing

.md → for structured, human-readable rendering

Input: Base filename, page number, extracted content

Output: Files saved in extracted_text/ and extracted_readme/

Use Case:

  • .txt: For vectorization / NLP pipelines

  • .md: For documentation, GitHub, or prompt templates
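To make the `.md` output more concrete, here is a minimal sketch of how a content dictionary with the keys `text`, `tables_desc`, `images`, and `images_desc` could be rendered as Markdown. It is an illustration only: `render_page_markdown` is a hypothetical helper, the heading layout is an assumption, and the actual formatting is the one in the notebook.

```python
def render_page_markdown(base_name, page_number, content):
    """Build a Markdown string from one page's content dictionary."""
    lines = [f"# {base_name} - page {page_number}", ""]
    text = (content.get("text") or "").strip()
    if text:
        lines += ["## Text", "", text, ""]
    for i, desc in enumerate(content.get("tables_desc", []), 1):
        lines += [f"## Table {i}", "", desc, ""]
    # zip stops at the shorter list, so an image without a caption is skipped.
    images = zip(content.get("images", []), content.get("images_desc", []))
    for i, (path, caption) in enumerate(images, 1):
        # Markdown image syntax: ![alt text](path)
        lines += [f"## Image {i}", "", f"![{caption}]({path})", ""]
    return "\n".join(lines).rstrip() + "\n"


example = {
    "text": "Quarterly results.",
    "tables": [[["Region", "Sales"], ["North", "120"]]],
    "tables_desc": ["Tableau de 2 lignes et 2 colonnes."],
    "images": ["extracted_images/img1.png"],
    "images_desc": ["Chart"],
}
print(render_page_markdown("new", 3, example))
```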

process_pdf(pdf_path)

def process_pdf(pdf_path):
    # spaCy model for French (switch to an English model if needed)
    nlp = spacy.load("fr_core_news_sm")
    all_pages = split_pdf(pdf_path)
    basename = os.path.splitext(os.path.basename(pdf_path))[0]
    for idx, page_file in enumerate(all_pages, 1):
        print(f"Processing page {idx}...")
        content = extract_content(page_file, idx, nlp)
        save_txt_and_md(basename, idx, content)

process_pdf("/teamspace/studios/this_studio/new.pdf")

🔧 Purpose: Full pipeline controller.

Steps:

Loads spaCy NLP model

Splits PDF into pages

Runs extract_content() on each page

Writes .txt and .md outputs

Use Case: One-call function to process any PDF for use in LLM pipelines or RAG systems.

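For batch runs over many PDFs, the one-call function can be wrapped in a loop that records failures instead of stopping at the first bad file. This is a sketch under assumptions: `process_folder` is a hypothetical helper, and the processing function is passed in as an argument so the example stays self-contained.

```python
from pathlib import Path


def process_folder(folder, process):
    """Call process(path) on every PDF in folder; return (done, failed) lists."""
    done, failed = [], []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            process(str(pdf_path))
            done.append(pdf_path.name)
        except Exception as exc:
            # Keep going so one bad file does not stop the whole batch.
            failed.append((pdf_path.name, str(exc)))
    return done, failed


# Usage with the pipeline above (the folder name is an example):
# done, failed = process_folder("pdfs", process_pdf)
# print(f"{len(done)} processed, {len(failed)} failed")
```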

clean_generated_files(base_name, total_pages)

import os
import glob
import shutil

# Directories containing the generated files
IMAGES_DIR = "extracted_images"
OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
TEXT_DIR="/teamspace/studios/this_studio/extracted_text"
README_DIR="/teamspace/studios/this_studio/extracted_readme"



def clean_generated_files(base_name, total_pages):
    """Delete the generated .txt and .md files and the extracted images."""
    # Delete the text and Markdown files
    # ... (body elided)
    # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.


clean_generated_files(base_name="new", total_pages=450)

🔧 Purpose: Deletes all generated files (images, text, markdown, PDFs).

Use Case: Keeps workspace clean between runs or during batch processing.
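As an illustration of the cleanup idea, the sketch below removes files that follow the `<base_name>_page_<n>` naming used for the split PDFs, `.txt`, and `.md` outputs. It is not the notebook's implementation: `remove_generated` is a hypothetical helper, it uses a glob pattern instead of a page count, and it does not cover extracted images because their naming scheme is not shown here. The `dry_run` default lists the files without deleting anything.

```python
import glob
import os


def remove_generated(base_name, directories, dry_run=True):
    """Delete (or only list) files named '<base_name>_page_*' in directories."""
    removed = []
    for directory in directories:
        # glob.escape keeps special characters in names from acting as wildcards.
        pattern = os.path.join(
            glob.escape(directory), glob.escape(base_name) + "_page_*"
        )
        for path in sorted(glob.glob(pattern)):
            if os.path.isfile(path):
                if not dry_run:
                    os.remove(path)
                removed.append(path)
    return removed


# Usage (directory names as defined earlier in the script):
# print(remove_generated("new", [OUTPUT_DIR, TEXT_DIR, README_DIR]))
# remove_generated("new", [OUTPUT_DIR, TEXT_DIR, README_DIR], dry_run=False)
```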

🧠 How This Supports LLMs and RAG

✅ Perfect for LLMs

Clean, structured input = better prompts

Table summaries = easier understanding

Captions = image context

✅ Ideal for RAG Pipelines

Processed .txt / .md files → vectorized → stored in FAISS/Chroma

Embeddings allow semantic search (e.g., “What are the sales trends in Q3?”)

Extracted content can be retrieved at query time and passed to LLMs
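Before a page's `.txt` content is embedded, it is commonly split into smaller overlapping chunks. The sketch below shows a simple character-based splitter; `chunk_text` is a hypothetical helper, and the default sizes are arbitrary assumptions rather than values from this project.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text.
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk would then be embedded and stored in the vector store together with its source file name, so an answer can be traced back to a page.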

🔬 Example Pipeline: LLM + RAG

[PDF File]
   ↓
[This Script]
   ↓
[.md / .txt Files]
   ↓
[Embedding & Vector Store]
   ↓
[Query → RAG → Prompt to LLM]
   ↓
[Answer Generated]
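The retrieval step of this pipeline can be illustrated with a toy example. The sketch below swaps real embeddings and a vector store for plain word counts and cosine similarity, purely to show the flow from query to retrieved page to prompt; all function names and the sample page texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=1):
    """Return the names of the top_k documents most similar to the query."""
    query_vec = bag_of_words(query)
    scored = [
        (cosine(query_vec, bag_of_words(text)), name)
        for name, text in documents.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _score, name in scored[:top_k]]


# Stand-ins for the contents of the generated .txt files.
pages = {
    "new_page_1.txt": "Introduction and company overview.",
    "new_page_2.txt": "Sales trends in Q3 show growth in the northern region.",
}
question = "What are the sales trends in Q3?"
best = retrieve(question, pages)
prompt = f"Answer using this context:\n{pages[best[0]]}\n\nQuestion: {question}"
print(best)
# ['new_page_2.txt']
```

In a real setup, an embedding model and a store such as FAISS or ChromaDB would replace `bag_of_words` and `retrieve`, and `prompt` would be sent to the LLM.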

📚 Technologies Used

| Library | Purpose |
| --- | --- |
| PyPDF2 | Splitting PDFs |
| pdfplumber | Extracting text, images, tables |
| spaCy | NLP for caption generation |
| Pillow | Image cropping and saving |
| matplotlib | (Optional) image visualization |

Practice

Links:

- Colab Notebook - PDF Structuration

- GitHub Repository - PDF Extraction Script