📄 PDF Data Extraction and Structuring Tool (Custom Scripts)
==============================================================

🧩 Overview
----------------

This Python tool automates content extraction from complex PDFs for:

- Language models (LLMs): summarization and question answering (QA)

- RAG pipelines: document search, semantic retrieval, and knowledge base injection

🌍 Primary Use Case
------------------------

Objective: Extract structured data for NLP/AI systems:

- ChatGPT/LLaMA/Mistral

- Vector stores (FAISS, ChromaDB)

- RAG pipelines (LangChain, LlamaIndex)

🔁 Workflow Summary
------------------------

::

    [Large PDF]
       ↓
    [Split into Pages]
       ↓
    [Extract Text, Tables, Images]
       ↓
    [Save as .txt + .md]
       ↓
    [LLMs/Vector Stores/RAG]

📦 Key Features
---------------------

.. list-table::
   :header-rows: 1

   * - Feature
     - Description
   * - PDF Splitting
     - Breaks the PDF into single-page PDFs
   * - Text Extraction
     - Extracts clean text using pdfplumber
   * - Table Extraction
     - Pulls and summarizes tables
   * - Image Extraction
     - Crops images and suggests captions using NLP
   * - Markdown + Text + Excel Generation
     - Writes structured ``.md``, ``.txt``, ``.csv``, and ``.xlsx`` files

⚙️ Prerequisites
---------------------

```bash
pip install PyPDF2 pdfplumber spacy matplotlib pillow pandas
```

```bash
python -m spacy download fr_core_news_sm
```

🔍 Function-by-Function Explanation
-------------------------------------------

Key Libraries Used
____________________

PyPDF2: For handling the splitting of PDFs and creating new PDF files.

pdfplumber: For extracting text, tables, and images with precision.

spaCy: For natural language processing, here splitting the text found under an image into sentences.

Pillow (PIL): For handling image cropping and file storage.

Matplotlib: For (optional) image visualization.

.. code-block:: python

    import os
    import shutil
    import pdfplumber
    from PyPDF2 import PdfReader, PdfWriter
    import spacy
    from PIL import Image
    import matplotlib.pyplot as plt
    import pandas as pd

    # CREATE OUTPUT DIRECTORIES
    # Create directories for storing extracted data
    OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    IMAGES_DIR = "extracted_images"
    os.makedirs(IMAGES_DIR, exist_ok=True)
    TEXT_DIR = "/teamspace/studios/this_studio/extracted_text"
    os.makedirs(TEXT_DIR, exist_ok=True)
    README_DIR = "/teamspace/studios/this_studio/extracted_readme"
    os.makedirs(README_DIR, exist_ok=True)

split_pdf(input_pdf_path)
________________________________

.. code-block:: python

    def split_pdf(input_pdf_path):
        pdf_reader = PdfReader(input_pdf_path)
        # Base name of the input file, without its extension
        base = os.path.splitext(os.path.basename(input_pdf_path))[0]
        output_files = []
        for i, page in enumerate(pdf_reader.pages):
            # Write each page to its own one-page PDF
            writer = PdfWriter()
            writer.add_page(page)
            file_path = os.path.join(OUTPUT_DIR, f"{base}_page_{i+1}.pdf")
            with open(file_path, "wb") as f_out:
                writer.write(f_out)
            output_files.append(file_path)
        return output_files

🔧 Purpose: Splits a multi-page PDF into individual one-page files.

Input: Path to input PDF

Output: List of file paths (1 PDF per page)

Use Case: Enables page-by-page extraction, faster and more manageable for pipelines

summarize_table(table)
________________________

.. code-block:: python

    def summarize_table(table):
        """Simple summary: number of rows, number of columns, first elements."""
        if not table or not any(table):
            return "Tableau vide"  # "Empty table"
        num_rows = len(table)
        num_cols = len(table[0])
        head = table[0]
        # "Table with {num_rows} rows and {num_cols} columns. Header: {head}."
        description = f"Tableau de {num_rows} lignes et {num_cols} colonnes. En-tête : {head}."
        return description

🔧 Purpose: Generates a summary for each table.

Input: One table (a list of rows), as found in the list returned by pdfplumber's ``page.extract_tables()``

Output: A short human-readable description of the table (rows, columns, headers)

Use Case: Allows LLMs to understand the structure of data tables (useful for summarization or QA)
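
Alongside the one-line summary, the same table could be rendered as a Markdown table for the ``.md`` output. The following is a minimal sketch, not the notebook's implementation: ``table_to_markdown`` is a hypothetical helper, and it assumes pdfplumber-style input where a table is a list of rows and empty cells may be ``None``.

```python
def table_to_markdown(table):
    """Render a list-of-rows table as a Markdown table (hypothetical helper)."""
    if not table or not any(table):
        return ""

    def clean(cell):
        # Empty cells are assumed to arrive as None; newlines and pipes
        # would break a Markdown row, so they are replaced or escaped.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in table)
    # Pad short rows so every row has the same number of columns
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Made-up sample data for illustration
sample = [["Produit", "Q3"], ["A", None], ["B", "12"]]
print(table_to_markdown(sample))
```

The first row is treated as the header, which matches how ``summarize_table`` reports ``table[0]`` as the header.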

extract_image_caption(page, bbox, nlp)
________________________________________

.. code-block:: python

    def extract_image_caption(page, bbox, nlp):
        # Crop a rectangle below the image to look for accompanying text
        # (heuristic: a box just below the image, 30 points high)
        cap_top = bbox[3]
        cap_bot = cap_top + 30
        try:
            crop = page.within_bbox((bbox[0], cap_top, bbox[2], cap_bot))
            text = crop.extract_text() if crop else ""
            if not text:
                # "Image extracted (no description detected)"
                return "Image extraite (pas de description détectée)"
            doc = nlp(text)
            relevant = " ".join([sent.text for sent in doc.sents])
            return relevant or "Image extraite"  # "Image extracted"
        except Exception:
            return "Image extraite"

🔧 Purpose: Extracts a caption or description from the text just below an image; spaCy splits that text into sentences.

Input: Image bounding box, page object, spaCy NLP model

Output: A caption string (or fallback)

Use Case: Helps LLMs or RAG pipelines understand visual content (e.g., graphs, charts)
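
The 30-point box used by the heuristic can extend past the bottom of the page when an image sits near the page edge. pdfplumber's cropping methods can reject a box that is not fully inside the page, in which case the ``except`` branch returns the fallback string. A minimal sketch of clamping the box first, assuming the page height is known (for example from ``page.height``); ``caption_bbox`` is a hypothetical helper that is not part of the original script, and it only clamps vertically.

```python
def caption_bbox(image_bbox, page_height, caption_height=30):
    """Return the box just below an image, clamped to the page, or None.

    image_bbox is (x0, top, x1, bottom), with top/bottom measured from the
    top of the page. Hypothetical helper for illustration.
    """
    x0, _top, x1, bottom = image_bbox
    cap_top = bottom
    cap_bottom = min(bottom + caption_height, page_height)
    if cap_bottom <= cap_top or x1 <= x0:
        return None  # no room left under the image
    return (x0, cap_top, x1, cap_bottom)


print(caption_bbox((50, 100, 300, 400), page_height=842))  # full 30-point box
print(caption_bbox((50, 100, 300, 830), page_height=842))  # clamped to the page
print(caption_bbox((50, 100, 300, 842), page_height=842))  # image touches the bottom edge
```

Returning ``None`` lets the caller skip the crop entirely instead of relying on the exception handler.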

extract_content(pdf_path, page_num, nlp)
__________________________________________

.. code-block:: python

    def extract_content(pdf_path, page_num, nlp):
        result = {
            "text": "",
            "tables": [],
            "tables_desc": [],
            "images": [],
            "images_desc": []
        }
        # ..................................................
        # ..................................................
        # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.
        return result

🔧 Purpose: Core function that extracts text, tables, images, and metadata from a single-page PDF.

Input: Path to a one-page PDF, its page number, and the spaCy NLP model

Output: Dictionary with:

- **text**: Extracted page text
- **tables**: Raw table data
- **tables_desc**: Descriptions of tables
- **images**: Saved image paths
- **images_desc**: Captions

Use Case:

- Converts unstructured PDFs into structured data
- Ideal for embedding into vector DBs or prompting LLMs

save_txt_and_md(base_name, page_number, content)
__________________________________________________

.. code-block:: python

    def save_txt_and_md(base_name, page_number, content):
        """Generate the .txt and .md versions in separate folders."""
        txt_filename = os.path.join(TEXT_DIR, f"{base_name}_page_{page_number}.txt")
        md_filename = os.path.join(README_DIR, f"{base_name}_page_{page_number}.md")
        # ..................................................
        # ..................................................
        # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.

🔧 Purpose: Saves extracted content into:

- ``.txt`` → for simple indexing
- ``.md`` → for structured, human-readable rendering

Input: Base filename, page number, extracted content

Output: Files saved in extracted_text/ and extracted_readme/
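
The body of ``save_txt_and_md`` is only available in the notebook, so the sketch below merely illustrates one way a result dictionary with the keys returned by ``extract_content`` could be turned into Markdown. ``render_markdown``, its heading layout, and the sample image path are assumptions for illustration, not the notebook's code.

```python
def render_markdown(content, page_number):
    """Build a Markdown string from an extract_content-style dict (illustrative layout)."""
    parts = [f"# Page {page_number}", "", content.get("text", "").strip()]

    # One bullet per table description
    if content.get("tables_desc"):
        parts += ["", "## Tables", ""]
        parts += [f"- {desc}" for desc in content["tables_desc"]]

    # One image link per saved image, captioned when a description exists
    if content.get("images"):
        parts += ["", "## Images", ""]
        descs = content.get("images_desc", [])
        for i, path in enumerate(content["images"]):
            caption = descs[i] if i < len(descs) else ""
            parts.append(f"![{caption}]({path})")

    return "\n".join(parts) + "\n"


# Made-up example; the image path is hypothetical
example = {
    "text": "Rapport trimestriel.",
    "tables": [[["Produit", "Q3"], ["A", "12"]]],
    "tables_desc": ["Tableau de 2 lignes et 2 colonnes. En-tête : ['Produit', 'Q3']."],
    "images": ["extracted_images/page_1_img_1.png"],
    "images_desc": ["Image extraite"],
}
print(render_markdown(example, 1))
```

Keeping the rendering in a function that returns a string makes it easy to check the output before anything is written to disk.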


Use Case:

* **.txt:** For vectorization / NLP pipelines
* **.md:** For documentation, GitHub, or prompt templates

process_pdf(pdf_path)
_____________________

.. code-block:: python

    def process_pdf(pdf_path):
        # spaCy model for French (switch to an English model if needed)
        nlp = spacy.load("fr_core_news_sm")
        all_pages = split_pdf(pdf_path)
        basename = os.path.splitext(os.path.basename(pdf_path))[0]
        for idx, page_file in enumerate(all_pages, 1):
            print(f"Traitement page {idx}...")  # "Processing page {idx}..."
            content = extract_content(page_file, idx, nlp)
            save_txt_and_md(basename, idx, content)

    process_pdf("/teamspace/studios/this_studio/new.pdf")

🔧 Purpose: Full pipeline controller.

Steps:

- Loads the spaCy NLP model
- Splits the PDF into pages
- Runs ``extract_content()`` on each page
- Writes ``.txt`` and ``.md`` outputs

Use Case: One-call function to process any PDF for use in LLM pipelines or RAG systems.

clean_generated_files(base_name, total_pages)
_______________________________________________

.. code-block:: python

    import os
    import glob
    import shutil

    # Directory containing the extracted images
    IMAGES_DIR = "extracted_images"
    OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
    TEXT_DIR = "/teamspace/studios/this_studio/extracted_text"
    README_DIR = "/teamspace/studios/this_studio/extracted_readme"

    def clean_generated_files(base_name, total_pages):
        """Delete the generated .txt and .md files and the extracted images."""
        # Delete the text and markdown files
        # ..................................................
        # ..................................................
        # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.

    clean_generated_files(base_name="new", total_pages=450)

🔧 Purpose: Deletes all generated files (images, text, markdown, PDFs).

Use Case: Keeps workspace clean between runs or during batch processing.
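
Since the cleanup body is only in the notebook, here is a minimal sketch of the per-page deletion idea, assuming the ``<base_name>_page_<n>`` file naming used by ``save_txt_and_md``. ``remove_page_files`` is a hypothetical helper, and the demo runs in a temporary directory so that no real output is touched.

```python
import os
import tempfile


def remove_page_files(directory, base_name, total_pages, extensions):
    """Delete <base_name>_page_<n><ext> files in one directory; return the removed paths."""
    removed = []
    for page in range(1, total_pages + 1):
        for ext in extensions:
            path = os.path.join(directory, f"{base_name}_page_{page}{ext}")
            # Missing pages are skipped, so total_pages may be an upper bound
            if os.path.exists(path):
                os.remove(path)
                removed.append(path)
    return removed


# Demo: two files belong to "new", one belongs to another document
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 2):
        open(os.path.join(tmp, f"new_page_{n}.txt"), "w").close()
    open(os.path.join(tmp, "other_page_1.txt"), "w").close()
    removed = remove_page_files(tmp, "new", total_pages=450, extensions=(".txt", ".md"))
    remaining = sorted(os.listdir(tmp))

print(len(removed), remaining)
```

Matching on the exact file name rather than a broad wildcard keeps files from other documents in the same folder intact.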

🧠 How This Supports LLMs and RAG
___________________________________

✅ Perfect for LLMs

- Clean, structured input = better prompts
- Table summaries = easier understanding
- Captions = image context

✅ Ideal for RAG Pipelines

- Processed ``.txt`` / ``.md`` files → vectorized → stored in FAISS/Chroma
- Embeddings allow semantic search (e.g., “What are the sales trends in Q3?”)
- Extracted content can be retrieved at query time and passed to LLMs

🔬 Example Pipeline: LLM + RAG
__________________________________

::

    [PDF File]
       ↓
    [This Script]
       ↓
    [.md / .txt Files]
       ↓
    [Embedding & Vector Store]
       ↓
    [Query → RAG → Prompt to LLM]
       ↓
    [Answer Generated]
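
Before the embedding step in the pipeline above, the text of each ``.txt`` file is usually split into smaller pieces. Below is a minimal sketch of fixed-size chunking with overlap, using only the standard library. ``chunk_text`` and the size and overlap values are assumptions for illustration; the embedding and vector-store calls are left out because they depend on the library chosen (FAISS, ChromaDB, LangChain, LlamaIndex).

```python
def chunk_text(text, size=500, overlap=50):
    """Split text into chunks of at most `size` characters.

    Each chunk starts `size - overlap` characters after the previous one,
    so consecutive chunks share `overlap` characters.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


page_text = "abcdefghij" * 12  # 120 characters standing in for one extracted page
chunks = chunk_text(page_text, size=50, overlap=10)
print([len(c) for c in chunks])
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.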

📚 Technologies Used
______________________

.. list-table::
   :header-rows: 1

   * - Library
     - Purpose
   * - PyPDF2
     - Splitting PDFs
   * - pdfplumber
     - Extracting text, images, tables
   * - spaCy
     - NLP for caption generation
   * - Pillow
     - Image cropping and saving
   * - matplotlib
     - (Optional) image visualization

Practice
----------------------

Links:

- Colab Notebook - PDF Structuration
- GitHub Repository - PDF Extraction Script