📄 PDF Data Extraction and Structuring Tool (Custom Scripts)
==============================================================

🧩 Overview
----------------

This Python tool automates content extraction from complex PDFs for:

- Language Models (LLMs): summarization, QA
- RAG pipelines: document search, semantic retrieval, and knowledge base injection


🌍 Primary Use Case
------------------------

**Objective:** Extract structured data for NLP/AI systems:

- ChatGPT/LLaMA/Mistral
- Vector stores (FAISS, ChromaDB)
- RAG pipelines (LangChain, LlamaIndex)


🔁 Workflow Summary
------------------------

| [Large PDF]
| ↓
| [Split into Pages]
| ↓
| [Extract Text, Tables, Images]
| ↓
| [Save as .txt + .md]
| ↓
| [LLMs/Vector Stores/RAG]


📦 Key Features
---------------------

.. list-table::
   :header-rows: 1

   * - Feature
     - Description
   * - PDF Splitting
     - Breaks the input into single-page PDFs
   * - Text Extraction
     - Extracts clean text using pdfplumber
   * - Table Extraction
     - Pulls and summarizes tables
   * - Image Extraction
     - Crops images and suggests captions using NLP
   * - Markdown + Text + Excel Generation
     - Writes structured ``.md``, ``.txt``, ``.csv``, and ``.xlsx`` files

The tool relies on the following libraries:

- **PyPDF2:** splits PDFs and writes the new single-page PDF files.
- **pdfplumber:** extracts text, tables, and images.
- **spaCy:** natural language processing, used here on the text found near images.
- **Pillow (PIL):** image cropping and file storage.
- **Matplotlib:** (optional) image visualization.


.. code-block:: python

   import os
   import shutil
   import pdfplumber
   from PyPDF2 import PdfReader, PdfWriter
   import spacy
   from PIL import Image
   import matplotlib.pyplot as plt
   import pandas as pd

   # CREATE OUTPUT DIRECTORIES
   # Create directories for storing extracted data
   OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
   os.makedirs(OUTPUT_DIR, exist_ok=True)
   IMAGES_DIR = "extracted_images"
   os.makedirs(IMAGES_DIR, exist_ok=True)
   TEXT_DIR = "/teamspace/studios/this_studio/extracted_text"
   os.makedirs(TEXT_DIR, exist_ok=True)
   README_DIR = "/teamspace/studios/this_studio/extracted_readme"
   os.makedirs(README_DIR, exist_ok=True)

split_pdf(input_pdf_path)
________________________________

.. code-block:: python

   def split_pdf(input_pdf_path):
       pdf_reader = PdfReader(input_pdf_path)
       base = os.path.splitext(os.path.basename(input_pdf_path))[0]
       output_files = []
       for i, page in enumerate(pdf_reader.pages):
           # Write each page into its own single-page PDF
           writer = PdfWriter()
           writer.add_page(page)
           file_path = os.path.join(OUTPUT_DIR, f"{base}_page_{i+1}.pdf")
           with open(file_path, "wb") as f_out:
               writer.write(f_out)
           output_files.append(file_path)
       return output_files

🔧 **Purpose:** Splits a multi-page PDF into individual one-page files.

- **Input:** Path to the input PDF
- **Output:** List of file paths (one PDF per page)
- **Use case:** Enables page-by-page extraction, which is faster and more manageable for pipelines


summarize_table(table)
________________________

.. code-block:: python

   def summarize_table(table):
       """Simple summary: number of rows, number of columns, header row."""
       if not table or not any(table):
           return "Tableau vide"  # "Empty table"
       num_rows = len(table)
       num_cols = len(table[0])
       head = table[0]
       # French output: "Table of N rows and M columns. Header: [...]."
       description = f"Tableau de {num_rows} lignes et {num_cols} colonnes. En-tête : {head}."
       return description

🔧 **Purpose:** Generates a summary for each table.

- **Input:** A list of rows, as returned for one table by pdfplumber's ``extract_tables()``
- **Output:** A short human-readable description of the table (rows, columns, headers)
- **Use case:** Lets LLMs understand the structure of data tables (useful for summarization or QA)

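
The same rows can also be rendered as a table in the ``.md`` output. The following is a minimal sketch, not code from the notebook: ``table_to_markdown`` is a hypothetical helper, and it assumes each row is a list whose cells are strings or ``None`` (pdfplumber uses ``None`` for empty cells).

.. code-block:: python

   def table_to_markdown(table):
       """Render a list of rows as a Markdown pipe table (first row = header)."""
       if not table or not any(table):
           return ""
       width = max(len(row) for row in table)

       def clean(cell):
           # Empty cells arrive as None; newlines and pipes would break a row.
           text = "" if cell is None else str(cell)
           return text.replace("\n", " ").replace("|", "\\|").strip()

       # Pad short rows so every row has the same number of cells.
       rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]
       header, body = rows[0], rows[1:]
       lines = [
           "| " + " | ".join(header) + " |",
           "|" + "|".join(["---"] * width) + "|",
       ]
       lines += ["| " + " | ".join(row) + " |" for row in body]
       return "\n".join(lines)

Treating the first row as the header matches what ``summarize_table`` does, but that is a heuristic: a table without a header row would need different handling.
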

extract_image_caption(page, bbox, nlp)
________________________________________

.. code-block:: python

   def extract_image_caption(page, bbox, nlp):
       # Crop a rectangle below the image to look for accompanying text
       # (heuristic: a box directly under the image, 30 points high)
       cap_top = bbox[3]
       cap_bot = cap_top + 30
       try:
           crop = page.within_bbox((bbox[0], cap_top, bbox[2], cap_bot))
           text = crop.extract_text() if crop else ""
           if not text:
               # "Image extracted (no description detected)"
               return "Image extraite (pas de description détectée)"
           doc = nlp(text)
           relevant = " ".join([sent.text for sent in doc.sents])
           return relevant or "Image extraite"  # "Image extracted"
       except Exception:
           return "Image extraite"

🔧 **Purpose:** Extracts a caption or description from the area just below an image using NLP.

- **Input:** Page object, image bounding box, spaCy NLP model
- **Output:** A caption string (or a fallback string)
- **Use case:** Helps LLMs or RAG pipelines understand visual content (e.g., graphs, charts)

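
The caption heuristic looks in a 30-point strip directly under the image. When an image sits near the bottom of the page, that strip can extend past the page, in which case the crop may fail and the function falls back to its default string. The sketch below shows one way to compute a clamped strip. It assumes bounding boxes of the form ``(x0, top, x1, bottom)`` with the origin at the top-left, as in the function above; ``caption_bbox`` and its parameters are hypothetical names, not part of the notebook.

.. code-block:: python

   def caption_bbox(image_bbox, page_height, strip_height=30):
       """Return the (x0, top, x1, bottom) strip just below an image, or None."""
       x0, _top, x1, bottom = image_bbox
       cap_top = bottom
       # Clamp the strip so it never extends below the page.
       cap_bottom = min(cap_top + strip_height, page_height)
       if cap_bottom <= cap_top or x1 <= x0:
           return None  # no room under the image
       return (x0, cap_top, x1, cap_bottom)


   # An image ending 12 points above the bottom of a 792-point page
   print(caption_bbox((50, 100, 300, 780), page_height=792))

Returning ``None`` when there is no room lets the caller skip the crop entirely instead of relying on the broad ``except``.
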

extract_content(pdf_path, page_num, nlp)
__________________________________________

.. code-block:: python

   def extract_content(pdf_path, page_num, nlp):
       result = {
           "text": "",
           "tables": [],
           "tables_desc": [],
           "images": [],
           "images_desc": []
       }
       # ..................................................
       # ..................................................
       # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.
       return result

🔧 **Purpose:** Core function that extracts text, tables, images, and metadata from a single-page PDF.

- **Input:** Path to a one-page PDF
- **Output:** Dictionary with:

  - ``text``: extracted page text
  - ``tables``: raw table data
  - ``tables_desc``: descriptions of the tables
  - ``images``: saved image paths
  - ``images_desc``: captions

- **Use case:**

  - Converts unstructured PDFs into structured data
  - Ideal for embedding into vector DBs or prompting LLMs

save_txt_and_md(base_name, page_number, content)
__________________________________________________

.. code-block:: python

   def save_txt_and_md(base_name, page_number, content):
       """Generate the .txt and .md versions in separate folders."""
       txt_filename = os.path.join(TEXT_DIR, f"{base_name}_page_{page_number}.txt")
       md_filename = os.path.join(README_DIR, f"{base_name}_page_{page_number}.md")
       # ..................................................
       # ..................................................
       # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.

🔧 **Purpose:** Saves extracted content into:

- ``.txt`` → for simple indexing
- ``.md`` → for structured, human-readable rendering

- **Input:** Base filename, page number, extracted content
- **Output:** Files saved in ``extracted_text/`` and ``extracted_readme/``
- **Use case:**

  - **.txt:** For vectorization / NLP pipelines
  - **.md:** For documentation, GitHub, or prompt templates

process_pdf(pdf_path)
________________________

.. code-block:: python

   def process_pdf(pdf_path):
       # spaCy model for French (switch to an English model if needed)
       nlp = spacy.load("fr_core_news_sm")
       all_pages = split_pdf(pdf_path)
       basename = os.path.splitext(os.path.basename(pdf_path))[0]
       for idx, page_file in enumerate(all_pages, 1):
           print(f"Traitement page {idx}...")  # "Processing page {idx}..."
           content = extract_content(page_file, idx, nlp)
           save_txt_and_md(basename, idx, content)

   process_pdf("/teamspace/studios/this_studio/new.pdf")

🔧 **Purpose:** Full pipeline controller.

**Steps:**

1. Loads the spaCy NLP model
2. Splits the PDF into pages
3. Runs ``extract_content()`` on each page
4. Writes the ``.txt`` and ``.md`` outputs

**Use case:** One-call function to process any PDF for use in LLM pipelines or RAG systems.

clean_generated_files(base_name, total_pages)
_______________________________________________

.. code-block:: python

   import os
   import glob
   import shutil

   # Folder containing the extracted images
   IMAGES_DIR = "extracted_images"
   OUTPUT_DIR = "/teamspace/studios/this_studio/OUTPUT_DIR"
   TEXT_DIR = "/teamspace/studios/this_studio/extracted_text"
   README_DIR = "/teamspace/studios/this_studio/extracted_readme"

   def clean_generated_files(base_name, total_pages):
       """Delete the generated .txt and .md files and the extracted images."""
       # Delete the text and markdown files
       # ..................................................
       # ..................................................
       # The function code is in `structuration_pdf.ipynb`. Open the notebook to view and experiment.

   clean_generated_files(base_name="new", total_pages=450)

🔧 **Purpose:** Deletes all generated files (images, text, markdown, PDFs).

**Use case:** Keeps the workspace clean between runs or during batch processing.

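
The bodies of ``save_txt_and_md`` and ``clean_generated_files`` live in the notebook and are not reproduced here. As a rough illustration of the write-then-clean round trip, the sketch below uses only the standard library. The function names ``write_page_outputs`` and ``remove_page_outputs``, the Markdown layout, and the choice to pass directories as arguments are all assumptions made for this example, not the notebook's implementation.

.. code-block:: python

   import glob
   import os


   def write_page_outputs(text_dir, md_dir, base_name, page_number, content):
       """Write one page's extracted content as .txt and .md; return both paths."""
       stem = f"{base_name}_page_{page_number}"
       txt_path = os.path.join(text_dir, stem + ".txt")
       md_path = os.path.join(md_dir, stem + ".md")

       # Plain text only: suited to chunking and vectorization.
       with open(txt_path, "w", encoding="utf-8") as f:
           f.write(content.get("text", ""))

       # Markdown: page text plus table descriptions and image captions.
       md_lines = [f"# {base_name} - page {page_number}", "", content.get("text", "")]
       for desc in content.get("tables_desc", []):
           md_lines += ["", f"- Table: {desc}"]
       for path, desc in zip(content.get("images", []), content.get("images_desc", [])):
           md_lines += ["", f"![{desc}]({path})"]
       with open(md_path, "w", encoding="utf-8") as f:
           f.write("\n".join(md_lines) + "\n")
       return txt_path, md_path


   def remove_page_outputs(base_name, *directories):
       """Delete every file named <base_name>_page_* in the given directories."""
       removed = []
       for directory in directories:
           # Escape both parts so special characters are matched literally.
           pattern = os.path.join(glob.escape(directory), glob.escape(base_name) + "_page_*")
           for path in glob.glob(pattern):
               os.remove(path)
               removed.append(path)
       return removed

Matching on the ``<base_name>_page_*`` prefix means the cleanup does not need to know the page count in advance, at the cost of also removing any unrelated file that happens to share the prefix.
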

🧠 How This Supports LLMs and RAG
__________________________________

✅ **Perfect for LLMs**

- Clean, structured input = better prompts
- Table summaries = easier understanding
- Captions = image context

✅ **Ideal for RAG Pipelines**

- Processed ``.txt`` / ``.md`` files → vectorized → stored in FAISS/Chroma
- Embeddings allow semantic search (e.g., "What are the sales trends in Q3?")
- Extracted content can be retrieved at query time and passed to LLMs

🔬 Example Pipeline: LLM + RAG
________________________________

| [PDF File]
| ↓
| [This Script]
| ↓
| [.md / .txt Files]
| ↓
| [Embedding & Vector Store]
| ↓
| [Query → RAG → Prompt to LLM]
| ↓
| [Answer Generated]

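
The retrieval step of this pipeline can be illustrated without any external service. The sketch below is a toy stand-in: it ranks page texts by bag-of-words cosine similarity instead of real embeddings, and a plain dictionary replaces the vector store (FAISS or Chroma). The page texts, the names ``tokenize``, ``cosine``, and ``retrieve``, and the prompt wording are all made up for the example.

.. code-block:: python

   import math
   import re
   from collections import Counter


   def tokenize(text):
       """Lowercase the text and split it into word tokens."""
       return re.findall(r"\w+", text.lower())


   def cosine(a, b):
       """Cosine similarity between two token-count vectors."""
       dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
       norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
       return dot / norm if norm else 0.0


   def retrieve(query, pages, top_k=1):
       """Rank page texts by similarity to the query; return the best page ids."""
       q = Counter(tokenize(query))
       scored = [(cosine(q, Counter(tokenize(text))), page_id) for page_id, text in pages.items()]
       scored.sort(reverse=True)
       return [page_id for score, page_id in scored[:top_k] if score > 0]


   # Stand-ins for the contents of two extracted .txt files
   pages = {
       "new_page_1": "Quarterly sales grew in Q3 driven by online orders.",
       "new_page_2": "The appendix lists office locations and contacts.",
   }
   context = retrieve("What are the sales trends in Q3?", pages)
   prompt = f"Answer using only this context: {pages[context[0]]}"

In a real pipeline the counting step would be replaced by an embedding model and the dictionary by a vector index, but the flow stays the same: score the stored pages against the query, keep the best matches, and pass them to the LLM as context.
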
📚 Technologies Used
____________________

.. list-table::
   :header-rows: 1

   * - Library
     - Purpose
   * - PyPDF2
     - Splitting PDFs
   * - pdfplumber
     - Extracting text, images, tables
   * - spaCy
     - NLP for caption generation
   * - Pillow
     - Image cropping and saving
   * - matplotlib
     - (Optional) image visualization

Links:

- Colab Notebook - PDF Structuration
- GitHub Repository - PDF Extraction Script