Stop struggling with text extraction! Kreuzberg, a one-stop text extraction tool, pulls text out of PDFs, images, office documents, and many other file formats.

Hello everyone, I am Brother Six. Many of you have probably had to extract text from all kinds of documents and know how tedious the process can be. Today I want to share a very practical modern Python library, Kreuzberg, that makes text extraction much easier.

1. What problems does Kreuzberg solve?

Many text extraction tools either rely on external API calls or need complex configuration, which makes them inconvenient to use. Kreuzberg was designed to meet the text extraction needs of RAG (retrieval-augmented generation) applications, but it is useful well beyond that and fits most text extraction scenarios. It focuses on local processing, has few dependencies, and is simple and efficient.

2. Kreuzberg's key features

Universal text extraction: Whether the input is a searchable PDF, a scanned PDF, an image, or an office document, Kreuzberg can extract the text from it. For example, it can pull key terms out of a contract PDF or grab the text from a product promotional image.
Intelligent processing: Scanned documents are recognized automatically with OCR, and the encoding of text files is detected automatically. For example, when you process text from different sources, the encoding is identified for you, which avoids garbled characters.
Modern Python design: Kreuzberg offers an async-first API built on anyio, ships comprehensive type hints that make development in an IDE easier, and provides detailed error handling with context information.
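
To make the encoding detection idea concrete, here is a minimal sketch of a decode-with-fallback routine that uses only the standard library. It is not Kreuzberg's actual detection logic, which this article does not describe; the function name `decode_text` and the candidate list are assumptions chosen for illustration.

```python
from typing import Tuple


def decode_text(
    data: bytes,
    candidates: Tuple[str, ...] = ("utf-8-sig", "latin-1"),
) -> Tuple[str, str]:
    """Try each candidate encoding in order and return (text, encoding_used).

    Hypothetical helper for illustration; not part of Kreuzberg.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    # Last resort: decode as UTF-8 and mark undecodable bytes visibly.
    return data.decode("utf-8", errors="replace"), "utf-8"
```

A real detector would usually score many encodings statistically; the fixed candidate order here only shows the general shape of the fallback.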

3. What sets Kreuzberg apart

Simple and convenient: It provides a simple API that runs without complicated configuration, so even a complete novice can get started easily.
Local processing: It does not call external APIs or rely on cloud services, so your data stays on your machine and extraction works without a network connection.
Resource efficient: Processing is lightweight and does not need a GPU, so it runs smoothly on ordinary computers and saves hardware costs.
Broad format support: It supports a wide range of document, image, and text formats, which covers most everyday needs.

4. Getting started is simple

Install
- Install the Python package: `pip install kreuzberg`
- Install the system dependencies: Pandoc (for document format conversion) and Tesseract OCR (for optical character recognition on images and PDFs) are required. Follow each project's own installation guide.
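
Before running any extraction, it can help to confirm that both system tools are actually reachable. The following is a minimal sketch using only the standard library; `find_system_dependencies` is a hypothetical helper, not something Kreuzberg provides, and it assumes the executables are installed under their usual names `pandoc` and `tesseract`.

```python
import shutil
from typing import Dict, Optional, Tuple


def find_system_dependencies(
    commands: Tuple[str, ...] = ("pandoc", "tesseract"),
) -> Dict[str, Optional[str]]:
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {command: shutil.which(command) for command in commands}


if __name__ == "__main__":
    for command, location in find_system_dependencies().items():
        print(f"{command}: {location or 'NOT FOUND'}")
```

This only checks that the commands exist on `PATH`; it does not verify their versions.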
Basic use
Kreuzberg provides a simple asynchronous text extraction API with two main functions:
- `extract_file()`: Extracts text from a file. It accepts a string path or a `pathlib.Path`.

from pathlib import Path
from kreuzberg import extract_file

# Basic file extraction
async def extract_document():
    # Extract from a PDF file
    pdf_result = await extract_file("")
    print(f"PDF text: {pdf_result.content}")

    # Extract from an image
    img_result = await extract_file("")
    print(f"Image text: {img_result.content}")

    # Extract from a Word document
    docx_result = await extract_file(Path(""))
    print(f"Word text: {docx_result.content}")

- `extract_bytes()`: Extracts text from in-memory content passed as a byte string, together with its MIME type. For example, when processing uploaded files:

from kreuzberg import extract_bytes

async def process_upload(file_content: bytes, mime_type: str) -> str:
    """Process uploaded file content of a known MIME type."""
    result = await extract_bytes(file_content, mime_type=mime_type)
    return result.content

# Example usage with different file types
async def handle_uploads(pdf_bytes: bytes, image_bytes: bytes, docx_bytes: bytes):
    # Process a PDF upload
    pdf_result = await extract_bytes(pdf_bytes, mime_type="application/pdf")

    # Process an image upload
    img_result = await extract_bytes(image_bytes, mime_type="image/jpeg")

    # Process a Word document upload
    docx_result = await extract_bytes(
        docx_bytes,
        mime_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
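
The coroutines above only do work when an event loop runs them. Below is a minimal sketch of driving such coroutines from synchronous code and of extracting several files concurrently. To keep it runnable without Kreuzberg installed, the extraction function is passed in as a parameter: `extract_many` and `fake_extract` are hypothetical names, the file names are placeholders, and `fake_extract` is only a stand-in for a real extraction coroutine. The sketch also assumes that Kreuzberg's anyio-based coroutines are run under plain `asyncio`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def extract_many(
    paths: Iterable[str],
    extract: Callable[[str], Awaitable[str]],
) -> Dict[str, str]:
    """Run `extract` on every path concurrently and map path -> extracted text."""
    path_list = list(paths)
    texts = await asyncio.gather(*(extract(path) for path in path_list))
    return dict(zip(path_list, texts))


async def fake_extract(path: str) -> str:
    """Stand-in for a real extraction coroutine (hypothetical, for demonstration)."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return f"text from {path}"


if __name__ == "__main__":
    results = asyncio.run(extract_many(["a.pdf", "b.png"], fake_extract))
    print(results)
```

With Kreuzberg, the `extract` argument could be a small wrapper coroutine that awaits `extract_file(path)` and returns the result's `content`.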

Advanced features
- PDF processing options: OCR can be forced on PDFs that contain embedded images or scanned content.

from kreuzberg import extract_file

async def process_pdf():
    # Force OCR on PDFs containing images or scanned content
    result = await extract_file("", force_ocr=True)

    # Process a scanned PDF (OCR is used automatically)
    scanned = await extract_file("")

- **Extraction result object**: Every extraction function returns an object that contains the extracted text (`content`) and the output format (`mime_type`).

from kreuzberg import ExtractionResult, extract_file

async def process_document(path: str) -> tuple[str, str]:
    # Access the fields as on a named tuple
    result: ExtractionResult = await extract_file(path)
    print(f"Content: {result.content}")
    print(f"Format: {result.mime_type}")

    # Or unpack it as a tuple
    content, mime_type = await extract_file(path)
    return content, mime_type

- **Error handling**: Kreuzberg provides comprehensive error handling through several exception types. All of them inherit from `KreuzbergError`, and each carries context information that makes debugging easier.

from kreuzberg import extract_file
from kreuzberg.exceptions import (
    ValidationError,
    ParsingError,
    OCRError,
    MissingDependencyError,
)

async def safe_extract(path: str) -> str:
    try:
        result = await extract_file(path)
        return result.content

    except ValidationError as e:
        # Input validation problems
        # - MIME type is unsupported or cannot be detected
        # - File is missing
        # - Invalid input parameters
        print(f"Validation failed: {e}")

    except OCRError as e:
        # OCR-specific problems
        # - Tesseract processing failed
        # - Image conversion problem
        print(f"OCR failed: {e}")

    except MissingDependencyError as e:
        # System dependency problems
        # - Tesseract OCR is missing
        # - Pandoc is missing
        # - Incompatible version
        print(f"Dependency missing: {e}")

    except ParsingError as e:
        # General processing errors
        # - PDF parsing failed
        # - Format conversion problem
        # - Encoding issues
        print(f"Processing failed: {e}")

    # Fall back to an empty string when extraction failed
    return ""

# Example error context
async def show_error_context() -> None:
    try:
        result = await extract_file("")
    except ValidationError as e:
        # The error will contain the context:
        # ValidationError: Unsupported mime type
        # Context: {
        # "file_path": "",
        # "supported_mimetypes": ["application/pdf",...]
        # }
        print(e)

    try:
        result = await extract_file("")
    except OCRError as e:
        # The error will contain the context:
        # OCRError: OCR failed with a non-zero return code
        # Context: {
        # "file_path": "",
        # "tesseract_version": "5.3.0"
        # }
        print(e)
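
How an exception can carry debugging context is easiest to see in a small sketch. The classes below are hypothetical stand-ins written for this article, not Kreuzberg's own implementation; only the idea of a shared base class plus a context dictionary comes from the description above.

```python
from typing import Any, Dict, Optional


class ExtractionError(Exception):
    """Hypothetical base class that carries a context dict alongside the message."""

    def __init__(self, message: str, context: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.message = message
        self.context = dict(context or {})

    def __str__(self) -> str:
        if not self.context:
            return self.message
        return f"{self.message} (context: {self.context})"


class UnsupportedTypeError(ExtractionError):
    """Hypothetical subclass; catching ExtractionError also catches this."""


def describe_failure(error: ExtractionError) -> str:
    """Build a one-line log message from the error type and its context keys."""
    keys = ", ".join(sorted(error.context)) or "none"
    return f"{type(error).__name__}: {error.message} [context keys: {keys}]"
```

Because every subclass shares one base, a caller can handle specific failures first and still keep a single catch-all `except` for the rest.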

5. A wide range of supported formats

Document formats: PDF (both searchable and scanned), Microsoft Word (.docx, .doc), PowerPoint presentations (.pptx), OpenDocument text (.odt), Rich Text Format (.rtf), EPUB (.epub), DocBook XML (.dbk, .xml), FictionBook (.fb2), LaTeX (.tex, .latex), Typst (.typ).
Markup and text formats: HTML (.html, .htm), plain text (.txt), Markdown (.md, .markdown), reStructuredText (.rst), Org-mode (.org), DokuWiki (.txt), Pod (.pod), man pages (.1, .2, etc.).
Data and research formats: Excel spreadsheets (.xlsx), CSV (.csv) and TSV (.tsv) files, Jupyter Notebooks (.ipynb), BibTeX (.bib) and BibLaTeX (.bib), CSL-JSON (.json), EndNote XML (.xml), RIS (.ris), JATS XML (.xml).
Image formats: JPEG (.jpg, .jpeg, .pjpeg), PNG (.png), TIFF (.tiff, .tif), BMP (.bmp), GIF (.gif), WebP (.webp), JPEG 2000 (.jp2, .jpx, .jpm, .mj2), Portable Anymap (.pnm), Portable Bitmap (.pbm), Portable Graymap (.pgm), and Portable Pixmap (.ppm).
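
Since `extract_bytes()` takes the MIME type as an argument, the caller has to supply it for whichever of these formats is being processed. One possible approach, sketched here with the standard library's `mimetypes` module, is to derive it from the original filename. `guess_mime_type` is a hypothetical helper, not part of Kreuzberg, and the result depends on the platform's MIME database, so less common extensions may come back as `None`.

```python
import mimetypes
from pathlib import Path
from typing import Optional, Union


def guess_mime_type(filename: Union[str, Path]) -> Optional[str]:
    """Return the MIME type implied by the file extension, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(str(filename))
    return mime_type
```

A caller would typically reject the upload, or ask the user for the type, when this returns `None`.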

6. A clever architecture

Kreuzberg is designed as a high-level asynchronous abstraction on top of existing open source tools, and it integrates several of them to deliver its features:

PDF processing: pypdfium2 handles searchable PDF files, and Tesseract OCR handles scanned content.
Document conversion: Pandoc covers a wide variety of document and markup formats, python-pptx handles PowerPoint files, html-to-markdown handles HTML content, and a dedicated tool handles Excel spreadsheets.
Text processing: Kreuzberg implements intelligent encoding detection, as well as Markdown and plain text processing.
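
The division of labor described above can be pictured as a simple dispatch on MIME type. The sketch below is an illustration of that idea only, not Kreuzberg's internal code: `pick_backend`, the `has_text_layer` flag, and the returned labels are all assumptions made for this example.

```python
PPTX_MIME = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def pick_backend(mime_type: str, has_text_layer: bool = True) -> str:
    """Return a label for the tool that would handle this MIME type in the sketch."""
    if mime_type == "application/pdf":
        # Searchable PDFs are read directly; scanned ones need OCR.
        return "pypdfium2" if has_text_layer else "tesseract"
    if mime_type.startswith("image/"):
        return "tesseract"
    if mime_type == PPTX_MIME:
        return "python-pptx"
    if mime_type == XLSX_MIME:
        return "spreadsheet tool"  # the article does not name this tool
    if mime_type == "text/html":
        return "html-to-markdown"
    if mime_type in ("text/plain", "text/markdown"):
        return "plain text"  # decoded directly after encoding detection
    return "pandoc"  # everything else goes through document conversion
```

For instance, `pick_backend("application/pdf", has_text_layer=False)` selects the OCR path, while an RTF file falls through to Pandoc.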

If you are interested in Kreuzberg and want to learn more or take part in its development, you can visit the project link: /Goldziher/kreuzberg