If your PDF is a scanned image of Khmer text, you need OCR. The verified combination is pdf2image + pytesseract with the Khmer (khm) language pack for Tesseract.
    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    pdfmetrics.registerFont(TTFont('KhmerFont', 'KhmerOSBattambang.ttf'))
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path("scanned_khmer.pdf")  # placeholder path: use your own scanned PDF

    for i, page in enumerate(pages):
        # Use 'khm' to select the Khmer language data
        text = pytesseract.image_to_string(page, lang='khm')
        print(f"Page {i+1} verified text:\n{text}")

Before running any Python script, you can verify whether a PDF contains real Khmer text (not just images) with a simple script.
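One way to make that check concrete is to measure how much of the extracted text falls inside the Khmer Unicode blocks (U+1780 to U+17FF, plus the Khmer Symbols block U+19E0 to U+19FF). The sketch below is only an illustration: the helper names and the 0.3 threshold are assumptions, not part of any library, and the function works on a plain string so it can be fed the output of whichever extractor you use.

```python
def is_khmer(ch: str) -> bool:
    """True if the character lies in the Khmer or Khmer Symbols Unicode block."""
    cp = ord(ch)
    return 0x1780 <= cp <= 0x17FF or 0x19E0 <= cp <= 0x19FF


def khmer_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Khmer (0.0 for empty text)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(1 for ch in chars if is_khmer(ch)) / len(chars)


def has_real_khmer_text(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: treat the text as real Khmer if enough characters are Khmer.

    The threshold is an assumed default; tune it for mixed-language documents.
    """
    return khmer_ratio(text) >= threshold
```

If the extracted text of a page is empty, or the ratio is near zero for a document you know is in Khmer, the page is probably a scanned image (or uses a broken encoding) and needs the OCR route.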
    sudo apt-get install tesseract-ocr-khm
    pip install pdf2image pytesseract
    import fitz  # PyMuPDF

    doc = fitz.open("broken_khmer.pdf")
    for page in doc:
        text = page.get_text()
        print(text)  # Often better than pdfminer for complex scripts

Cause: The PDF uses a custom encoding map.
Verified Fix: Re-generate the PDF using weasyprint (HTML to PDF), which uses HarfBuzz for shaping.
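As a rough sketch of that re-generation step, the snippet below wraps Khmer text in a minimal HTML page and hands it to weasyprint. The function names are hypothetical, and the font family "Khmer OS Battambang" is an assumption: it must match a Khmer font actually installed on your system. The weasyprint import sits inside the rendering function so the HTML builder can be used and tested on its own.

```python
from html import escape


def build_khmer_html(text: str, font_family: str = "Khmer OS Battambang") -> str:
    """Wrap each non-empty line of text in a <p> inside a minimal HTML page."""
    body = "".join(
        f"<p>{escape(line)}</p>" for line in text.splitlines() if line.strip()
    )
    return (
        '<!DOCTYPE html><html lang="km"><head><meta charset="utf-8">'
        f"<style>body {{ font-family: '{font_family}'; }}</style>"
        f"</head><body>{body}</body></html>"
    )


def write_pdf(html: str, out_path: str) -> None:
    """Render the HTML string to a PDF file with weasyprint."""
    from weasyprint import HTML  # imported lazily; requires weasyprint to be installed

    HTML(string=html).write_pdf(out_path)
```

A typical call would be write_pdf(build_khmer_html(khmer_text), "fixed_khmer.pdf"), where khmer_text and the output filename are placeholders.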
To extract Khmer text from an existing PDF, pdfminer.six is a reliable choice. However, you must bypass its default fallback fonts.