PDF Pre-Processing Before OCR with OpenCV
This project demonstrates how to convert PDF files into images and preprocess them with OpenCV to improve Optical Character Recognition (OCR) accuracy. The preprocessing steps are grayscale conversion, noise removal, Gaussian blurring, and binarization.
Features
- Convert PDFs to Images: Uses pdf2image to extract PDF pages as JPEG images.
- Grayscale Conversion: Simplifies the image for further processing.
- Noise Removal: Applies dilation and erosion to clean up the image.
- Gaussian Blur: Smooths the image to suppress remaining high-frequency noise.
- Binarization: Converts the image to black-and-white for OCR using Otsu's threshold.
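The PDF-to-image step listed above could look roughly like the following. This is a minimal sketch, not the contents of pre_processing.py: the function names `page_image_path` and `pdf_to_jpegs`, the `<pdf-name>_page_<n>.jpg` naming scheme, and the 300 DPI default are assumptions made for illustration. Only `convert_from_path` comes from pdf2image.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="images"):
    """Build the output path for one page (naming scheme is an assumption)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}_page_{page_number}.jpg"


def pdf_to_jpegs(pdf_path, out_dir="images", dpi=300):
    """Render every page of a PDF to a JPEG file and return the saved paths."""
    # Imported here so the path helper above works without pdf2image installed.
    from pdf2image import convert_from_path

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    saved = []
    # convert_from_path returns one PIL image per page (requires poppler).
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number, out_dir)
        page.save(target, "JPEG")
        saved.append(target)
    return saved


# Example (hypothetical file name):
# pdf_to_jpegs("pdfs/sample.pdf")
```

A higher DPI generally gives OCR more pixels per character at the cost of larger files, so the default here is a starting point rather than a recommendation.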
Directory Structure
project/
├── pdfs/ # Place your PDF files here
├── images/ # Extracted images will be saved here
├── pre_processing.py # Main Python script
└── README.md # This README file
Dependencies
Make sure you have the following libraries installed:
OS Packages
sudo apt-get update && sudo apt-get install -y poppler-utils tesseract-ocr
Python Packages
pip install opencv-python pdf2image pillow pytesseract
How to Use
- Clone this repository:
  git clone <your-repo-url>
  cd project
- Place your PDFs in the pdfs/ directory.
- Run the script:
  python pre_processing.py
- Check the images/ directory for the extracted and processed images.
Pre-Processing Techniques Used
- Grayscale Conversion: Reduces the image to a single color channel for easy processing.
- Dilation & Erosion: Cleans up noise and connects broken parts of objects.
- Gaussian Blur: Smooths out small variations in the image.
- Binarization: Converts the image to black-and-white for better OCR performance.
Example Output
After running the script, you should see the processed images saved in the images/ directory.
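To check whether the processed images are readable by Tesseract, something like the following could be run against the images/ directory. It is a sketch that assumes the images are saved with a .jpg or .jpeg extension; `list_processed_images` and `ocr_image` are hypothetical helper names, while `pytesseract.image_to_string` is the library call doing the recognition.

```python
from pathlib import Path


def list_processed_images(image_dir="images"):
    """Return the JPEG files in image_dir, sorted by path."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in {".jpg", ".jpeg"}
    )


def ocr_image(image_path):
    """Run Tesseract on one image and return the recognized text."""
    # Imported here so listing files works without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


# Example:
# for path in list_processed_images():
#     print(path.name, ocr_image(path)[:80])
```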
License
This project is licensed under the MIT License. See the LICENSE file for more details.