Add a check for complex pdfs by aadland6 · Pull Request #4268

Add a check for complex pdfs by aadland6 · Pull Request #4268 · Unstructured-IO/unstructured

Claude finished @qued's task —— View job

Code Review and Testing Suggestions

Todo List:

Read and analyze the changed files
Review the is_pdf_too_complex function implementation
Address previous review comments about regex compilation
Address formatting/line length issues
Suggest comprehensive unit testing approach
Provide final code review feedback

Code Review Feedback

✅ Strengths

Good performance optimization: Early exits based on file size and stream size minimize overhead
Clear algorithm: Well-documented approach using graphics-to-text ratio for complexity detection
Robust error handling: Proper exception handling with logging and cursor position restoration
Defensive programming: Handles various file input types (bytes, file objects, filenames)

🔧 Issues to Address

1. Regex Compilation Performance (Previous Review Comment)

Location: unstructured/partition/pdf.py:95-105

The regex patterns are currently compiled inside the module scope, which is correct. However, looking at the previous review comment about compiling outside the function, I believe this was already addressed properly. The patterns GRAPHICS_OPS_PATTERN and TEXT_OPS_PATTERN are compiled at module level (lines 95-105), not inside the function.

2. Line Length Formatting Issues (Previous Review Comment)

Location: Multiple locations in unstructured/partition/pdf.py

Several lines exceed the 100-character limit specified in your pyproject.toml. Key examples:

Line 598: min_graphics_to_text_ratio: float = 20.0, (part of function signature that's too long)
Line 599: min_file_size_bytes: int = int(1 * 1024 * 1024), # 1 MB
Line 707-713: The logging message is split well but could be formatted more consistently

3. Magic Numbers Could Be Constants

Location: unstructured/partition/pdf.py:599-601

Consider defining module-level constants:

DEFAULT_MIN_FILE_SIZE_MB = 1 * 1024 * 1024  # 1 MB
DEFAULT_MIN_RAW_STREAM_BYTES = 100_000      # 100 KB

4. Type Hints Enhancement

Location: unstructured/partition/pdf.py:594

The function signature could benefit from more explicit typing for the file parameter:

def is_pdf_too_complex(
    filename: str = "",
    file: Optional[Union[bytes, IO[bytes]]] = None,
    # ...

5. Potential Division by Zero Edge Case

Location: unstructured/partition/pdf.py:704

While max(num_text_ops, 1) prevents division by zero, consider logging when num_text_ops == 0 as this might indicate an unusual PDF structure.

🧪 Comprehensive Unit Testing Strategy

Here's a detailed testing approach for the is_pdf_too_complex function:

Test Categories

Basic Functionality Tests

def test_simple_text_pdf_not_complex():
    """Test that a simple text PDF is not flagged as complex"""
    # Use a minimal PDF with mostly text content
    
def test_complex_vector_pdf_detected():
    """Test that a CAD/vector-heavy PDF is correctly identified"""
    # Create or use a PDF with high graphics-to-text ratio

Edge Case Tests

def test_small_file_size_skipped():
    """Test that files below min_file_size_bytes are skipped"""
    # Create tiny PDF, ensure function returns False without analysis
    
def test_empty_pdf_handled():
    """Test handling of empty or corrupted PDF files"""
    
def test_pdf_with_no_content_streams():
    """Test PDFs where pages have no /Contents"""
    
def test_small_content_streams_skipped():
    """Test that pages with small content streams are skipped"""

Input Type Tests

def test_filename_input():
    """Test function with filename string parameter"""
    
def test_bytes_input():
    """Test function with bytes parameter"""
    
def test_file_object_input():
    """Test function with file-like object parameter"""
    
def test_file_cursor_position_preserved():
    """Ensure file cursor position is restored after analysis"""

Threshold Boundary Tests

def test_graphics_ops_threshold_boundary():
    """Test behavior at max_graphics_ops threshold boundaries"""
    # Test with exactly max_graphics_ops, max_graphics_ops+1, etc.
    
def test_ratio_threshold_boundary():
    """Test behavior at ratio threshold boundaries"""
    # Test with ratios just above/below min_graphics_to_text_ratio

Error Handling Tests

def test_corrupted_pdf_handling():
    """Test graceful handling of corrupted PDF files"""
    
def test_password_protected_pdf():
    """Test behavior with password-protected PDFs"""
    
def test_file_access_errors():
    """Test handling when file cannot be accessed"""

Performance Tests

def test_large_file_performance():
    """Verify performance optimizations work on large files"""
    # Ensure function completes quickly on large files
    
def test_many_pages_performance():
    """Test performance with PDFs containing many pages"""

Test Data Strategies

Generated Test PDFs: Create minimal PDFs programmatically with controlled graphics/text ratios
Real-world Samples: Include sanitized examples of CAD drawings, technical diagrams, and text documents
Synthetic Content: Use pypdf to create PDFs with specific operator counts for precise threshold testing

Mock Testing for Isolation

def test_regex_pattern_matching():
    """Test regex patterns match expected PDF operators"""
    # Mock the content stream bytes and verify regex matches
    
def test_file_size_calculation():
    """Test file size determination logic across input types"""
    # Mock different input types and verify size calculation

Integration Testing Considerations

Test integration with the main partition_pdf_or_image function
Verify logging messages are appropriate and informative
Test that complex PDFs properly fall back to hi_res strategy

📋 Summary

The is_pdf_too_complex function is well-designed with good performance optimizations. The main issues are formatting-related (line length) and minor improvements around constants and logging. The testing strategy should focus on boundary conditions, different input types, and real-world PDF complexity scenarios.

Priority fixes:

Address line length issues for consistency with project standards
Consider extracting magic numbers to constants
Add comprehensive unit tests as outlined above