PDFTextStripper - incorrect text extraction results

Hello,

I am using PDFTextStripper, from the PDFBox library, to extract the text from a PDF generated from HTML using openhtmltopdf.

Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);

However, I am seeing a few problems:

  1. Invisible, redundant text
    Sometimes the PDF has invisible text in front of the actual text.
    e.g.

HTML:
line1
line2
line3

PDF:
line1
line2 (<--- invisible)
line2
line3

This happens even when you just open the PDF and select / copy the text.

  2. Commas are placed in the wrong position when parsed
    Commas show up correctly in the rendered PDF, but in the parsed text they appear in the wrong position.
    e.g.
    HTML:
    hello, my name, is

Parsed text:
,,hello my name is

Note: this does not happen when you open the PDF and select / copy the text.
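
To make the symptom concrete, here is a toy model in plain Java of what I suspect is happening. This is only a guess that I have not confirmed against the PDF: the commas sit at the right coordinates on the page but are written to the content stream before the words. `Fragment`, `assumedStreamOrder` and the x values are hypothetical and made up for illustration; nothing here is a PDFBox type or call.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class StreamOrderSketch {

    /** Hypothetical positioned text fragment; NOT a PDFBox class. */
    static final class Fragment {
        final String text;
        final double x; // left edge on the line, arbitrary units

        Fragment(String text, double x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Concatenates the fragments in the order they are given. */
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Returns a copy of the fragments ordered left to right by x position. */
    static List<Fragment> sortByX(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        return sorted;
    }

    /** Assumed (unverified) stream order: both commas are written before the words. */
    static List<Fragment> assumedStreamOrder() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(",", 30));
        fragments.add(new Fragment(",", 80));
        fragments.add(new Fragment("hello", 0));
        fragments.add(new Fragment(" my name", 35));
        fragments.add(new Fragment(" is", 85));
        return fragments;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = assumedStreamOrder();
        System.out.println(join(fragments));          // ,,hello my name is
        System.out.println(join(sortByX(fragments))); // hello, my name, is
    }
}
```

Reading the fragments in the assumed stream order reproduces the output I get, while reading them by position reproduces what a PDF viewer gives on select / copy.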

  3. Interestingly, the comma problem goes away when I parse like this:
    final PDDocument document = PDDocument.load(pdfBytes);
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    pdfTextStripper.setSortByPosition(true);
    return pdfTextStripper.getText(document);

However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións.
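
The two strings in that example differ only in which letter carries the acute accent. The small sketch below shows this with the standard `java.text.Normalizer`, assuming the accent is stored in the PDF as a separate combining mark (U+0301) rather than as part of the letter. I have not verified that assumption, and the class name `AccentPlacementSketch` is just for illustration.

```java
import java.text.Normalizer;

public class AccentPlacementSketch {

    /** Composes base letters and combining marks into single characters (NFC). */
    static String compose(String decomposed) {
        return Normalizer.normalize(decomposed, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String accent = "\u0301"; // COMBINING ACUTE ACCENT

        // Accent emitted right after 'n': the text as it appears in the PDF.
        String expected = compose("reception" + accent + "s");

        // Accent emitted right after 'o': the text I get with sorting enabled.
        String observed = compose("receptio" + accent + "ns");

        System.out.println(expected); // receptiońs
        System.out.println(observed); // receptións
    }
}
```

If that assumption holds, the accent only has to move one position in the extracted character sequence to produce the output I am seeing.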

Do you know why this happens?

Thank you!