fbernhart/officeextractor: officeextractor extracts media files (images, videos, music) from Microsoft Office and LibreOffice files.

About

officeextractor is a Python library to extract media files like images, audio and video from office documents (Microsoft Office & LibreOffice).

Supported File Types

| Application | Supported File Types | Supported Media Formats |
|---|---|---|
| Microsoft Word | docx, docm, dotm, dotx | images |
| Microsoft Excel | xlsx, xlsb, xlsm, xltm, xltx | images |
| Microsoft PowerPoint | potx, ppsm, ppsx, pptm, pptx, potm | images, video & audio |
| LibreOffice Writer | odt, ott | images |
| LibreOffice Calc | ods, ots | images |
| LibreOffice Impress | odp, otp, odg | images |

NOTE: Legacy binary Microsoft Office files (Office 97-2003 formats: doc, dot, xls, xlt, ppt, pot) are not supported.
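
Because some formats are not supported, it can help to filter file names before handing them to the library. The following is a minimal sketch, not part of officeextractor: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, the extension set is copied from the table above, and lower-casing the extension before comparing is an assumption about how you want file names matched.

```python
from pathlib import Path

# Extensions copied from the "Supported File Types" table above.
# This set is a hypothetical helper, not an officeextractor constant.
SUPPORTED_EXTENSIONS = {
    "docx", "docm", "dotm", "dotx",                     # Microsoft Word
    "xlsx", "xlsb", "xlsm", "xltm", "xltx",             # Microsoft Excel
    "potx", "ppsm", "ppsx", "pptm", "pptx", "potm",     # Microsoft PowerPoint
    "odt", "ott",                                       # LibreOffice Writer
    "ods", "ots",                                       # LibreOffice Calc
    "odp", "otp", "odg",                                # LibreOffice Impress
}


def is_supported(path):
    """Return True if the file's extension appears in the table above."""
    # Assumption: compare case-insensitively, so "Slides.PPTX" counts as pptx.
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS
```

A list of candidate files could then be narrowed with `[f for f in files if is_supported(f)]` before being passed on.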

Installation

pip install officeextractor

Usage

>>> import officeextractor

>>> officeextractor.extract(src=("File1.docx", "Folder/File2.xlsx"), dest="Path/To/Output/Folder")

4 media files extracted from File1.docx:
- 2 jpeg
- 1 gif
- 1 png

1 media file extracted from Folder/File2.xlsx:
- 1 png

Parameters

officeextractor.extract(src, dest, log=True)

src : str, list of str or tuple of str

Either a single file (string) or several files (list or tuple of strings), given as relative or full paths.

dest : str

Output directory as relative or full path. If the directory doesn't exist, it will be created.

log : bool, optional

Whether logging should be activated or not. If True, a summary of the extraction is printed. Default is True.
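
To illustrate passing a list as `src` and turning off the printed summary with `log=False`, here is a small sketch that gathers office files from a folder and extracts their media in one call. `collect_office_files` and `extract_folder` are hypothetical helpers written for this example, not part of officeextractor, and the default extension tuple is just a sample choice.

```python
from pathlib import Path


def collect_office_files(folder, extensions=(".docx", ".xlsx", ".pptx")):
    """Return sorted path strings of files under `folder` with a matching extension."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def extract_folder(folder, dest):
    """Extract media from every matching file under `folder` into `dest`."""
    # Imported here so collect_office_files can be used without the library installed.
    import officeextractor

    files = collect_office_files(folder)
    if files:
        # src accepts a list of paths; log=False suppresses the printed summary.
        officeextractor.extract(src=files, dest=dest, log=False)
    return files
```

Calling `extract_folder("Documents", "Output/Media")` (both folder names are placeholders) would return the list of files that were passed to `extract`.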

Release Notes

The release notes can be found in the project's GitHub repository.

Licence

GNU General Public License v3.0