GitHub - fbernhart/officeextractor: officeextractor extracts media files (images, videos, music) from Microsoft Office and LibreOffice files.

About

officeextractor is a Python library to extract media files like images, audio and video from office documents (Microsoft Office & LibreOffice).

Supported File Types

Application	File Types	Supported Media Formats
Microsoft Word	docx, docm, dotm, dotx	images
Microsoft Excel	xlsx, xlsb, xlsm, xltm, xltx	images
Microsoft PowerPoint	potx, ppsm, ppsx, pptm, pptx, potm	images, video & audio
LibreOffice Writer	odt, ott	images
LibreOffice Calc	ods, ots	images
LibreOffice Impress / Draw	odp, otp, odg	images

⚠ NOTE: Microsoft Office 2003 files (doc, dot, xls, xlt, ppt, pot) are not supported.
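Because the older binary formats are not supported, it can be useful to filter a list of inputs before handing it to the library. The sketch below is not part of officeextractor: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, the extension list is copied from the table above, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path

# Extensions copied from the "Supported File Types" table above.
# This set and the helper below are illustrative, not part of officeextractor.
SUPPORTED_EXTENSIONS = {
    "docx", "docm", "dotm", "dotx",                     # Microsoft Word
    "xlsx", "xlsb", "xlsm", "xltm", "xltx",             # Microsoft Excel
    "potx", "ppsm", "ppsx", "pptm", "pptx", "potm",     # Microsoft PowerPoint
    "odt", "ott",                                       # LibreOffice Writer
    "ods", "ots",                                       # LibreOffice Calc
    "odp", "otp", "odg",                                # LibreOffice Impress
}


def is_supported(path):
    """Return True if the file extension appears in the table above."""
    # Assumption: extensions are compared case-insensitively.
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


candidates = ["File1.docx", "Folder/File2.xlsx", "Old/Report.doc"]
supported = [p for p in candidates if is_supported(p)]
print(supported)  # ['File1.docx', 'Folder/File2.xlsx']
```

The filtered list can then be passed as `src` to `officeextractor.extract`.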

Installation

pip install officeextractor

Usage

>>> import officeextractor

>>> officeextractor.extract(src=("File1.docx", "Folder/File2.xlsx"), dest="Path/To/Output/Folder")

4 media files extracted from File1.docx:
- 2 jpeg
- 1 gif
- 1 png

1 media file extracted from Folder/File2.xlsx:
- 1 png

Parameters

officeextractor.extract(src, dest, log=True)

src : str, list of str or tuple of str

Either a single file (string) or several files (list/tuple of strings) as relative or full path.

dest : str

Output directory as relative or full path. If the directory doesn't exist, it will be created.

log : bool, optional

Whether logging should be activated. If True, a summary of the extraction is printed. Default is True.
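To illustrate the parameters above, here is a minimal sketch that gathers matching files from a folder and passes them to `extract` as a list with the summary turned off. The helpers `collect_sources` and `extract_folder` are hypothetical names introduced for this example; only the call `officeextractor.extract(src, dest, log)` comes from this README.

```python
from pathlib import Path


def collect_sources(folder, extensions=("docx", "xlsx", "pptx")):
    """Return sorted file paths (as str) in `folder` with one of the extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower().lstrip(".") in wanted
    )


def extract_folder(folder, dest):
    """Extract media from all matching files in `folder` without a printed summary."""
    # Imported here so collect_sources can be used without the package installed.
    import officeextractor

    sources = collect_sources(folder)
    if sources:
        # src accepts a list of str; log=False suppresses the printed summary.
        officeextractor.extract(src=sources, dest=dest, log=False)
    return sources
```

A call such as `extract_folder("Folder", "Path/To/Output/Folder")` would process every matching file in one go. A single file can still be passed directly as a string, as in `officeextractor.extract(src="File1.docx", dest="Path/To/Output/Folder")`.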

Release Notes

Release notes are available on GitHub.

Licence

GNU General Public License v3.0