Converting files from .doc
(or .docx
) to .pdf
format using Python can be achieved through several methods, each with its own set of requirements and capabilities. This article explores different Python libraries and approaches, providing a detailed guide for automating this conversion process.
Automating the conversion of .doc
files to .pdf
using Python offers numerous benefits:
comtypes
and Microsoft WordThis method leverages the comtypes
library to directly interact with Microsoft Word. It requires MS Word to be installed on the machine.
import sys
import os
import comtypes.client
def convert_to_pdf(input_file, output_file):
wdFormatPDF = 17
in_file = os.path.abspath(input_file)
out_file = os.path.abspath(output_file)
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
if __name__ == "__main__":
input_doc = sys.argv[1]
output_pdf = sys.argv[2]
convert_to_pdf(input_doc, output_pdf)
sys
, os
, and comtypes.client
.convert_to_pdf
Function:
wdFormatPDF
as 17, which is the code for PDF format in MS Word.comtypes.client.CreateObject('Word.Application')
..doc
file using word.Documents.Open(in_file)
..pdf
file using doc.SaveAs(out_file, FileFormat=wdFormatPDF)
.convert_to_pdf
function to perform the conversion.word.Visible = False
to run the conversion in the background, saving processing time.docx2pdf
PackageThe docx2pdf
package simplifies the process but also relies on Microsoft Office.
pip install docx2pdf
from docx2pdf import convert
convert("input.docx", "output.pdf") # Convert a single file
convert("my_docx_folder/") # Convert all docx files in a folder
convert
function takes either a single file path or a directory as input..docx
files to .pdf
using Microsoft Word in the background..doc
files directly.For Linux environments, utilizing LibreOffice via subprocess is a viable option.
import sys
import subprocess
import re
def convert_to_pdf(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
if filename:
return filename.group(1)
else:
print(process.stderr.decode())
return None
def libreoffice_exec():
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
if __name__ == "__main__":
result = convert_to('TEMP Directory', 'Your File.doc', timeout=15)
if result:
print(f"Conversion successful: {result}")
else:
print("Conversion failed.")
sys
, subprocess
, and re
.convert_to_pdf
Function:
subprocess.run
.libreoffice_exec
Function:
convert_to_pdf
function with the desired input and output.unoconv
with OpenOffice/LibreOfficeunoconv
is a universal document converter that utilizes OpenOffice or LibreOffice.
unoconv
:pip install unoconv
While unoconv
is primarily a command-line tool, it can be called from Python using subprocess
.
import subprocess
def convert_with_unoconv(input_file, output_file):
try:
subprocess.run(['unoconv', '-f', 'pdf', '-o', output_file, input_file], check=True)
print(f"Successfully converted {input_file} to {output_file}")
except subprocess.CalledProcessError as e:
print(f"Error converting {input_file}: {e}")
if __name__ == "__main__":
convert_with_unoconv("input.doc", "output.pdf")
subprocess.run
to execute the unoconv
command with the necessary arguments to convert the input file to PDF.comtypes
, you might encounter COM errors. Ensure that Word is visible initially (word.Visible = True
) and add a delay (time.sleep(3)
) to allow the COM server to initialize properly.The best method for converting .doc
to .pdf
with Python depends on your specific requirements and environment:
comtypes
or docx2pdf
for direct interaction with Word.docx2pdf
is suitable for Windows and macOS, provided MS Office is installed.unoconv
for a robust, server-friendly solution.By understanding these methods and their considerations, you can effectively automate the conversion of .doc
files to .pdf
using Python, streamlining your document processing workflows.