I need to convert 1K PDF files to DOC on a Debian server. I can convert a PDF to Word using the LibreOffice command line: libreoffice --headless --invisible --convert-to doc Sample-doc-file-100kb.pdf
The main problem with the above two commands is that the DOC file doesn't include the images from the pages; it only contains the formatted text. Is there a better way to convert PDF to DOC that also includes the images present in the PDF? I am not interested in web services like zamzam; I need to do this from the command line on the server.
I managed to do it with this command:
libreoffice --infilter="writer_pdf_import" --headless --convert-to doc:"writer_pdf_Export" Brief.pdf
This solution gives me the same output as @igiannak's answer.
I tried converting the PDF to HTML and then to DOC, but I ran into a problem: the resulting .doc file is detected as a PDF, and LibreOffice opens it in Draw.
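One way to confirm the symptom described above (a .doc that is really a PDF) is to look at the first bytes of the output file instead of trusting its extension. The following is only a minimal sketch: the helper name sniff_format is hypothetical, and the check is coarse, since the OLE2 and ZIP signatures are shared by other office formats as well.

```python
# Leading bytes ("magic numbers") of the container formats discussed here.
# Note: OLE2 is also used by .xls/.ppt, and ZIP by many formats besides .docx,
# so this only tells you the container type, not the exact document type.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2",  # legacy .doc container
    b"PK\x03\x04": "zip",                          # .docx container
}


def sniff_format(path):
    """Return 'pdf', 'ole2', 'zip', or 'unknown' based on the file header."""
    with open(path, "rb") as fh:
        head = fh.read(8)  # the longest signature above is 8 bytes
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"
```

If sniff_format("Brief.doc") returns "pdf", the conversion wrote PDF data under a .doc name, which would explain why LibreOffice opens the file in Draw.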
Is there a direct command-line tool for converting PDF to DOCX that keeps the images present in the PDF? I tried the libreoffice and soffice commands, but they produced only simply formatted text. Is there a library like the pywin32 COM client available on Linux/Ubuntu for PDF-to-Word conversion?
import os
import sys
import comtypes.client

wdFormatPDF = 17  # Word's WdSaveFormat value for PDF output

def covx_to_pdf(infile, outfile):
    """Convert a Word .docx to PDF"""
    # Start Microsoft Word through COM (Windows only)
    word = comtypes.client.CreateObject('Word.Application')
    doc = word.Documents.Open(infile)
    doc.SaveAs(outfile, FileFormat=wdFormatPDF)
    # Close the document and shut Word down again
    doc.Close()
    word.Quit()
But this package does not support Linux/Debian platforms.
Can anyone suggest an equivalent implementation on Linux/Debian for PDF-to-Word conversion?
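A rough Linux counterpart to the COM snippet is to drive LibreOffice from Python with subprocess. This is only a sketch under assumptions: LibreOffice is installed and reachable as soffice or libreoffice, and the filter names "writer_pdf_import", "MS Word 97" and "MS Word 2007 XML" are the ones your LibreOffice build registers. The function names are hypothetical, and whether images survive still depends on the PDF and the LibreOffice version.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed Writer export filter names for each target extension.
EXPORT_FILTERS = {"doc": "MS Word 97", "docx": "MS Word 2007 XML"}


def build_command(pdf_path, out_dir, fmt="docx", binary="soffice"):
    """Build the LibreOffice argument list for one PDF-to-Word conversion."""
    return [
        binary,
        "--headless",
        # Ask LibreOffice to import the PDF with Writer instead of Draw.
        "--infilter=writer_pdf_import",
        "--convert-to", f"{fmt}:{EXPORT_FILTERS[fmt]}",
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def pdf_to_word(pdf_path, out_dir, fmt="docx"):
    """Convert one PDF and return the path LibreOffice is expected to write."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice (soffice) not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_command(pdf_path, out_dir, fmt, binary), check=True)
    return Path(out_dir) / f"{Path(pdf_path).stem}.{fmt}"


def convert_folder(src_dir, out_dir, fmt="docx"):
    """Convert every *.pdf in src_dir, one file at a time."""
    return [pdf_to_word(p, out_dir, fmt) for p in sorted(Path(src_dir).glob("*.pdf"))]
```

For a large batch, convert_folder("pdfs", "converted") would process the files sequentially; running several LibreOffice instances in parallel usually needs separate user profiles, which this sketch does not attempt.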