The challenge of converting Word documents (.doc and .docx) to PDF format in a .NET Core environment, without relying on Microsoft Office Interop, is a common one. This article explores various approaches to tackle this problem, providing a detailed guide for developers seeking a robust and efficient solution.
Many applications require the ability to display Word documents within a browser. Since browsers natively support PDF, converting Word files to PDF on the server-side becomes a necessity. While Microsoft Office Interop offers a solution, it's not compatible with .NET Core, especially in cross-platform environments like Azure or Docker containers.
One popular method involves leveraging the Open XML SDK to read .docx files and convert them into HTML. Then, an HTML to PDF converter transforms the HTML into a PDF document.
Download and Build OpenXMLSDK-PowerTools: Obtain the .NET Core project and build the OpenXMLPowerTools.Core and OpenXMLPowerTools.Core.Example projects.
Add Word Document: Include a .docx file (e.g., test.docx) in the project and set its "Copy to Output Directory" property to "If Newer".
Run Console Project: Execute the console project with the following code:
using System;
using System.IO;
using System.IO.Packaging;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

static void Main(string[] args)
{
    // Dispose the package and the document so test.docx is not left locked.
    using (var source = Package.Open(@"test.docx"))
    using (var document = WordprocessingDocument.Open(source))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings();
        XElement html = HtmlConverter.ConvertToHtml(document, settings);
        Console.WriteLine(html.ToString());
        // Write the generated HTML to test.html in the working directory.
        File.WriteAllText("test.html", html.ToString());
    }
    Console.ReadLine();
}
Handle Images and Links: Address issues with missing or broken images and links using the approach outlined in this CodeProject article. The code snippet below demonstrates how to fix broken URIs:
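public static Uri FixUri(string brokenUri)
{
    const string mailTo = "mailto:";
    if (brokenUri.StartsWith(mailTo, StringComparison.OrdinalIgnoreCase))
    {
        // Drop the scheme and keep the address. The address has no scheme
        // of its own, so it can only be created as a relative URI.
        string address = brokenUri.Substring(mailTo.Length).Trim();
        if (address.Length > 0 && Uri.TryCreate(address, UriKind.Relative, out Uri fixedUri))
        {
            return fixedUri;
        }
    }
    // new Uri(" ") throws UriFormatException, so return a well-formed
    // placeholder instead (the value is illustrative).
    return new Uri("http://broken-link/");
}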
HTML to PDF Conversion with DinkToPdf: Use DinkToPdf to convert the generated HTML to PDF. Ensure the libwkhtmltox.so and libwkhtmltox.dll files are in the root of your project.
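using System.IO;
using DinkToPdf;

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\path\to\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};

// Render the document to PDF bytes. BasicConverter is meant for
// single-threaded use; DinkToPdf also ships a SynchronizedConverter.
var converter = new BasicConverter(new PdfTools());
byte[] pdf = converter.Convert(doc);
File.WriteAllBytes("test.pdf", pdf); // output file name is illustrative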
Another approach involves using the LibreOffice binary (soffice) to convert documents. This method supports various formats beyond .doc and .docx.
Identify LibreOffice Path: Determine the path to the soffice binary based on the operating system.
using System;
using System.IO;
using System.Reflection;

static string getLibreOfficePath()
{
    switch (Environment.OSVersion.Platform)
    {
        case PlatformID.Unix:
            // Typical package install location on Linux.
            return "/usr/bin/soffice";
        case PlatformID.Win32NT:
            // Expects a LibreOffice copy bundled next to the executing assembly.
            string binaryDirectory = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
            return Path.Combine(binaryDirectory, "Windows", "program", "soffice.exe");
        default:
            throw new PlatformNotSupportedException("Your OS is not supported");
    }
}
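Note that Environment.OSVersion.Platform reports PlatformID.Unix on macOS as well as Linux, so the switch above sends macOS to /usr/bin/soffice too. A minimal sketch of an alternative based on RuntimeInformation.IsOSPlatform follows. The class and method names are hypothetical, the macOS path assumes a standard LibreOffice.app installation, and AppContext.BaseDirectory is swapped in for the executing assembly's directory.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of any library.
public static class LibreOfficeLocator
{
    // Returns the expected soffice location for the current operating system.
    public static string GetSofficePath()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            // Same layout as the switch above: a copy bundled with the application.
            return Path.Combine(AppContext.BaseDirectory, "Windows", "program", "soffice.exe");
        }
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX))
        {
            // Assumption: LibreOffice installed as a standard macOS app bundle.
            return "/Applications/LibreOffice.app/Contents/MacOS/soffice";
        }
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux))
        {
            return "/usr/bin/soffice";
        }
        throw new PlatformNotSupportedException("Your OS is not supported");
    }
}
```

Calling LibreOfficeLocator.GetSofficePath() only returns the expected location; checking File.Exists on the result before starting the process gives a clearer error when LibreOffice is missing.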
Execute Conversion: Use ProcessStartInfo to run the soffice command.
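// Requires: using System; using System.Diagnostics;
string libreOfficePath = getLibreOfficePath();
// Quote the input path so file names containing spaces are passed as one argument.
ProcessStartInfo procStartInfo = new ProcessStartInfo(libreOfficePath, string.Format("--convert-to pdf --nologo \"{0}\"", args[0]));
procStartInfo.RedirectStandardOutput = true;
procStartInfo.UseShellExecute = false;
procStartInfo.CreateNoWindow = true;
procStartInfo.WorkingDirectory = Environment.CurrentDirectory;
using (Process process = new Process() { StartInfo = procStartInfo })
{
    process.Start();
    // Drain stdout before waiting; an unread redirected stream can block the child process.
    process.StandardOutput.ReadToEnd();
    process.WaitForExit();
    if (process.ExitCode != 0)
    {
        // LibreOfficeFailedException is a custom exception type not shown in this article.
        throw new LibreOfficeFailedException(process.ExitCode);
    }
}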
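The snippet above throws LibreOfficeFailedException, a type the article uses without defining, and it never says where the PDF ends up. The sketch below shows one possible definition of that exception plus a hypothetical helper (ConversionPaths.ExpectedPdfPath) for locating the result. It relies on soffice writing a file named after the input, with a .pdf extension, into the working directory unless --outdir is passed.

```csharp
using System;
using System.IO;

// One possible definition; the article does not show the original type.
public class LibreOfficeFailedException : Exception
{
    // Exit code reported by the soffice process.
    public int ExitCode { get; }

    public LibreOfficeFailedException(int exitCode)
        : base($"LibreOffice exited with code {exitCode}")
    {
        ExitCode = exitCode;
    }
}

// Hypothetical helper for finding the converted file.
public static class ConversionPaths
{
    // Builds the path where soffice is expected to write the PDF:
    // the input file name with a .pdf extension, inside outputDirectory.
    public static string ExpectedPdfPath(string inputPath, string outputDirectory)
    {
        string pdfName = Path.GetFileNameWithoutExtension(inputPath) + ".pdf";
        return Path.Combine(outputDirectory, pdfName);
    }
}
```

With the working directory set to Environment.CurrentDirectory as above, ConversionPaths.ExpectedPdfPath(args[0], Environment.CurrentDirectory) gives the path to check with File.Exists after the process exits.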
FreeSpire.Doc is a .NET library that allows converting .docx files to PDF with a limitation of 3 pages for the free version.
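using Spire.Doc;
// listOfDocx[i] is the input file path and savePath the target PDF path; both are supplied by the caller.
Spire.Doc.Document document = new Spire.Doc.Document(listOfDocx[i], FileFormat.Auto);
document.SaveToFile(savePath, FileFormat.PDF);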
Gotenberg is a Docker-based solution that utilizes LibreOffice for document conversions, offering a stateless API.
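As a rough sketch of what calling Gotenberg from .NET Core might look like, the code below posts a .docx file as multipart form data and reads the PDF bytes from the response. It assumes a Gotenberg 7 or later container, where the LibreOffice route is /forms/libreoffice/convert and uploads use the form field name files; older versions expose different routes. The GotenbergClient class and its method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical client wrapper, not an official Gotenberg SDK.
public static class GotenbergClient
{
    // Wraps the document bytes in the multipart form the conversion route expects.
    public static MultipartFormDataContent BuildForm(byte[] documentBytes, string fileName)
    {
        var form = new MultipartFormDataContent();
        // Assumption: "files" is the upload field name (Gotenberg 7+).
        form.Add(new ByteArrayContent(documentBytes), "files", fileName);
        return form;
    }

    // Sends the document to Gotenberg and returns the converted PDF bytes.
    public static async Task<byte[]> ConvertAsync(HttpClient http, Uri baseUri, string documentPath)
    {
        byte[] input = await File.ReadAllBytesAsync(documentPath);
        using (var form = BuildForm(input, Path.GetFileName(documentPath)))
        using (HttpResponseMessage response = await http.PostAsync(new Uri(baseUri, "forms/libreoffice/convert"), form))
        {
            // Throws HttpRequestException on a non-success status code.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsByteArrayAsync();
        }
    }
}
```

A call might look like await GotenbergClient.ConvertAsync(httpClient, new Uri("http://localhost:3000/"), "test.docx"), where the host and port are assumptions that depend on how the container is published.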
The free "Report-From-DocX-HTML-To-PDF-Converter" library, built on .NET Core under the MIT license, provides a simple solution, requiring only LibreOffice.
Converting Word documents to PDF in .NET Core without relying on Microsoft Office Interop requires a strategic approach. Whether you choose to leverage the Open XML SDK, LibreOffice, or a dedicated library like FreeSpire.Doc, understanding the nuances of each method will help you select the best solution for your specific needs. Remember to consider factors such as platform compatibility, licensing costs, and the complexity of the documents being converted to ensure a smooth and efficient conversion process.