Monday, May 27, 2013

Converting Word Documents to HTML in PowerShell with DefaultWebOptions

In working on a SharePoint-related project I found myself needing to convert a large number of Word documents to clean HTML while maintaining the integrity of the embedded images.  So I turned to PowerShell to automate the process but found it wasn't as easy as I had hoped.  After digging through some rather unhelpful MSDN documentation (is there any other kind?) I finally figured out how to force it to do what I wanted.

If you've ever used the "Save As Web Page" option in Word, then you know that the HTML it creates is a complete mess, full of Word styles and formatting options, and nothing you would ever actually use on a real web page.  Thankfully, there is a "Web Page, Filtered" option that strips most of the nonsense out of the resulting markup.  But for some reason it still insists on exporting down-sampled GIF images instead of full-fidelity JPEG or PNG.  The fix for this problem, which apparently cannot be set on a global basis without slinging some VBA code, is the "Allow PNG as a graphics format" checkbox on the Browsers tab of Tools > Web Options.  (The Tools menu sits next to the "Save" button in the "Save As" dialog.)

(NOTE: You can also change the 'screen size' and 'pixels per inch' options for images under the "Pictures" tab but I found that it only resizes images with anti-aliasing so they end up blurry and unreadable.)

Now, that's all well and good for saving individual files, but it's no help at all for converting a bunch of them at once.  This is where PowerShell saves the day.  A quick script with some Word automation will get us most of the way there, allowing us to specify the type of conversion we want: wdFormatFilteredHTML, which has a value of 10 in the WdSaveFormat enumeration (the WdSaveFormat documentation has the complete list of supported formats).  The catch is that the script uses the default web options when performing the save operation, resulting in the same poor-quality image conversions.
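For reference, a few WdSaveFormat values can be kept in a small lookup table so the script does not lean on a bare magic number.  This is only a sketch: the $WdSaveFormat variable name is my own invention, and the table covers just a handful of members of the enumeration.

```powershell
# A few members of Word's WdSaveFormat enumeration (not the complete list).
# $WdSaveFormat is a hypothetical variable name chosen for this sketch.
$WdSaveFormat = @{
    wdFormatDocument     = 0   # Word 97-2003 binary .doc
    wdFormatHTML         = 8   # "Web Page" (full, unfiltered HTML)
    wdFormatWebArchive   = 9   # Single-file web page (.mht)
    wdFormatFilteredHTML = 10  # "Web Page, Filtered"
    wdFormatPDF          = 17  # PDF
}

# Look up the value used for the filtered HTML conversion
$WdSaveFormat['wdFormatFilteredHTML']
```

The looked-up value can then be passed to SaveAs in place of the literal 10.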

After hours digging around in C# samples and Word automation docs, I finally found a way to set the image conversion options in code.  Turns out each Word document object exposes a WebOptions property, the per-document counterpart of the application-level DefaultWebOptions object, allowing us to specify values for the options exposed in the Web Options dialog.  Setting AllowPNG to $true on the document did the trick, and all images came through the conversion process looking much better than they did before.
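For comparison, here is a rough sketch of the application-level route.  In Word's object model, DefaultWebOptions is a method on the Application object that returns the global settings, while WebOptions is a property on each Document.  I have only relied on the per-document setting myself, so treat the assumption that the global default carries over to documents opened afterwards as untested.

```powershell
# Sketch only: requires Word to be installed on the machine.
$word = New-Object -ComObject "Word.Application"

# Application-wide default. DefaultWebOptions is a method, hence the parentheses.
# Assumption (untested here): documents saved later in this session inherit it.
$word.DefaultWebOptions().AllowPNG = $true

# ... open and save documents here ...

$word.Quit()
```

Setting the option on each document's WebOptions property, as the final script does, avoids depending on that assumption.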

Here is the final script, which iterates a single directory using Get-ChildItem, instantiates a new Word application object, then runs the conversion with the proper web options:


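# Collect every .docx file in the current directory
$files = Get-ChildItem -Filter "*.docx"
$savedir = "F:\ConvertedFiles\"

# Start Word through COM automation
$word = New-Object -ComObject "Word.Application"
$word.Visible = $true

# 10 = wdFormatFilteredHTML ("Web Page, Filtered") in the WdSaveFormat enumeration
$wdFormatFilteredHTML = [ref] 10

foreach ($document in $files)
{
    Write-Host $document.Name
    $existingDoc = $word.Documents.Open($document.FullName)

    # Build the output path: spaces become underscores, .docx becomes .htm
    $name = $document.Name.Replace(" ", "_")
    $saveaspath = $savedir + $name.Replace('.docx', '.htm')

    # Allow PNG output so images are not down-sampled to GIF
    $existingDoc.WebOptions.AllowPNG = $true
    $existingDoc.SaveAs([ref] $saveaspath, $wdFormatFilteredHTML)
    $existingDoc.Close()
}

$word.Quit()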
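
One thing to watch in the script: String.Replace is case-sensitive, so a file named REPORT.DOCX would keep its original extension, and a name containing ".docx" somewhere in the middle would be altered there too.  A possible refinement, sketched below, moves the naming step into a small helper built on [System.IO.Path]::ChangeExtension.  The function name Get-HtmlFileName is hypothetical and not part of the script above.

```powershell
# Hypothetical helper: turn a Word file name into a web-friendly .htm name.
function Get-HtmlFileName {
    param(
        [Parameter(Mandatory = $true)]
        [string] $Name
    )

    # Replace spaces with underscores, as the script above does
    $safe = $Name.Replace(' ', '_')

    # Swap only the final extension, whatever its casing
    return [System.IO.Path]::ChangeExtension($safe, '.htm')
}

Get-HtmlFileName 'Quarterly Report.DOCX'   # Quarterly_Report.htm
```

Inside the loop, the output path would then be built as $savedir + (Get-HtmlFileName $document.Name).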