Monday, May 27, 2013

Converting Word Documents to HTML in PowerShell with DefaultWebOptions

In working on a SharePoint-related project I found myself needing to convert a large number of Word documents to clean HTML while maintaining the integrity of the embedded images.  So I turned to PowerShell to automate the process but found it wasn't as easy as I had hoped.  After digging through some rather unhelpful MSDN documentation (is there any other kind?) I finally figured out how to force it to do what I wanted.

If you've ever used the "Save As Web Page" option in Word, then you know that the HTML it creates is a complete mess, full of Word styles and formatting options, and nothing you would ever actually use on a real web page.  Thankfully, there is a "Web Page, Filtered" option that strips most of the nonsense out of the resulting markup.  But for some reason it still insists on exporting down-sampled GIF images instead of full-fidelity JPEG or PNG.  The fix for this problem, which apparently cannot be set on a global basis without slinging some VBA code, is in the Tools > Web Options > Browsers menu next to the "Save" button in the "Save As" dialog.

(NOTE: You can also change the 'screen size' and 'pixels per inch' options for images under the "Pictures" tab but I found that it only resizes images with anti-aliasing so they end up blurry and unreadable.)

Now, that's all well and good for saving individual files, but it's no help at all for converting a bunch of them at once.  This is where PowerShell saves the day.  A quick script with some Word automation will get us most of the way there, allowing us to specify the type of conversion we want (wdFormatFilteredHTML, which equates to a value of '10' in the WdSaveFormat enumeration; see the WdSaveFormat documentation for a complete list of supported formats).  By itself, though, the script uses the default web options when performing the save operation, resulting in the same poor quality image conversions.

After hours digging around in C# samples and Word automation docs, I finally found a way to set the image conversion options in code.  Turns out the Word application object exposes a DefaultWebOptions object, and each open document exposes the same settings through its WebOptions property, allowing us to specify values for all the options exposed in the Tools menu.  Setting "AllowPNG" to $true on the document did the trick and all images came through the conversion process looking much better than they did before.

Here is the final script, which iterates a single directory using Get-ChildItem, instantiates a new Word application object, then runs the conversion with the proper web options:

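$files = Get-ChildItem -Filter "*.docx"

$savedir = "F:\ConvertedFiles\"

$word = New-Object -ComObject "Word.Application"

$word.Visible = $true

# 10 = wdFormatFilteredHTML in the WdSaveFormat enumeration
$wdFormatFilteredHTML = [ref] 10

foreach ($document in $files)
{
    Write-Host $document.Name

    # Open the source document (this step was missing from the listing as posted)
    $existingDoc = $word.Documents.Open($document.FullName)

    $name = $document.Name.Replace(" ","_")
    $saveaspath = $savedir + $name.Replace('.docx','.htm')

    # Export images as PNG instead of down-sampled GIF
    $existingDoc.WebOptions.AllowPNG = $true

    $existingDoc.SaveAs( [ref] $saveaspath, $wdFormatFilteredHTML )
    $existingDoc.Close()
}

$word.Quit()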



Monday, February 11, 2013

Solving the Word Wrap Problem in Firefox with JavaScript

I know it's all the rage among web developers to blame IE for everything that's wrong on the Internet but sometimes - just sometimes - the folks in Redmond get it right and the fine volunteers of the Mozillaverse get it wrong.  Perhaps there's no better example of this than word breaking in columns.  For years, IE has had the ever-so-useful CSS property "word-wrap" which, when supplied with a value like "break-word", will split up a long string of text and preserve the precious layout you've slaved for hours to create.  It's so handy, in fact, that it actually made it into the CSS3 spec (don't be a hater - even the standards busybodies know a good thing when they see it).

Unfortunately, Firefox doesn't recognize this property.  Yes, I know it's supposed to in version 3.5+ but I have yet to see it actually working.  I've spent the better part of the last 9 months doing JavaScript programming for SharePoint 2013 apps so I've spent way too much time testing browser compatibility and not once have I seen FF honor this property.  Perhaps there's some secret Little Orphan Annie Decoder Ring trick to make it work.  If so, please share it with me, but until then we'll have to continue hacking our way around this ridiculous exclusion in an otherwise fine browser.

A lot of people suggest doing this in CSS like so:

.wrap {
    white-space: pre-wrap;       /* CSS3 */
    white-space: -moz-pre-wrap;  /* Mozilla */
}

But I've found that solution is spotty - it sometimes works and sometimes doesn't (mostly the latter).  Instead, I prefer to use JavaScript to solve this problem, which eliminates the need for browser-dependent CSS tricks.  Taking, for example, a string that has 20 characters and must fit into a column that only permits 10 characters, the solution would look something like this:

var newstring = oldstring.replace(/([^\s\t\r\n<>]{10})/g, "$1<wbr>");

Or, if you would prefer a hyphen instead of a soft line break:

var newstring = oldstring.replace(/([^\s\t\r\n<>]{10})/g, "$1&shy;");
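
To make the effect concrete, here is a small sketch that runs the first replacement on a single unbroken 20-character token.  The sample string is made up for illustration, and the quantifier is set to 10 to match the 10-character column in the example.

```javascript
// Hypothetical sample: one unbroken 20-character token.
var oldstring = "ABCDEFGHIJKLMNOPQRST";

// Add a <wbr> break opportunity after every run of 10 characters
// that are neither whitespace nor angle brackets.
var newstring = oldstring.replace(/([^\s<>]{10})/g, "$1<wbr>");

console.log(newstring); // ABCDEFGHIJ<wbr>KLMNOPQRST<wbr>
```

The trailing break after the second run is harmless, since a break opportunity at the end of a word never renders.  Words shorter than 10 characters are left alone because the pattern never matches them.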

Inline replacement is fine but if you end up doing it repeatedly within an application all those regular expressions get pretty redundant.  Plus, it would be nice to be able to specify the length as a parameter for a reusable function.  A quick bit of additional code will get us there:

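function breakWord(string, length) {
    // Backslashes are doubled because the pattern is built from a string literal;
    // a single "\s" inside quotes would collapse to a plain "s".
    var reg = new RegExp("([^\\s\\t\\r\\n<>]{" + length + "})", "g");
    var s = string.replace(reg, "$1<wbr>");
    return s;
}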

We can now call that function on any string we like:

var newstring = breakWord(oldstring,10);
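
If you want to switch between the soft line break and the hyphen without writing two functions, one possible variation takes the break marker as a third argument.  This is only a sketch: the name breakLongWords, the marker parameter and the length guard are my own additions for illustration, not part of the function above.

```javascript
// Hypothetical helper: like breakWord, but the break marker is configurable.
function breakLongWords(text, maxLength, marker) {
    // Default to a soft line break when no marker is supplied.
    marker = marker || "<wbr>";

    // Return the text unchanged if the length is not a positive number.
    var n = parseInt(maxLength, 10);
    if (isNaN(n) || n < 1) {
        return text;
    }

    // Backslashes are doubled because the pattern is built from a string literal.
    var reg = new RegExp("([^\\s<>]{" + n + "})", "g");

    // A function replacement keeps any "$" in the marker from being
    // interpreted as a special replacement pattern.
    return text.replace(reg, function (match) {
        return match + marker;
    });
}

console.log(breakLongWords("a Supercalifragilistic word", 10));
// a Supercalif<wbr>ragilistic<wbr> word

console.log(breakLongWords("ABCDEFGHIJKLMNOPQRST", 10, "&shy;"));
// ABCDEFGHIJ&shy;KLMNOPQRST&shy;
```

One caveat applies to this sketch and to the pattern in general: the character class only skips the angle brackets themselves, so a long attribute value inside a tag (a lengthy href, say) would still get a marker inserted.  It is safest to run it on plain text rather than on strings that already contain markup.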

Ah, that's better.  Word breaking for any string in IE and Firefox.  Now, Microsoft, let's talk about all those web pages that don't work in IE10, shall we?  Like, I dunno, just as a random example, SharePoint 2010 dialogs.  How 'bout it?