


Thu, Jan 29, 2015 11:18 pm

Word to Clean HTML

I maintain a website for an association with a little over 2,500 members that sends a monthly newsletter to its members by U.S. Postal Service and by email. The newsletter editor produces the newsletter in Microsoft Word and sends me the document, which I convert to HTML and post to the association's website, a site I've been maintaining for years now. I initially tried Microsoft Word's "Save as Web Page" feature, but parts of the HTML code it produced didn't display properly on non-Windows systems, sometimes because those systems didn't have the same fonts as Microsoft Windows systems. The code Word produced was also messy to edit. Eventually, I decided it was actually quicker to copy the text from Word, paste it into the Vi text editor, and add the appropriate HTML formatting tags manually. That gets the newsletter looking close to the original Word version, but in a format that displays similarly across browsers and operating systems.

The copying, pasting, and editing process can take an hour or more, so when I came across the Word to Clean HTML site, which provides a free tool to convert documents produced by Microsoft Word and similar office software to HTML, I decided to try it on the newsletter. The tool "strips out invalid or proprietary tags leaving clean HTML behind for use in web pages and ebooks". I hoped it might save me a fair portion of the time I normally spend each month on the manual conversion process and allow me to get the newsletter posted more promptly after I receive it. So I copied the contents of the newsletter with command-C (I'm normally handling it on a Mac) and pasted it into the form on the site's webpage. I checked a couple of the options that weren't checked by default: "replace non-ascii with HTML entities" and "replace smart quotes with ascii equivalents". I then clicked on the "convert to clean html" button.
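To give a rough idea of what those two options amount to, here is a minimal Python sketch using only the standard library. It is not the tool's actual code; the function names are hypothetical, and I'm assuming the options behave roughly like this: curly quotes are mapped to their plain ASCII equivalents, and any remaining non-ASCII characters become numeric HTML character references.

```python
# Illustrative sketch only; NOT the Word to Clean HTML tool's code.
# Function names are hypothetical.

# Map the four common "smart" quote characters to plain ASCII quotes.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def replace_smart_quotes(text: str) -> str:
    """Replace curly quotes with their ASCII equivalents."""
    return text.translate(SMART_QUOTES)


def non_ascii_to_entities(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for characters
    # that cannot be encoded as ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    sample = "\u201cMembers\u2019 news\u201d from the caf\u00e9"
    print(non_ascii_to_entities(replace_smart_quotes(sample)))
```

The quotes are replaced first; otherwise the entity step would turn them into numeric references before the quote step ever saw them.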

The HTML code produced by the tool was much cleaner than that produced by Word and looked the same when viewed in browsers on different operating systems. I wouldn't have needed to do any editing to have the newsletter display appropriately, but I noticed that in an unordered list there were extraneous <strong><u></u></strong> tags at the end of each <li> entry, i.e., tags that enclosed no text and weren't needed. They wouldn't have affected members' view of the newsletter, but I removed them anyway and made the source code a little more readable by putting blank lines between some of the items. Neither change was really needed, though. By using the free online tool I should, hopefully, be able to reduce the process of posting the newsletter to about 15 minutes and get the newsletter posted shortly after I receive it. I'm thankful to Olly Cope, a freelance Python web developer, for making the tool freely available to others. The tool was written in Python using the lxml library.
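For anyone who would rather script the removal of empty tag pairs like the ones I deleted by hand, here is a minimal sketch in Python. It uses a regular expression from the standard library rather than lxml, so it is not how the tool itself works, and the function name and the list of tags are my own assumptions. It repeats the substitution until nothing changes, so nested empty pairs such as <strong><u></u></strong> are removed from the inside out.

```python
import re

# Illustrative sketch only; the tag list and function name are assumptions.
# Matches an inline element that contains nothing but optional whitespace,
# e.g. <u></u> or <strong class="x"> </strong>.
EMPTY_INLINE = re.compile(
    r"<(strong|b|em|i|u|span)\b[^>]*>\s*</\1\s*>",
    re.IGNORECASE,
)


def strip_empty_inline_tags(html: str) -> str:
    """Remove empty inline tag pairs, including nested ones."""
    previous = None
    # Each pass removes the innermost empty pairs; loop until stable.
    while previous != html:
        previous = html
        html = EMPTY_INLINE.sub("", html)
    return html


if __name__ == "__main__":
    item = "<li>Annual meeting<strong><u></u></strong></li>"
    print(strip_empty_inline_tags(item))
```

A regular expression is adequate for this narrow cleanup of already tidy output; for arbitrary HTML, a real parser would be the safer choice.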

