How to retrieve text from Windows Office 2007 Word docs
Fri, Dec 8 2006 at 7:30AM PST • Contributed by: nicbav
Browsing makezine.com turned up a way to pull just the text out of a .docx file created by Word in Office 2007 for Windows. This page has the info you need: a simple PHP script that extracts the text from the file.
OpenOffice.org 2.0 may be able to help as well, but I haven't tried it yet, so I would love to hear from anyone who has made this work.
[robg adds: Several other solutions are mentioned on that page. Note that, as of now, all of them strip the formatting from the file, leaving just the text. Microsoft has promised free converters for older versions of Office on the Mac. I'll list the other solutions here for easy reference for anyone searching:
- docx-converter.com -- a website that takes a .docx file as input and spits out the pure text content.
- An Automator script that does the same thing.
- If you own BBEdit or TextMate (or probably others), they have a "strip all tags" function you can use on the Word XML file. To see the XML file, though, you first need to change the .docx extension to .zip, then expand that archive in the Finder. Open the resulting folder, go into the word folder, and open the document.xml file in BBEdit or TextMate, then use the app's strip tags function to pull out the text.]
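The unzip-and-strip-tags approach from the last item can also be scripted. What follows is only a rough sketch in Python, not the PHP script mentioned above and not what any of the listed tools actually do. The function names strip_tags and docx_to_text are made up for illustration, and the code assumes document.xml is UTF-8 encoded and that each closing w:p tag marks the end of a paragraph. Like the other solutions, it throws away all formatting.

```python
import re
import zipfile
from html import unescape


def strip_tags(xml: str) -> str:
    """Crude "strip all tags" pass (hypothetical helper, not a real XML parser)."""
    # Assumption: a closing </w:p> ends a paragraph, so turn it into a newline.
    xml = re.sub(r"</w:p>", "\n", xml)
    # Drop everything else that looks like a tag, including the XML declaration.
    text = re.sub(r"<[^>]+>", "", xml)
    # Turn entities such as &amp; back into plain characters.
    return unescape(text).strip()


def docx_to_text(path: str) -> str:
    """Read word/document.xml directly from the .docx, which is a zip archive."""
    with zipfile.ZipFile(path) as archive:
        # Assumption: the XML is UTF-8 encoded.
        xml = archive.read("word/document.xml").decode("utf-8")
    return strip_tags(xml)
```

To try it, call print(docx_to_text("report.docx")), where report.docx stands in for your own file. There is no need to rename the file to .zip first, since the zipfile module only looks at the contents.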
