How to retrieve text from Windows Office 2007 Word docs
Browsing makezine.com turned up a way to pull just the text out of a .docx file created by Word in Office 2007 for Windows. This page has the info you need: a simple PHP script that extracts the text from the file.

I think OpenOffice.org 2.0 may also be able to help, but I haven't tried it yet, so I would love to hear from anyone who has made this work.

[robg adds: On that page, several other solutions are mentioned. It should be noted that, as of now, all of them will strip the formatting from the file, providing just the text. Microsoft has promised free converters for older versions of Office on the Mac. I'll list the solutions here for easy reference for anyone searching:
  • docx-converter.com -- a website that takes a .docx file as input and spits out the pure text content.
  • An Automator script that does the same thing.
  • If you own BBEdit or TextMate (or probably others), they have a "strip all tags" function you can use on the Word XML file. To see the XML file, though, you first need to change the .docx extension to .zip, then expand that archive in the Finder. Open the resulting folder, go into the word folder, and open the document.xml file in BBEdit or TextMate, then use the app's strip tags function to pull out the text.
With the recent news that the XML converters won't be out until April or so of next year for current versions of Office, I think tricks like this are going to be increasingly necessary. Hopefully some brilliant coder out there will figure out how to parse the XML before Microsoft does, as losing all formatting is far from ideal.]
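The unzip-and-strip-tags steps above can be sketched in a few lines of Python. This is only a rough sketch, not what any of the listed tools actually do: the function name strip_docx_tags is made up for illustration, and it assumes word/document.xml is UTF-8 encoded. Like the other approaches, it throws away all formatting, and paragraphs run together because nothing replaces the removed tags.

```python
import re
import zipfile
from html import unescape


def strip_docx_tags(path):
    """Return the text of a .docx file with every XML tag removed."""
    # A .docx file is a zip archive; the document body is word/document.xml.
    with zipfile.ZipFile(path) as archive:
        # Assumption: the XML is UTF-8 encoded.
        xml = archive.read("word/document.xml").decode("utf-8")
    # Same idea as an editor's "strip all tags" command: drop anything in <...>.
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; back into the literal characters.
    return unescape(text)
```

Calling print(strip_docx_tags("some.docx")) would print the bare text, where some.docx stands in for whatever file you received.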

OpenOffice.org from Novell
Authored by: DamienMcKenna on Fri, Dec 8 2006 at 8:09AM PST
Novell are going to be releasing binaries of OpenOffice.org with importers for MSFT's XML formats, and I'm expecting their code to be merged into the main OOo codebases soon thereafter.

How to retrieve text from Windows Office 2007 Word docs
Authored by: dan55304 on Fri, Dec 8 2006 at 8:43AM PST
It's kind of simple for me: Word 2007 is vaporware without the converters, and I just won't buy it. Word 2007 should be working in 2009 or so 8-)

How to retrieve text from Windows Office 2007 Word docs
Authored by: macavenger on Fri, Dec 8 2006 at 3:58PM PST
If only it were that easy. Unfortunately, you may well need to open .docx files before then, say from a friend or coworker. That's where the tricks mentioned in this hint will come in handy :)

---
iMac FP 17" 800MHz OS X 10.4.x

How to retrieve text from Windows Office 2007 Word docs
Authored by: ctierney on Fri, Dec 8 2006 at 9:26AM PST
Thanks for the tip! Now I'll be prepared the next time I get one of these files. Here's another method that could be wrapped into an AppleScript droplet:
unzip -p some.docx word/document.xml | perl -pe 's/<[^>]+>|[^[:print:]]+//g'
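The pipeline above removes every tag and every newline, so all the paragraphs run together. If you want to keep the paragraph breaks, the XML can be parsed instead of stripped. Here is a hedged sketch in Python: the function name docx_paragraphs is invented for this example, and it only looks at w:p (paragraph) and w:t (text run) elements, so tabs, manual line breaks, and anything in headers or footnotes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return a list holding the text of each paragraph in a .docx file."""
    # A .docx file is a zip archive; the document body is word/document.xml.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    # Each w:p element is a paragraph; its text sits in w:t descendants.
    for para in root.iter(W + "p"):
        paragraphs.append("".join(t.text or "" for t in para.iter(W + "t")))
    return paragraphs
```

Something like print("\n".join(docx_paragraphs("some.docx"))) would then print one paragraph per line. Formatting is still lost; this only recovers where the paragraphs start and end.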

How to retrieve text from Windows Office 2007 Word docs
Authored by: ctierney on Fri, Dec 8 2006 at 11:45AM PST
Here's a droplet that'll extract plain text to the clipboard:
on open these_items
   -- The open handler receives a list of dropped items; use the first one
   set docxPath to POSIX path of (item 1 of these_items)
   try
      -- quoted form protects paths that contain spaces or shell metacharacters
      do shell script "unzip -p " & quoted form of docxPath & " word/document.xml | perl -pe 's/<[^>]+>|[^[:print:]]+//g' | pbcopy"
   end try
end open

on run
   display dialog "Drop a docx file on this droplet and its plain text contents will be copied to the clipboard." buttons {"OK"} giving up after 10 default button 1
end run


How to retrieve text from Windows Office 2007 Word docs
Authored by: hoosker on Wed, Dec 13 2006 at 4:30PM PST
I create lots of forms for my office staff with a lone Mac. I would love to be able to perform this Word watermark trick for them but they all use Windows and their version of Office does not import PDF files. I could convert the PDF to a bitmap image but that would make for big ugly files.

According to rumors...
Authored by: DamienMcKenna on Thu, Dec 14 2006 at 8:09PM PST
According to some rumors doing the rounds on the blogs, the TextEdit in Leopard will support loading docx files. Now if they'd just support ODF in all of their software we wouldn't need NeoOffice.

According to rumors...
Authored by: mnoriega on Tue, Apr 10 2007 at 6:08PM PDT
Now that version 2.1 of NeoOffice is out, you can open docx files directly with NeoOffice.

Download it free from:
http://www.neooffice.org/

How to retrieve text from Windows Office 2007 Word docs
Authored by: jimisoft on Wed, May 30 2007 at 6:47PM PDT
Try All2Txt 2.0 at:
http://www.jimisoft.com/en/all2txt.html

This software can retrieve text from .docx files.
