What is meta-text information … and how to get rid of it?

Most of us use word processing software like Microsoft Word, Apple Pages or OpenOffice Writer to create documents these days. Although we see the words on the page, we seldom realise how much extra information is stored to make the words look the way they do: the typeface, the size, decorations such as bold or italics, and so on. This is what I call meta-text information or meta-data.

When you copy and paste from one program to another, the meta-data often comes along with the text. There are several ways to get rid of it, but the simplest is to paste the text into a program that refuses to accept meta-data, such as Notepad (Windows) or TextEdit (Mac, switched to plain-text mode), and then cut and paste it again into the desired program. If you are a corpus linguist, just save it from there to use as a plain-text file.

Note: I have heard of problems with MacEdit not saving files correctly for concordancer use. I cannot verify this, but I know the method described here and on the linked page works well in Windows.
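
For readers comfortable with a little scripting, the copy-and-paste step can be automated for .docx files. The following is only a minimal sketch, not the method described above. It assumes the file is a modern .docx (a zip archive whose main text sits in word/document.xml), and the function name docx_to_text is made up for illustration. It keeps the words and drops fonts, sizes and decorations. Older .doc files are not covered, and text boxes nested inside paragraphs may come out twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements inside a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        xml_bytes = zf.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # <w:p> is a paragraph; <w:t> holds the visible text of each run.
    for p in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in p.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The result can then be written out with open("corpus.txt", "w", encoding="utf-8"), where the file name is just an example. Choosing the encoding explicitly is the point: a concordancer that expects one encoding may misread a file saved in another.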

  1. Cutting and pasting is very tedious. There is a freeware program called Zilla Word to Text Converter that I use, but the encoding is Indian, so the quotation marks and apostrophes are not read correctly in the concordancer I use. This problem has annoyed me for some time now. I cannot get it corrected in the Perl programs I have written either.

    Perhaps you can help me understand what the problem is (I see you are from Pakistan and may be able to help).


  2. I understand your point about meta-text. As far as I know, OOo Writer uses a scheme similar to HTML to mark bold, headings and so on. I mean angle brackets. I noticed this while working on odt and doc files in OmegaT.
    And the idea of eliminating these tags using Notepad is fairly simple. I do the same myself while building a corpus: copy text from a doc file, paste it into Notepad, save, go to the next file….


  3. Hello Muhammad,
    Thank you for your comment. I am fully aware of what you are trying to say. Please see the link to understand the meaning of the term meta-text in this context. The problem many of my readers are facing is how to get rid of information hidden in Word documents, not meta-tags (HTML, XML, etc.).


  4. Corpus files should be in plain text, rule no. 1.
    Wanna remove meta-text, i.e. text within angle brackets? Use regular expressions (rule no. 2).
    I’ve learned these two rules in the last few years while working with corpora.
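
The two rules in the last comment can be sketched in Python. This is a rough illustration under assumptions, not a tool anyone in this thread has used. The names strip_tags and straighten_quotes are invented, the pattern treats anything between a pair of angle brackets as a tag, and the quote mapping only helps if the quotation-mark problem from the first comment is caused by typographic (curly) quotes, which is a guess.

```python
import re

# Anything between "<" and ">" that contains no other angle bracket.
TAG_RE = re.compile(r"<[^<>]+>")

# Map curly single and double quotes to their plain ASCII equivalents.
QUOTES = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def strip_tags(text):
    """Remove angle-bracket tags, keeping the text between them."""
    return TAG_RE.sub("", text)


def straighten_quotes(text):
    """Replace typographic quotes with plain ASCII quotes."""
    return text.translate(QUOTES)
```

A simple pattern like this also deletes ordinary text that happens to sit between angle brackets, such as "a < b and c > d", so it is worth checking a sample of the output before cleaning a whole corpus.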


