encoding of text files

Roger Oberholtzer roger at opq.se
Wed May 25 04:12:14 PDT 2005


On Wed, 2005-05-25 at 09:52, Jorge Almeida wrote:
> On Wed, 25 May 2005, Roger Oberholtzer wrote:
> >>>
> >> "file -k myfile" yields "HTML document text\012- exported SGML document text"
> >> I tried with a UTF8 file and with a latin1 file, and the outcome is the
> >> same.
> >
> > When you say 'text' files, what exactly do you mean? Commonly, a file
> > called a 'text file' has nothing else in it other than text. The 'vi'
> > editor deals with 'text files'. No headers or anything. Is this what you
> > mean? Or do these files come from some application? Once there is some
> > header in the file, it is no longer a text file. It becomes whatever the
> > header implies.
> I mean the kind of file we edit with a text editor (vim is the one I
> use). The files are html files, which I download with wget or save through the
> browser or receive by e-mail, etc. (Word processors are _not_ involved
> in this.) Could it be that such files are not
> technically "text files"?
> I tried inserting the character '?' in the file and processing the file
> with the script; the outcome depends on whether the original file was
> latin1 or UTF8, so there must be some hidden information somewhere!
> >
> > Since you are dealing specifically with Portuguese, you could determine
> > the numeric representation of the specifically Portuguese letters in
> > latin1 and in UTF-8. Then, see which of these exist in the file. This
> > means looking at the file numerically, which Perl allows. Do you have
> > such a list of the representations?
> I can detect the encoding that way, but it's not very practical to do it
> in a script.
> I suppose I'll make two identical scripts with different encodings.
> Silly, but it will do the job, sort of :)

You don't need two scripts. Open the text file in binary mode (which is
a Perl detail I leave to you). Then read the file once, looking for the
numeric values used to represent the Portuguese characters in UTF-8, and
again looking for the values used to represent the same characters in
latin1. For each file, one of the two tests should find matches,
indicating which encoding was used.
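
Something along these lines would do it. This is only a rough, untested
sketch: the byte values are a handful of examples rather than a complete
list of the Portuguese characters, and the counting at the end is just
one way to compare the two tests:

#!/usr/bin/perl
use strict;
use warnings;

# Example byte values only, not a complete list.
# In latin1: 0xE1 = a-acute, 0xE3 = a-tilde, 0xE7 = c-cedilla,
# 0xE9 = e-acute, 0xF5 = o-tilde. In UTF-8 the same characters
# are two-byte sequences that start with the lead byte 0xC3.
my @latin1_bytes = (0xE1, 0xE3, 0xE7, 0xE9, 0xF5);
my $utf8_lead    = 0xC3;

my $file = shift or die "usage: $0 file\n";
open my $fh, '<:raw', $file or die "cannot open $file: $!\n";
local $/;                       # slurp the whole file as bytes
my $data = <$fh>;
close $fh;

my ($latin1, $utf8) = (0, 0);
for my $byte (unpack 'C*', $data) {
    $utf8++   if $byte == $utf8_lead;
    $latin1++ if grep { $_ == $byte } @latin1_bytes;
}
print $utf8 > $latin1 ? "looks like UTF-8\n" : "looks like latin1\n";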

The reason I suggest binary access to the file, and then looking for the
numeric values, is that your script does not have to contain the actual
visible characters. The numeric values can be manipulated in any editor.
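
In Perl, for instance, you can spell the values as hex escapes so the
script itself stays pure ASCII (hypothetical variable names, continuing
with $data from the sketch above):

my $latin1_atilde = "\xE3";        # a-tilde as a single latin1 byte
my $utf8_atilde   = "\xC3\xA3";    # the same character in UTF-8
print "has latin1 a-tilde\n" if $data =~ /$latin1_atilde/;
print "has UTF-8 a-tilde\n"  if $data =~ /$utf8_atilde/;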

As an example, in iso8859-1 the Swedish character 'å' is stored as
decimal value 229. In UTF-8 it is different (I don't have the UTF-8 list
here...). I would look for the numeric value 229 to see if the text file
contained 'å' in this encoding.
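
If you want to look the UTF-8 byte sequences up programmatically rather
than from a table, a small sketch using the standard Encode module will
print them (for 'å', code point U+00E5, it reports C3 A5):

use Encode qw(encode);
# Print the UTF-8 byte sequence for a few code points
# (a-ring, a-tilde and c-cedilla here).
for my $cp (0xE5, 0xE3, 0xE7) {
    my $bytes = encode('UTF-8', chr($cp));
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}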

Note that this method can work for you only because you are limited to
Portuguese characters rather than all possible languages. And it will
only work if the numeric encodings of the Portuguese-specific characters
differ between the possible encodings.

> >
> > FYI, SAMBA has an option wherein you can specify pairs of these numeric
> > representations. It then converts one into the other in things like file
> > names. This allows you to change the encoding of a remote system into
> > the local one. In SAMBA, you must specify these numeric pairs because
> > it has the same problem - the encoding is not contained in the file
> > name. We had to do this to convert DOS encoding <-> Roman8 on an HP
> > server.
> The server we use is Apache. Files are made on Linux or Mac OS X.
> >
> >
> >
> Thanks.
+----------------------------+-------------------------------+
| Roger Oberholtzer          |   E-mail: roger at opq.se     |
| OPQ Systems AB             |      WWW: http://www.opq.se/  |
| Kapellgränd 7              |                               |
| P. O. Box 4205             |    Phone: Int + 46 8   314223 |
| 102 65 Stockholm           |   Mobile: Int + 46 733 621657 |
| Sweden                     |      Fax: Int + 46 8   314223 |
+----------------------------+-------------------------------+


