encoding of text files

Jorge Almeida jalmeida
Wed May 25 02:28:38 PDT 2005


On Wed, 25 May 2005, Roger Oberholtzer wrote:
>>>
>> "file -k myfile" yields "HTML document text\012- exported SGML document text"
>> I tried with a UTF8 file and with a latin1 file, and the outcome is the
>> same.
>
> When you say 'text' files, what exactly do you mean? Commonly, a file
> called a 'text file' has nothing else in it other than text. The 'vi'
> editor deals with 'text files'. No headers or anything. Is this what you
> mean? Or do these files come from some application? Once there is some
> header in the file, it is no longer a text file. It becomes whatever the
> header implies.
I mean the kind of file we edit with a text editor (vim is the one I
use). The files are html files, which I download with wget or save through the
browser or receive by e-mail, etc. (Word processors are _not_ involved
in this.) Could it be that such files are not
technically "text files"?
I tried inserting the character '?' in the file and processing the file
with the script; the outcome depends on whether the original file was
latin1 or UTF8, so there must be some hidden information somewhere!
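
A plain text file normally carries no encoding label (apart from an optional
byte order mark, or a charset declaration inside the HTML itself), so the
difference most likely sits in the bytes used for the non-ASCII letters. A
minimal Python sketch of this, where the helper name byte_forms and the two
sample letters are assumptions chosen only for illustration:

```python
def byte_forms(ch: str) -> tuple[str, str]:
    """Return the hex bytes of one character in latin1 and in UTF-8."""
    return ch.encode("latin-1").hex(), ch.encode("utf-8").hex()


if __name__ == "__main__":
    # Sample Portuguese letters, written as escapes so the script itself
    # does not depend on the encoding of the terminal or editor.
    for ch in ("\u00e3", "\u00e7"):  # a with tilde, c with cedilla
        latin1_hex, utf8_hex = byte_forms(ch)
        print(f"U+{ord(ch):04X}  latin1={latin1_hex}  utf8={utf8_hex}")
```

Each of these letters is one byte in latin1 and two bytes in UTF-8, which
would explain why the same script behaves differently on the two kinds of
file.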
>
> Since you are dealing specifically with Portuguese, you could determine
> the numeric representation of the specifically Portuguese letters in
> latin1 and in UTF-8. Then, see which of these exist in the file. This
> means looking at the file numerically, which Perl allows. Do you have
> such a list of the representations?
I can detect the encoding that way, but it's not very practical to do it
in a script.
I suppose I'll make two identical scripts with different encodings.
Silly, but it will do the job, sort of :)
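
One way a single script might cope with both encodings is to test whether the
bytes are valid UTF-8 and fall back to latin1 otherwise. This is only a
heuristic sketch in Python, not a reliable detector: the function name
guess_encoding is made up for the example, and it assumes the file can only be
ASCII, UTF-8 or latin1. It works reasonably well in practice because latin1
text with accented letters is rarely valid UTF-8 by accident.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'ascii', 'utf-8' or 'latin-1' for the given bytes (heuristic)."""
    try:
        data.decode("ascii")
        return "ascii"  # no bytes above 0x7F, so the question does not arise
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as latin1.
        return "latin-1"


if __name__ == "__main__":
    sample = "a\u00e7\u00e3o"  # the word "acao" with cedilla and tilde
    print(guess_encoding(sample.encode("utf-8")))
    print(guess_encoding(sample.encode("latin-1")))
```

For a real file, the bytes would come from opening it in binary mode, for
example open(path, "rb").read(), and the result could then be passed to
bytes.decode() before any further processing.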
>
> FYI, SAMBA has an option wherein you can specify pairs of these numeric
> representations. It then converts one into the other in things like file
> names. This allows you to change the encoding of a remote system into
> the local one. In SAMBA, you must specify these numeric pairs because
> it has the same problem - the encoding is not contained in the file
> name. We had to do this to convert DOS encoding <-> Roman8 on an HP
> server.
The server we use is Apache. Files are made in Linux or Mac OS X.
>
>
>
Thanks.
-- 
Jorge Almeida

