encoding of text files
Roger Oberholtzer
roger
Wed May 25 00:40:30 PDT 2005
On Tue, 2005-05-24 at 22:37 +0100, Jorge Almeida wrote:
> On Tue, 24 May 2005, Michael Hipp wrote:
>
> > Jorge Almeida wrote:
> >> I only need files in Portuguese. The problem is that they come in latin1
> >> or UTF8...
> >> It's all about html files. I want to write accented characters and then
> >> filter the file through a script in order to replace them by html code.
> >> The script uses Perl substitution operator. Example:
> >> with "s/ç/\&ccedil;/g;" in the script,
> >> "poço" in a latin1 encoded file will be substituted by "po&ccedil;o"
> >> But if the file is UTF8 I get "poÃ§o". Nasty!
> >> Using "recode UTF8..latin1 file" would solve the problem, but one has to
> >> know that it's UTF8 encoded...
> >
> > Does the 'file' command do what you want?
> >
> "file -k myfile" yields "HTML document text\012- exported SGML document text"
> I tried with a UTF8 file and with a latin1 file, and the outcome is the
> same.
When you say 'text' files, what exactly do you mean? Commonly, a file
called a 'text file' has nothing else in it other than text. The 'vi'
editor deals with 'text files'. No headers or anything. Is this what you
mean? Or do these files come from some application? Once there is some
header in the file, it is no longer a text file. It becomes whatever the
header implies.
Re-read my earlier post. UTF-8 and other encodings cannot be determined
with certainty from file content alone. All of them use byte values
between 1 and 255 to represent something, and there is simply nothing in
the file to say what those values mean.
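
To make the point concrete, here is a small sketch (in Python rather than Perl, purely for illustration) showing that the same bytes decode without complaint under both encodings. The word "poço" is taken from the example quoted above; the variable names are mine.

```python
# The same byte sequence, read under two different assumptions.
raw = "poço".encode("utf-8")        # the bytes: p, o, 0xC3, 0xA7, o

as_utf8 = raw.decode("utf-8")       # two bytes 0xC3 0xA7 read as one letter
as_latin1 = raw.decode("latin-1")   # every byte read as its own letter

print(repr(as_utf8))    # 'poço'
print(repr(as_latin1))  # 'poÃ§o'
```

Neither decode fails, so the bytes alone do not say which reading was intended.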
Since you are dealing specifically with Portuguese, you could determine
the numeric representation of the specifically Portuguese letters in
latin1 and in UTF-8. Then, see which of these exist in the file. This
means looking at the file numerically, which Perl allows. Do you have
such a list of the representations?
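
A minimal sketch of that idea, assuming the letter list below covers the accented characters you actually use (it is my guess at a Portuguese set, not a complete one). It is written in Python for brevity; the same byte counting can be done in Perl. The function name `guess_encoding` is made up for this example. UTF-8 sequences are counted and removed first, because the UTF-8 lead byte 0xC3 is itself the latin1 code for "Ã".

```python
# Assumed, non-exhaustive list of accented letters used in Portuguese.
PT_LETTERS = "áàâãçéêíóôõúÁÀÂÃÇÉÊÍÓÔÕÚ"

def guess_encoding(data: bytes) -> str:
    """Heuristic guess: 'utf-8', 'latin-1' or 'unknown'."""
    utf8_hits = 0
    rest = data
    # Count the two-byte UTF-8 forms first and strip them out, so their
    # lead byte (0xC3) is not later mistaken for a latin1 "Ã".
    for ch in PT_LETTERS:
        seq = ch.encode("utf-8")
        utf8_hits += rest.count(seq)
        rest = rest.replace(seq, b"")
    # Whatever high bytes remain are counted as single-byte latin1 letters.
    latin1_hits = sum(rest.count(ch.encode("latin-1")) for ch in PT_LETTERS)
    if utf8_hits == 0 and latin1_hits == 0:
        return "unknown"  # plain ASCII, or letters outside the list
    return "utf-8" if utf8_hits >= latin1_hits else "latin-1"

if __name__ == "__main__":
    print(guess_encoding("poço".encode("latin-1")))
    print(guess_encoding("poço".encode("utf-8")))
```

This remains a guess: a latin1 file that really contains "Ã" followed by "§" would be counted as UTF-8. For ordinary Portuguese text that combination is unlikely, which is why the heuristic tends to work.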
FYI, SAMBA has an option wherein you can specify pairs of these numeric
representations. It then converts one into the other in things like file
names. This allows you to change the encoding of a remote system into
the local one. In SAMBA, you must specify these numeric pairs because
it has the same problem - the encoding is not contained in the file
name. We had to do this to convert DOS encoding <-> Roman8 on an HP
server.
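
The pair-mapping idea can be sketched as a 256-entry byte translation table. This is not how SAMBA does it internally, only an illustration of the principle. I am assuming "DOS encoding" means code page 850 and using Python's bundled `cp850` and `hp_roman8` codecs to derive the pairs; the names `build_table` and `dos_to_roman8` are hypothetical.

```python
def build_table(src: str = "cp850", dst: str = "hp_roman8") -> bytes:
    """Build a byte-to-byte table mapping encoding src onto encoding dst."""
    table = bytearray(range(256))      # start with every byte mapped to itself
    for b in range(128, 256):          # only the high half differs from ASCII
        try:
            out = bytes([b]).decode(src).encode(dst)
        except UnicodeError:
            continue                   # no counterpart in dst: leave unchanged
        if len(out) == 1:
            table[b] = out[0]
    return bytes(table)

DOS_TO_ROMAN8 = build_table()

def dos_to_roman8(name: bytes) -> bytes:
    """Translate a DOS-encoded byte string (e.g. a file name) to Roman8."""
    return name.translate(DOS_TO_ROMAN8)

if __name__ == "__main__":
    converted = dos_to_roman8("poço.txt".encode("cp850"))
    print(converted.decode("hp_roman8"))
```

Bytes with no counterpart in the target encoding are passed through unchanged here; a real conversion would have to decide what to do with them.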
--
Roger Oberholtzer <roger at opq.se>