encoding of text files

Jorge Almeida jalmeida
Tue May 24 09:51:14 PDT 2005


On Tue, 24 May 2005, Roger Oberholtzer wrote:

> On Tue, 2005-05-24 at 13:28, Jorge Almeida wrote:
>> How can I find out what encoding is used in a text file?
>> A command line utility (or, say, a Perl module) would be great, since I
>> need to use it in a script.
>
> That is tricky. The problem is that a text file may contain the same
> numeric value, but it will look different based on the encoding.
> However, there is nothing in the text file that tells this. MIME, for
I read somewhere that the first bytes contain some invisible information
about the encoding, but I don't know how to extract those bytes nor how
to interpret them.
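
The "invisible information" is presumably the byte order mark (BOM), an optional signature at the very start of some Unicode files: EF BB BF for UTF-8, FF FE or FE FF for UTF-16. A minimal sketch of how to look for it from a script, assuming "head -c" and "od" are available; "bom_of" is a made-up helper name, not an existing utility:

```shell
#!/bin/sh
# bom_of FILE: print which byte order mark, if any, FILE starts with.
# (bom_of is a hypothetical name used only for this sketch.)
bom_of() {
    # Dump the first three bytes as hex with no separators, e.g. "efbbbf".
    first=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$first" in
        efbbbf*) echo "utf-8" ;;
        fffe*)   echo "utf-16le" ;;   # a UTF-32LE BOM also starts with FF FE
        feff*)   echo "utf-16be" ;;
        *)       echo "none" ;;
    esac
}
```

This only helps when a BOM is actually there. Many UTF-8 files are written without one and latin1 files never carry one, so "none" does not say which of the two encodings is in use.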
>
> Perhaps look for a few key words in a likely language, and then use the
> common encoding for that language. For example, in Sweden there is a 99%
> chance a file is encoded in ISO8859-1. So if you find lyric discussions
> of meatballs (look for köttbullar, possibly Mammas) in the file...
>
I only need to handle files in Portuguese. The problem is that they come
in either latin1 or UTF8...
It's all about html files. I want to write accented characters and then
filter the file through a script in order to replace them with HTML
entities. The script uses the Perl substitution operator. Example:
 	with "s/ç/\&ccedil;/g;" in the script,
 	"poço" in a latin1 encoded file will be substituted by "po&ccedil;o".
 	But if the file is UTF8 I get "poÃ§o". Nasty!
Using "recode UTF8..latin1 file" would solve the problem, but
one has to know that it's UTF8 encoded...
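
When the only candidates are latin1 and UTF8, one way to decide is to check whether the file is valid UTF-8: accented latin1 text almost never is, because a byte such as E7 ("ç") would have to be followed by continuation bytes. The sketch below assumes an "iconv" command that exits with an error on invalid input; "guess_encoding" is a made-up helper name:

```shell
#!/bin/sh
# guess_encoding FILE: print "ascii", "utf-8" or "latin1".
# (guess_encoding is a hypothetical name; the answer is a guess, not a proof.)
guess_encoding() {
    # Count bytes with the high bit set by deleting all 7-bit bytes.
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    if [ "$high" -eq 0 ]; then
        # Pure ASCII reads the same in both encodings; nothing to convert.
        echo "ascii"
    elif iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        # Every multi-byte sequence is well formed, so treat it as UTF-8.
        echo "utf-8"
    else
        # Not valid UTF-8; with only two candidates, assume latin1.
        echo "latin1"
    fi
}
```

A filter script could then run "recode UTF8..latin1 file" only when guess_encoding prints "utf-8". The check can be fooled by a latin1 file whose accented bytes happen to form valid UTF-8 sequences (for instance a literal "Ã§"), but that is unlikely in ordinary Portuguese text.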

-- 
Jorge Almeida

