Double UTF-8

November 23, 2011

Have you ever run into national characters being double encoded as UTF-8?

On Linux it shows up as a strange capital "A" with unusual diacritics in filenames.

An example of such weird behavior, shown as hex dumps (the text is the Czech sentence "Příliš žluťoučký kůň úpěl dábelské ódy"):

This is the hex code for the double encoded text:

50 c4 b9 c2 99 c4 82 c2 ad 6c 69 c4 b9 c4 84 20 c4 b9 c5 be

6c 75 c4 b9 c4 bd 6f 75 c3 84 c2 8d 6b c4 82 cb 9d 20 6b c4

b9 c5 bb c4 b9 c2 88 20 c4 82 c5 9f 70 c3 84 c2 9b 6c 20 64

c4 82 c4 84 62 65 6c 73 6b c4 82 c5 a0 20 c4 82 c5 82 64 79

This is the correct UTF-8 encoded text:

50 c5 99 c3 ad 6c 69 c5 a1 20 c5 be 6c 75 c5 a5 6f 75 c4

8d 6b c3 bd 20 6b c5 af c5 88 20 c3 ba 70 c4 9b 6c 20 64 c3

a1 62 65 6c 73 6b c3 a9 20 c3 b3 64 79

The reason is rather simple: when the file was copied, its name was already in UTF-8, but the system thought it was in an 8-bit encoding and so recoded it into UTF-8 a second time.
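The effect can be reproduced on the command line. The following is a minimal sketch, assuming iconv and od are installed; the helper names to_hex and correct are made up for illustration. It takes the first word of the example ("Příliš") as correct UTF-8 bytes, misreads those bytes as ISO-8859-2 (the 8-bit charset that the hex dump above corresponds to) and encodes the result as UTF-8 again.

```shell
# Print stdin as one contiguous lowercase hex string.
to_hex() {
  od -An -tx1 | tr -d ' \n'
}

# "Příliš" as correct UTF-8 bytes (octal escapes keep printf portable).
correct() {
  printf 'P\305\231\303\255li\305\241'
}

correct_hex=$(correct | to_hex)
# Misread the UTF-8 bytes as ISO-8859-2 and encode them as UTF-8 again.
double_hex=$(correct | iconv -f ISO-8859-2 -t UTF-8 | to_hex)

echo "correct: $correct_hex"
echo "double:  $double_hex"
```

The "double" line should start with the same bytes as the double encoded dump above (50 c4 b9 c2 99 ...).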

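The fix is simple: convert the names from UTF-8 back to the 8-bit encoding the system wrongly assumed. The commands below use ISO-8859-1; the hex dumps above correspond to ISO-8859-2, so substitute that charset in such a case: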

recode -f UTF-8..ISO_8859-1

or

convmv -f UTF-8 -t ISO_8859-1 *

Unfortunately the second variant does not rename directories, while the first can be used inside a script that does the task bottom-up (from the file tree perspective). It could look like:

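# List every entry with its depth, deepest first, so children are renamed before their parents
find dir -printf "%d %p\n" | sort -n -r | while read -r DEPTH FN; do
  ON=$(basename "$FN")
  # Undo the second UTF-8 encoding of the name only, not of the whole path
  NN=$(printf '%s\n' "$ON" | recode -f UTF-8..ISO_8859-1)
  DN=$(dirname "$FN")
  if [ "x$ON" != "x$NN" ]; then
      mv "$FN" "$DN/$NN"
  fi
done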
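
The loop above assumes that every name really is double encoded. Here is a minimal sketch of a guard, assuming iconv is available (it is used here in place of recode) and that the wrongly assumed charset is ISO-8859-2, as in the hex dumps in this post. The function name undouble and the CHARSET variable are made up for illustration. A converted name is accepted only if the resulting bytes are themselves valid UTF-8; otherwise the original name is printed unchanged.

```shell
# Hypothetical helper: undo one layer of double UTF-8 encoding.
# CHARSET is the 8-bit encoding the system wrongly assumed (an assumption;
# ISO-8859-2 matches the hex dump in this post).
CHARSET=${CHARSET:-ISO-8859-2}

undouble() {
  # Step 1: map each character back to a single byte in CHARSET.
  # Step 2: keep the result only if those bytes form valid UTF-8.
  # If either step fails, fall back to the original name.
  fixed=$(printf '%s' "$1" | iconv -f UTF-8 -t "$CHARSET" 2>/dev/null) &&
    printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 &&
    [ -n "$fixed" ] || fixed=$1
  printf '%s\n' "$fixed"
}
```

Calling undouble on a name that is already correct UTF-8, or plain ASCII, should print it unchanged, so it could stand in for the recode call when a tree contains a mix of broken and healthy names.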
