Double UTF-8

Have you ever run into national characters that were encoded into UTF-8 twice?

On Linux it shows up in filenames as strange capital "A" letters with unusual diacritics.

Here is an example of such a garbled name, shown as a hex dump of the double-encoded text:
 
50 c4 b9 c2 99 c4 82 c2 ad 6c 69 c4 b9 c4 84 20 c4 b9 c5 be
6c 75 c4 b9 c4 bd 6f 75 c3 84 c2 8d 6b c4 82 cb 9d 20 6b c4
b9 c5 bb c4 b9 c2 88 20 c4 82 c5 9f 70 c3 84 c2 9b 6c 20 64
c4 82 c4 84 62 65 6c 73 6b c4 82 c5 a0 20 c4 82 c5 82 64 79
 
This is the correct UTF-8 encoded text ("Příliš žluťoučký kůň úpěl dábelské ódy"):
50 c5 99 c3 ad 6c 69 c5 a1 20 c5 be 6c 75 c5 a5 6f 75 c4
8d 6b c3 bd 20 6b c5 af c5 88 20 c3 ba 70 c4 9b 6c 20 64 c3
a1 62 65 6c 73 6b c3 a9 20 c3 b3 64 79

The reason is rather simple: when the file was copied, its name was already in UTF-8, but the system assumed it was in an 8-bit encoding and recoded it into UTF-8 a second time.
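
To see the effect for yourself, here is a minimal sketch that reproduces the double encoding on two characters. It assumes iconv and od are available, the helper names double_encode and hex are made up for this illustration, and the wrongly assumed 8-bit charset is taken to be ISO-8859-2, which is what the byte values in the dump above suggest:

```shell
# Hypothetical helper: misread stdin (already UTF-8) as ISO-8859-2 and
# convert it to UTF-8 once more, which is how the garbled names arise.
double_encode() {
    iconv -f ISO-8859-2 -t UTF-8
}

# Hypothetical helper: hex dump without offsets or spaces, for comparison.
hex() {
    od -An -tx1 | tr -d ' \n'
}

# "Př" in correct UTF-8 is 50 c5 99 (written here as octal escapes).
printf 'P\305\231' | double_encode | hex
```

This prints 50c4b9c299, the same five bytes that open the double-encoded dump.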

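The fix is simple: convert the name from UTF-8 back into the 8-bit charset the system wrongly assumed. The commands below use ISO-8859-1; the sample above decodes as ISO-8859-2 (for example, c4 b9 is Ĺ, which is byte c5 in ISO-8859-2), so ISO_8859-2 would be the target there: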

recode -f UTF-8..ISO_8859-1

or

convmv -f UTF-8 -t ISO_8859-1 *
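
Before touching any files, it can help to run the conversion on a single name as a filter and look at the bytes. This is only a sketch: it uses iconv instead of recode (an assumption that iconv is installed), the helper name undo_double is made up, and ISO-8859-2 is assumed as the intermediate charset because that is what the sample dump corresponds to:

```shell
# Hypothetical helper: undo one layer of double encoding on stdin.
# Assumes the wrongly assumed 8-bit charset was ISO-8859-2.
undo_double() {
    iconv -f UTF-8 -t ISO-8859-2
}

# First five bytes of the garbled sample: 50 c4 b9 c2 99 (octal escapes).
printf 'P\304\271\302\231' | undo_double | od -An -tx1
```

The dump should show the bytes 50 c5 99, which is how the correct UTF-8 text starts.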

Unfortunately the second variant does not rename directories (and convmv only prints what it would do unless --notest is given), while the first can be used inside a script that does the task bottom-up, renaming the deepest entries of the file tree first. It could look like:

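# List every entry with its depth, deepest first, so children are renamed before their parents.
find dir -printf "%d %p\n" | sort -n -r | while read -r DEPTH FN; do
    ON=$(basename "$FN")                                    # old name
    NN=$(printf '%s' "$ON" | recode -f UTF-8..ISO_8859-1)   # new name
    DN=$(dirname "$FN")                                     # parent directory
    # Rename only if the name changed and the result is not empty.
    if [ -n "$NN" ] && [ "x$ON" != "x$NN" ]; then
        mv -- "$FN" "$DN/$NN"
    fi
done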
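
A name that was never double encoded would get mangled by the conversion, so a more careful variant could check the result first. The following is a sketch only: fix_name is a made-up helper, it relies on iconv rather than recode, and it assumes ISO-8859-2 as the intermediate charset. It keeps the original name whenever the conversion fails or the result is not valid UTF-8:

```shell
# Hypothetical helper: print the repaired name, or the original name when
# the conversion fails or the result is not valid UTF-8 (the name was then
# most likely not double encoded). Assumes ISO-8859-2 in the middle.
fix_name() {
    fixed=$(printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-2 2>/dev/null) &&
        printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 &&
        { printf '%s\n' "$fixed"; return 0; }
    printf '%s\n' "$1"
}

fix_name "$(printf 'P\304\271\302\231')"   # garbled name, gets repaired
fix_name "$(printf 'P\305\231')"           # already correct, left unchanged
fix_name "plain.txt"                       # plain ASCII, left unchanged
```

The check is a heuristic: a correct name whose converted bytes happen to form valid UTF-8 would still be changed, so a dry run over the real file tree is worth doing before any mv.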
