Double UTF-8
Have you ever run into national characters that were encoded into UTF-8 twice?
On Linux it shows up in filenames as a strange capital "A" with unusual diacritics.
An example of such weird behavior is:
This is the hex code for the double encoded text:
50 c4 b9 c2 99 c4 82 c2 ad 6c 69 c4 b9 c4 84 20 c4 b9 c5 be
6c 75 c4 b9 c4 bd 6f 75 c3 84 c2 8d 6b c4 82 cb 9d 20 6b c4
b9 c5 bb c4 b9 c2 88 20 c4 82 c5 9f 70 c3 84 c2 9b 6c 20 64
c4 82 c4 84 62 65 6c 73 6b c4 82 c5 a0 20 c4 82 c5 82 64 79
This is the correct UTF-8 encoded text:
50 c5 99 c3 ad 6c 69 c5 a1 20 c5 be 6c 75 c5 a5 6f 75 c4
8d 6b c3 bd 20 6b c5 af c5 88 20 c3 ba 70 c4 9b 6c 20 64 c3
a1 62 65 6c 73 6b c3 a9 20 c3 b3 64 79
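The reason is rather simple: when the file was copied, its name was already in UTF-8, but the system assumed it was in an 8-bit encoding and recoded it into UTF-8 a second time. In the sample above (the Czech sentence "Příliš žluťoučký kůň úpěl dábelské ódy"), the byte pairs match ISO-8859-2 as the assumed 8-bit encoding: for example, the byte c5 was read as "Ĺ" and stored as c4 b9.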
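The effect is easy to reproduce on a short string. The sketch below is only an illustration: it uses iconv instead of recode, assumes ISO-8859-2 was the misread charset (which is what the byte pairs in the hex listing above suggest), and the helper name to_hex is made up for this example.

```shell
# Correctly encoded UTF-8 bytes for "Př" (50 c5 99), written as octal escapes.
good=$(printf 'P\305\231')

# Misread those bytes as ISO-8859-2 and encode them into UTF-8 once more.
double=$(printf '%s' "$good" | iconv -f ISO-8859-2 -t UTF-8)

# Hypothetical helper: print stdin as space-separated hex bytes on one line.
to_hex() { od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'; }

printf '%s' "$good"   | to_hex; echo   # 50 c5 99
printf '%s' "$double" | to_hex; echo   # 50 c4 b9 c2 99
```

The second line of output matches the first five bytes of the double encoded listing.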
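The way to undo it is simple: convert from UTF-8 back to the 8-bit encoding that was wrongly assumed (ISO_8859-1 below; substitute the charset that matches your files). Either pipe the name through
recode -f UTF-8..ISO_8859-1
or rename the files in place with
convmv -f UTF-8 -t ISO_8859-1 *
Note that convmv only prints what it would do until the --notest option is added.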
Unfortunately, the second variant does not rename directories, while the first can be used inside a script that does the task bottom-up (from the file tree's perspective), so that a directory is renamed only after everything inside it. It could look like this:
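# List every entry with its depth, deepest first, so that children are
# renamed before their parent directories.
find dir -printf "%d %p\n" | sort -n -r | while read -r DEPTH FN; do
    ON=$(basename "$FN")                                      # old name
    NN=$(printf '%s\n' "$ON" | recode -f UTF-8..ISO_8859-1)   # repaired name
    DN=$(dirname "$FN")
    # Rename only when the conversion actually changed the name.
    if [ "x$ON" != "x$NN" ]; then
        mv "$FN" "$DN/$NN"
    fi
done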
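Before renaming in bulk, it may be worth checking that a name really is double encoded, since converting a correctly encoded name would damage it. The following is a minimal sketch of such a check, not part of the script above: fix_name is a hypothetical helper, it uses iconv rather than recode, and it defaults to ISO-8859-2 as the assumed 8-bit charset. The check is a heuristic, so a correct name that happens to survive the round trip could still be misjudged.

```shell
# Hypothetical helper: print the repaired name and return 0 only if NAME
# looks double encoded via the 8-bit charset CS (default: ISO-8859-2).
fix_name() {
    name=$1
    cs=${2:-ISO-8859-2}
    # Step 1: undo the second encoding; fails if a character is not in CS.
    fixed=$(printf '%s' "$name" | iconv -f UTF-8 -t "$cs" 2>/dev/null) || return 1
    # Step 2: the result must itself be valid UTF-8.
    printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    # Step 3: plain ASCII names come back unchanged; nothing to repair.
    [ "$fixed" != "$name" ] || return 1
    printf '%s\n' "$fixed"
}

# Double encoded "Př" (50 c4 b9 c2 99) is repaired and printed.
fix_name "$(printf 'P\304\271\302\231')"

# A correctly encoded "Př" (50 c5 99) and a plain ASCII name are rejected.
fix_name "$(printf 'P\305\231')" || echo "already correct"
fix_name "plain.txt" || echo "nothing to fix"
```

In the loop above, the rename could then be guarded by a call such as NN=$(fix_name "$ON") and skipped whenever the helper returns a non-zero status.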