Difference between revisions of "Unicode"
imported>ThorstenStaerk
Revision as of 06:48, 16 May 2009
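*** Convert a file to UTF-8 ***
 convmv -f iso-8859-1 -t utf8 -r --notest <file>
 recode latin1..u8 <file>
Note that convmv converts the encoding of file names, while recode converts the file contents.
yudit (a Unicode text editor)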
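What a Latin-1 to UTF-8 conversion such as recode latin1..u8 has to do for each byte can be sketched in a few lines of C++. This is only an illustration of the byte arithmetic, not the way recode is actually implemented, and the function name latin1ToUtf8 is made up for this example.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 byte values are identical to the Unicode code points U+0000..U+00FF.
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            // ASCII range: one byte, unchanged.
            utf8 += static_cast<char>(c);
        } else {
            // 0x80..0xFF: two bytes, 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (c >> 6));
            utf8 += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return utf8;
}
```

For example, the Latin-1 byte f6 (ö) becomes the two bytes c3 b6.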
While programming html2mediawiki I ran into severe problems with sites that contain umlauts like ä or ö. So I took a deep dive into Unicode programming and want you to be able to use my findings.
Every text file has an encoding. That means you must know whether each character is stored in one byte, in two bytes, or in a varying number of bytes. Unicode aims to assign a number (a code point) to every character of every writing system in the world.
Here is some practice: Store a file containing
hellö world
in file.txt. Do:
 tweedleburg:~ # cat >file.txt
 hellö world
 tweedleburg:~ # cat file.txt
 hellö world
 tweedleburg:~ # hexdump -C file.txt
 00000000  68 65 6c 6c c3 b6 20 77  6f 72 6c 64 0a           |hell.. world.|
 0000000d
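This means every "normal" (ASCII) character has been stored in 1 byte, while the umlaut ö has been stored in 2 bytes (c3 b6). The file is 13 bytes long (0000000d) although it contains only 12 characters including the newline. That is Unicode's UTF-8 encoding.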
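The difference between bytes and characters in the hexdump above can be checked with a small C++ sketch. It assumes the input is valid UTF-8, and the function name countCodePoints is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count the characters (code points) in a UTF-8 string.
// Assumes valid UTF-8. Every byte of the form 10xxxxxx is a continuation
// byte, so only the bytes that do NOT match this pattern start a character.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (char ch : utf8) {
        if ((static_cast<unsigned char>(ch) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes of file.txt, size() on the string gives 13 while countCodePoints gives 12, because b6 is a continuation byte of ö.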
To show what Qt sees when it reads UTF-8, we store a file with the content
ü
and nothing else. The following code prints the first two bytes of the file:
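 // Open the file named by the first command line argument.
 QFile inputfile(args->url(0).fileName());
 inputfile.open(QIODevice::ReadOnly);
 // Read the raw bytes; Qt does not decode anything here.
 inputfilecontent = inputfile.read(inputfile.bytesAvailable());
 // byte is assumed to be a typedef for unsigned char, so values above 127 are not printed as negative numbers.
 kDebug() << "inputfilecontent.data()[0]" << (byte)inputfilecontent.data()[0];
 kDebug() << "inputfilecontent.data()[1]" << (byte)inputfilecontent.data()[1];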
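The UTF-8 encoded ü delivers
 195
 188
that is, the bytes 0xC3 0xBC. This does not depend on whether the system is little endian or big endian, because UTF-8 is defined as a sequence of single bytes.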
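How the two bytes 195 and 188 map back to the character ü (code point U+00FC, decimal 252) can be sketched in plain C++ without Qt. The function name decodeTwoByteUtf8 is made up for this example, and the sketch handles only two-byte sequences; returning U+FFFD for anything else is a choice made here, not a rule taken from the text above.

```cpp
// Hypothetical helper: decode one two-byte UTF-8 sequence into its code point.
// A two-byte sequence has the form 110xxxxx 10xxxxxx; the x bits, joined
// together, are the code point. Returns 0xFFFD (replacement character)
// if the bytes do not have this form.
unsigned int decodeTwoByteUtf8(unsigned char lead, unsigned char cont)
{
    const bool leadOk = (lead & 0xE0) == 0xC0;  // 110xxxxx
    const bool contOk = (cont & 0xC0) == 0x80;  // 10xxxxxx
    if (!leadOk || !contOk) {
        return 0xFFFD;
    }
    // Five payload bits from the lead byte, six from the continuation byte.
    return (static_cast<unsigned int>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

Calling decodeTwoByteUtf8(195, 188) yields 252 (0xFC), the code point of ü, and decodeTwoByteUtf8(0xC3, 0xB6) yields 0xF6, the code point of ö.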