Lesson 2: Character sets
Objective: Explore the problems posed by multilingual and multiplatform character sets.
This course has tried to steer clear of reading and writing text, that is, character-based data like "Q" or "The quick brown fox jumped over the lazy dog."
Reading and writing text is simple as long as you assume that everyone is reading and writing ASCII character data. However, in the modern world,
that's rarely true.
The Mac uses an extended 8-bit character set called MacRoman that contains many additional symbols like © and letters from non-English Latin alphabets.
Windows uses a different 8-bit character set based on ISO Latin-1 (Windows-1252) that has most of these symbols but maps them to different numbers.
The character ç is number 141 on the Mac but number 231 on Windows.
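The mismatch described above can be checked from Java itself. The following is a minimal sketch, not part of the original lesson: the class name `CedillaCodes` and the helper `singleByteCode` are made up for illustration, and the charset name `x-MacRoman` is an assumption about how your JDK names the optional MacRoman charset, so the code tests for it before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CedillaCodes {

    // Returns the unsigned value (0-255) of the single byte the charset produces
    // for the character, or -1 if the encoded form is not exactly one byte.
    // Note: getBytes substitutes the charset's replacement byte (usually '?')
    // for characters the charset cannot represent.
    static int singleByteCode(char c, Charset charset) {
        byte[] bytes = String.valueOf(c).getBytes(charset);
        return bytes.length == 1 ? (bytes[0] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        // c with cedilla, written as a Unicode escape so that the encoding
        // of this source file does not affect the result
        char cedilla = '\u00E7';

        System.out.println("ISO-8859-1: "
                + singleByteCode(cedilla, StandardCharsets.ISO_8859_1));

        // MacRoman is not one of the charsets every Java platform must support,
        // so check for it first ("x-MacRoman" is an assumed name).
        if (Charset.isSupported("x-MacRoman")) {
            System.out.println("MacRoman: "
                    + singleByteCode(cedilla, Charset.forName("x-MacRoman")));
        } else {
            System.out.println("MacRoman is not available on this JVM");
        }
    }
}
```

On a JVM that ships the MacRoman charset, the two printed numbers should differ in the way the lesson describes, even though both come from the same Java `char`.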
The problem only gets worse when you attempt to incorporate non-Roman alphabets like Greek, Cyrillic, Hebrew, and Arabic. Character sets used for
these languages often do not correspond to ASCII at all, and may not have ASCII character equivalents. When you consider languages written
with thousands of distinct characters, like Chinese, Japanese, and Korean, there's simply no longer any way to fit all the characters from even one of these languages into
eight bits, which can distinguish only 256 values. You have to move to a multibyte character set.
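To make the eight-bit limit concrete, here is a small sketch that asks whether a single Chinese character fits in an 8-bit character set and how many bytes it needs in two multibyte encodings. The class and method names (`MultibyteDemo`, `fits`, `encodedLength`) are hypothetical, and UTF-8 and UTF-16 are used only because every Java platform is required to support them. Legacy multibyte sets for these languages may or may not be installed on a given JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MultibyteDemo {

    // True if every character of the text can be represented in the charset.
    static boolean fits(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Number of bytes the text occupies once encoded in the charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // U+4E2D, the Chinese character for "middle", written as an escape
        String han = "\u4E2D";

        System.out.println("Fits in ISO-8859-1? "
                + fits(han, StandardCharsets.ISO_8859_1));
        System.out.println("Bytes in UTF-8:    "
                + encodedLength(han, StandardCharsets.UTF_8));
        System.out.println("Bytes in UTF-16BE: "
                + encodedLength(han, StandardCharsets.UTF_16BE));
    }
}
```

Checking with `canEncode` before writing avoids the silent substitution of `?` that `String.getBytes` performs for characters a charset cannot represent.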
There isn't one universally accepted standard for how to encode these languages. There are many different ways these characters are commonly encoded.
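Unicode was originally designed to provide encodings for all the most common characters in most of the world's current languages in two bytes (16
bits). It has since grown beyond what 16 bits can hold, and Java represents the additional characters as pairs of 16-bit char values. Java uses Unicode internally, and it uses a modified form of the UTF-8 encoding of Unicode in .class files for storing string literals. However, there are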
many Chinese, Japanese, Arabic, Hebrew, and other character sets in common use that are not Unicode. A means is needed for converting between the
Unicode text Java supports and these other character sets.
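One way such a conversion can look in practice is sketched below, using the standard `InputStreamReader` and `OutputStreamWriter` classes with an explicit `Charset`. This is an illustrative sketch rather than the course's own code: the class `CharsetConversion` and its `decode` and `encode` helpers are hypothetical names, and ISO-8859-1 and UTF-8 stand in for whatever character sets you actually need to convert between.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetConversion {

    // Decodes bytes stored in the given charset into Java's internal Unicode text.
    static String decode(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Encodes Java's internal Unicode text into bytes in the given charset.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(buffer, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // One byte, 0xE7, which is c with cedilla in ISO-8859-1
        byte[] latin1 = {(byte) 0xE7};

        String text = decode(latin1, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        System.out.println("Decoded code point: " + (int) text.charAt(0));
        System.out.println("UTF-8 length in bytes: " + utf8.length);
    }
}
```

The same pattern applies to files or sockets: wrap the underlying stream in a reader or writer and name the charset explicitly rather than relying on the platform default.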