One step we had glossed over in our haste to get something interesting on the screen in our Chinese/English dictionary program was the conversion from traditional to simplified Chinese characters.
The format of the hcutf8.txt file is a series of lines,
each of which is a UTF-8 encoded string consisting of a simplified
Chinese character followed by its traditional equivalents.
Often, multiple traditional characters map to a single
simplified character.
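A line like 汇匯彙, for instance, would say that the traditional characters 匯 and 彙 both simplify to 汇.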
Much more rarely—only twice in our data set—multiple
simplified characters map to a single traditional character.
Unfortunately, one of the cases is the common syllable
麼, which has two simplifications, either
么 or 麽, the first of which is far more productive.
We’ll have to keep an eye out for that one.
(Note also that in real life, the mapping is more complicated than a character-for-character substitution, but I’m willing to forgo that level of complexity because this is just for my personal use, and people will have realized I’m not a native speaker long before I get caught up in language subtleties like that.)
One could try to work out a fancy data structure to represent
this mapping table compactly, but it turns out that simple is
better here: an array of 65536 WCHARs, indexed by the traditional
character, each entry holding the corresponding simplification.
Most of the array will lie unused,
since the characters we are interested in lie in the range
U+4E00 to U+9FFF.
Consequently, the active part of the table is only about 40KB,
which easily fits inside the L2 cache.
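(Checking the arithmetic: U+4E00 through U+9FFF is 0x9FFF - 0x4E00 + 1 = 20,992 code points, and at two bytes per WCHAR that comes to 41,984 bytes.)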
It is important to know when a simple data structure is better than a complex one.
The hcutf8.txt file contains a lot of fluff that we
aren’t interested in. Let’s strip that out ahead of time so that
we don’t waste our time parsing it at run-time.
#!perl
$_ = <> until /^# Start zi/; # skip the uninteresting preamble
while (<>) {
    s/\r//g;
    # Each CJK character is three bytes in UTF-8, so a seven-byte line
    # is one simplified character, one traditional character, and a "\n";
    # if the two characters match, the mapping is a NOP.
    next if length($_) == 7 &&
            substr($_, 0, 3) eq substr($_, 3, 3); # ignore NOPs
    print;
}
Run the hcutf8.txt file through this filter to clean
it up a bit.
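If you save the script as, say, trim.pl (a name I just made up), that would go something like

perl trim.pl hcutf8.txt > hcutf8.tmp
move /y hcutf8.tmp hcutf8.txt

so the cleaned-up version takes the original’s place.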
Now we can write our “traditional to simplified” dictionary.
class Trad2Simp
{
public:
    Trad2Simp();

    // Returns the simplified form of chTrad,
    // or zero if we have no simplification on record.
    WCHAR Map(WCHAR chTrad) const { return _rgwch[chTrad]; }

private:
    WCHAR _rgwch[65536]; // woohoo!
};
Trad2Simp::Trad2Simp()
{
    ZeroMemory(_rgwch, sizeof(_rgwch));

    MappedTextFile mtf(TEXT("hcutf8.txt"));
    const CHAR* pchBuf = mtf.Buffer();
    const CHAR* pchEnd = pchBuf + mtf.Length();
    while (pchBuf < pchEnd) {
        const CHAR* pchCR = std::find(pchBuf, pchEnd, '\r');
        int cchBuf = (int)(pchCR - pchBuf);
        WCHAR szMap[80];
        DWORD cch = MultiByteToWideChar(CP_UTF8, 0, pchBuf, cchBuf,
                                        szMap, 80);
        if (cch > 1) {
            // The first character is the simplified form;
            // the rest are its traditional equivalents.
            WCHAR chSimp = szMap[0];
            for (DWORD i = 1; i < cch; i++) {
                if (szMap[i] != chSimp) {
                    _rgwch[szMap[i]] = chSimp;
                }
            }
        }
        // Advance unconditionally so a line that fails to convert
        // doesn't send us into an infinite loop.
        pchBuf = std::find(pchCR, pchEnd, '\n') + 1;
    }

    _rgwch[0x9EBC] = 0x4E48; // force 麼 to simplify to 么, not 麽
}
We read the file one line at a time, convert it from UTF-8, and for each nontrivial mapping, record it in our dictionary. At the end, we do our little 么 special-case patch-up.
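As a quick smoke test (my own addition, not something the dictionary itself needs), we can spot-check a few mappings, including the special case:

#include <assert.h>

void TestTrad2Simp()
{
    Trad2Simp t2s;
    assert(t2s.Map(L'萬') == L'万'); // 萬 simplifies to 万
    assert(t2s.Map(L'麼') == L'么'); // the patch-up wins over 麽
    assert(t2s.Map(L'万') == 0);     // no entry: caller leaves it alone
}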
Next time, we’ll use this mapping table to generate simplified Chinese characters into our dictionary.