October 5th, 2005

MSI Databases and Code Pages

Heath Stewart
Principal Software Engineer

A Windows Installer database is full of strings. Most of the time those strings cause no problems, because they use only the standard, printable characters found in all code pages. These are the ASCII characters, which occupy the first 7 bits (0x00 through 0x7F) and are the same in all code pages except for a few rare ones that exist for legacy support. If a Windows Installer database requires extended characters, that is, characters where the 8th bit is set (0x80 through 0xFF), then a code page is necessary to define how those characters are displayed. For example, decimal character 255 is ÿ in ANSI code page 1252 (ANSI – Latin1) but я in ANSI code page 1251 (ANSI – Cyrillic). The database code page is used to display strings on Windows 9x/Me and to convert strings to Unicode on Windows NT when calling the W functions.
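To make the byte 255 example concrete, here is a minimal sketch in Python. It uses Python's built-in `cp1252` and `cp1251` codecs as stand-ins for the Windows code pages, which is an assumption for illustration only; Windows Installer itself does not use Python.

```python
# The same byte value decodes to different characters under different code pages.
raw = bytes([0xFF])  # decimal 255, an extended character (8th bit set)

latin1_char = raw.decode("cp1252")    # U+00FF, Latin small letter y with diaeresis
cyrillic_char = raw.decode("cp1251")  # U+044F, Cyrillic small letter ya

# ASCII bytes (0x00 through 0x7F) decode identically under both code pages.
ascii_bytes = bytes(range(0x80))
ascii_is_shared = ascii_bytes.decode("cp1252") == ascii_bytes.decode("cp1251")
```

Here `latin1_char` and `cyrillic_char` differ even though the stored byte is identical, while `ascii_is_shared` is true. That is why an ASCII-only database does not depend on a code page.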

The recommendation is to use only ASCII characters, which lets you author a database with a neutral code page (0). Such a database can be used with any language. If you must include extended characters, set the code page for the database before importing any strings, or you risk corrupting the extended characters. This is common for localized product installation databases, since many languages require extended characters. Once you set the code page for a database, all imported text files must specify the same code page or the import will fail. A file to be imported, commonly referred to as an IDT archive file, looks like the following example:

Property	Value
s72	l0
1252	Property	Property
ProductLanguage	1033
ProductName	Microsoft Visual Studio 2005 Team Suite — ENU

The first row contains the column names and the second row contains their respective types. The third row contains the optional code page, followed by the required table name and an optional list of tab-delimited primary key column names. The example above is part of the Property table for Visual Studio 2005. I inserted 1252 as the code page for this example only; the English SKU uses only ASCII characters, so it does not actually need one.
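The three header rows described above can be read with a few lines of Python. This is only a sketch: `parse_idt_header` is a hypothetical helper, not part of any Windows Installer API, and it assumes the code page is present whenever the first field of the third row is purely numeric.

```python
def parse_idt_header(text):
    """Parse the three header rows of an IDT text file (hypothetical helper)."""
    lines = text.splitlines()
    if len(lines) < 3:
        raise ValueError("an IDT file needs at least three header rows")

    columns = lines[0].split("\t")  # row 1: column names
    types = lines[1].split("\t")    # row 2: column types
    third = lines[2].split("\t")    # row 3: [code page] table [primary keys...]

    # Assumption: a leading all-digit field is the optional code page.
    if third[0].isdigit():
        code_page, third = int(third[0]), third[1:]
    else:
        code_page = 0  # neutral code page when none is given

    if not third or not third[0]:
        raise ValueError("the table name is required")

    return {
        "columns": columns,
        "types": types,
        "code_page": code_page,
        "table": third[0],
        "primary_keys": third[1:],
    }


sample = "Property\tValue\ns72\tl0\n1252\tProperty\tProperty\nProductLanguage\t1033\n"
header = parse_idt_header(sample)
```

For the sample above, `header["code_page"]` is 1252, `header["table"]` is `"Property"`, and `header["primary_keys"]` is `["Property"]`. Dropping the leading `1252` field yields a code page of 0.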

You can easily display or change the code page for the database using WiLangId.vbs from the Windows Installer SDK, part of the Platform SDK. The same script handles the supported package languages and the product language used for strings not authored into the MSI database (such as Windows Installer error messages not in the Error table).

Unofficially, MSI databases do support UTF-7 and UTF-8 by specifying code pages 65000 and 65001, respectively. Encoded strings will store correctly and will be converted correctly when the W functions are called, but they may not display properly because the correct font for wide characters is not chosen.
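As a rough illustration of how the stored bytes depend on the code page, the sketch below maps a code page number to a Python codec and encodes the same character in two ways. `codec_for_code_page` is a hypothetical helper, and treating the neutral code page (0) as plain ASCII is an assumption made for this example, not documented Windows Installer behavior.

```python
def codec_for_code_page(code_page):
    """Map a Windows code page number to a Python codec name (hypothetical helper)."""
    special = {65000: "utf-7", 65001: "utf-8"}
    if code_page == 0:
        return "ascii"  # assumption: a neutral database holds ASCII only
    return special.get(code_page, f"cp{code_page}")


text = "\u044f"  # Cyrillic small letter ya

utf8_bytes = text.encode(codec_for_code_page(65001))  # two bytes: 0xD1 0x8F
ansi_bytes = text.encode(codec_for_code_page(1251))   # one byte: 0xFF

# Decoding with the matching code page recovers the original character.
round_trip_ok = all(
    text.encode(codec_for_code_page(cp)).decode(codec_for_code_page(cp)) == text
    for cp in (1251, 65000, 65001)
)
```

The character survives a round trip through any of these code pages, but only if the same code page is used to read it back. That is the reason the database code page has to be set before strings are imported.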

With this in mind, don’t be surprised if you use Orca to open a database whose code page differs from your current system code page and find that some characters are not displayed correctly (they will most likely appear as boxes or simply as the wrong character). The strings are being displayed or converted to Unicode according to the database code page.

It’s also important to note that the database code page is different from the Summary Information stream code page, which is property ID PID_CODEPAGE (1). The latter is the code page in which the summary information properties are encoded.

Author

Heath is an application architect and developer, looking to help educate others to learn professional development. Besides designing and developing applications he enjoys writing about intermediate and advanced topics. Heath also consults for deployment packages and scenarios within Microsoft and for external customers.
