Unicode Encoding

Introduction to Unicode Encoding

Unicode is a universal character encoding standard. It is a single character encoding scheme that allows characters from European, Greek, Arabic, Hebrew, Chinese, Japanese, Korean, Thai, Urdu, Hindi, and other world languages to be encoded in one character set. This enables applications to support text in multiple languages simultaneously in their data files. Unicode also covers most of the letters, punctuation marks, and technical symbols commonly used in the English language that are not covered by the legacy encoding.

Unicode defines two mapping methods:

•

UCS (Universal Character Set) encoding

•

UTF (Unicode Transformation Format) encoding

For more information on Unicode Encoding, visit http://unicode.org.

From Pro/ENGINEER Wildfire 4.0 onward, all string data in Pro/ENGINEER (previously stored in the legacy encoding format) is stored in Unicode encoding. Pro/ENGINEER Wildfire 4.0 uses UCS-2 encoding on Windows platforms and UCS-4 encoding in UNIX environments for widestring data. It reads and writes character data using the multibyte UTF-8 encoding on all platforms. UTF-8 is a variable-length character encoding format built from 8-bit units; it uses one to four bytes per character.
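The "one to four bytes per character" rule of UTF-8 can be illustrated with a small encoder. The following is a minimal sketch in standard C, not part of Creo TOOLKIT; the function name utf8_encode() is hypothetical and chosen only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Creo TOOLKIT function).
 * Encodes one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the byte count,
 * or 0 if cp cannot be encoded (a surrogate, or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 7-bit ASCII: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, an ASCII letter such as U+0041 occupies one byte, an accented letter such as U+00E9 occupies two, and characters outside the Basic Multilingual Plane occupy four.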

Some important terminology about string encoding related to Creo TOOLKIT that is used throughout this section is described as follows:

•

“Unicode encoding” refers to the string and widestring encodings used by Pro/ENGINEER Wildfire 4.0 and later.

•

“Legacy encoding” refers to the encoding used by Pro/ENGINEER Wildfire 3.0 and earlier. Depending on the language, this encoding is typically some version of an EUC encoding.

•

“Native encoding” refers to the encoding used by the operating system in the language in which the system is running. This encoding is the same as legacy encoding in most cases.

•

“Multibyte string” refers to a character array representing a string in the C language. Because of the limited size of the character (a single byte), combinations of multiple bytes are used to represent characters outside the ASCII range.

•

“7-bit ASCII” refers to the character range 0 through 127 (0x00 through 0x7F). This range is shared between Unicode and non-Unicode encodings used by Creo Parametric. Thus, any data of this type is unchanged after transcoding.

•

“8-bit ASCII” refers to the character range 128 through 255 (0x80 through 0xFF). In many European native encodings, this range is used to represent European accented vowels and other letters. In the UTF-8 encoding of Unicode, bytes in this range never represent a character on their own; they appear only as parts of multibyte sequences. Therefore, 8-bit ASCII native strings are not equivalent to their Unicode counterparts.

•

“Byte Order Mark” (BOM) refers to the character U+FEFF, which UTF-8 encodes as a sequence of three bytes (represented in C language strings by “\357\273\277”). It is placed at the beginning of a text file to indicate that the text is Unicode encoded. Unicode has designated the character U+FEFF as the BOM and reserved U+FFFE as an illegal character. Most of the text files generated by Creo Parametric are written with the BOM and Unicode encoding. As input, Creo Parametric can accept a Unicode encoded text file with a BOM, or a legacy encoded text file without a BOM.

•

“Transcoding” refers to the act of changing a string or widestring encoding from one encoding to another, for example, from platform native to Unicode or vice-versa. For some transcoding operations, there is a possibility of data loss, since characters from one encoding may not be supported in the target encoding.
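Two of the terms above translate directly into simple byte-level checks. The following is a minimal sketch in standard C; the helper names is_7bit_ascii() and has_utf8_bom() are hypothetical and are not Creo TOOLKIT functions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every byte of the string is in the
 * 7-bit ASCII range (0 through 127). Such a string is identical in
 * Unicode (UTF-8) and in the legacy and native encodings, so it needs
 * no transcoding. */
int is_7bit_ascii(const char *s)
{
    for (; *s != '\0'; ++s) {
        if ((unsigned char)*s > 127)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: returns 1 if the buffer begins with the
 * three-byte UTF-8 form of the BOM, U+FEFF. */
int has_utf8_bom(const char *buf, size_t len)
{
    return len >= 3 && memcmp(buf, "\357\273\277", 3) == 0;
}
```

An application reading a text file could, for example, call has_utf8_bom() on the first bytes of the file to decide whether to treat the contents as Unicode or as legacy encoded text, following the convention described for the BOM above.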

Unicode Encoding and Creo TOOLKIT

Pro/TOOLKIT applications running with Pro/ENGINEER Wildfire 4.0 and later must, by default, receive and send strings and widestrings to Pro/ENGINEER in Unicode encoding. This is a change from the encoding received by applications in Wildfire 3.0 and earlier. Because the workstation operating system may not be running in Unicode, and because other functions and libraries accessed by the Creo TOOLKIT application may not be Unicode aware, the Creo TOOLKIT application must deal with the change of encoding.

Make changes to the application to expect and accept Unicode strings and widestrings when dealing with Creo Parametric data. At the external interfaces from the application to the operating system or third-party APIs, perform necessary transcoding operations to ensure that those other systems receive an expected encoding.

PTC recommends that all applications be evaluated for Unicode compliance regardless of their purpose or intended data. However, applications that would particularly be affected by Unicode encoding are as follows:

•

Any Creo TOOLKIT application expected to work with Creo Parametric in any language other than English.

•

Any Creo TOOLKIT application expecting Creo Parametric data in any language other than English (where strings from that data are transferred to and from Creo Parametric or any other source).

Necessity of Unicode Compliance

It is strongly recommended that you make your existing Creo TOOLKIT applications Unicode-compliant for the following reasons:

•

Applications that are not Unicode-compliant will be unable to reliably handle Creo Parametric data saved in the Unicode format with strings (notes, annotations, tables, and so on) in multiple languages other than English. For example, a Creo Parametric drawing can now contain both German and Japanese notes. A Creo TOOLKIT application that is not Unicode compliant will not be able to read or modify those notes correctly. This could result in data loss or corruption.

•

Applications that do not consider the Unicode nature of Creo Parametric data may try to pass that data directly to the system or third-party APIs that do not recognize it correctly. This could cause data corruption or crashes.

•

Applications that do not transcode non-Unicode data into Unicode before using the data as strings inside Creo Parametric models will generate corrupt and incorrect models.

External Interface Handling

Creo TOOLKIT applications running in Unicode will need utilities that wrap the interfaces to third-party APIs that are not Unicode aware. While PTC cannot directly provide such utilities, this section discusses the considerations for creating them by showing how one external API, the C runtime library, can be used from a Unicode environment.

Any C runtime library function accepting char* or wchar_t* as input may be adversely affected by receiving Unicode data. Typically, it should be possible to create a simple wrapper for each C runtime interface used in the application, where the input string to the wrapper is expected to be in Unicode. The wrapper should transcode the string before calling the system API. Examples of such C runtime functions are listed below (this is not an exhaustive list):

•

fopen()

•

access(), _access()

•

chdir(), readdir(), opendir()

•

chmod(), _chmod()

•

_findfirst(), _findnext()

•

getcwd()

•

getenv()

•

open(), opendir()

○

fgetc()

○

fgets()

○

fputc()

○

fputs()

○

fread(), fwrite()

○

puts()

•

remove(), stat(), system(), tmpfile(), unlink()
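The wrapper approach described above can be sketched for fopen() as follows. This is an illustration only, under a loudly stated assumption: the native encoding is taken to be ISO-8859-1 so that the transcoding step fits in a few lines. The names utf8_to_latin1() and fopen_from_unicode() are hypothetical and are not Creo TOOLKIT functions; a real application would use whatever transcoding facility is appropriate for its platform and native encoding.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper. ASSUMPTION: the native encoding is ISO-8859-1.
 * Transcodes a UTF-8 string into dst. Returns 0 on success, or -1 if
 * the input is malformed, contains a character that ISO-8859-1 cannot
 * represent (data loss), or does not fit in dst. */
int utf8_to_latin1(const char *src, char *dst, size_t dst_size)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    if (dst_size == 0)
        return -1;

    while (*p != '\0') {
        unsigned int ch;

        if (*p < 0x80) {
            /* 7-bit ASCII is unchanged by transcoding */
            ch = *p++;
        } else if ((*p == 0xC2 || *p == 0xC3) && (p[1] & 0xC0) == 0x80) {
            /* two-byte sequences covering U+0080 through U+00FF */
            ch = ((unsigned int)(*p & 0x1F) << 6) | (unsigned int)(p[1] & 0x3F);
            p += 2;
        } else {
            /* malformed, or not representable in the target encoding */
            return -1;
        }

        if (n + 1 >= dst_size)
            return -1;              /* keep room for the terminator */
        dst[n++] = (char)ch;
    }
    dst[n] = '\0';
    return 0;
}

/* Hypothetical wrapper: accepts a Unicode (UTF-8) path, transcodes it
 * to the assumed native encoding, then calls the C runtime fopen().
 * Returns NULL if transcoding fails or fopen() fails. */
FILE *fopen_from_unicode(const char *utf8_path, const char *mode)
{
    char native_path[1024];

    if (utf8_to_latin1(utf8_path, native_path, sizeof native_path) != 0)
        return NULL;
    return fopen(native_path, mode);
}
```

The wrapper refuses to open the file when transcoding would lose data, rather than passing a corrupted name to the operating system. The same pattern applies to the other functions in the list above.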

Special External Interface: printf() and scanf() Functions

The printf() and scanf() family of C runtime functions are a special case of an external interface. The format string must be transcoded when these interfaces are called. The list of variable arguments passed to the functions may also contain string and widestring data that needs to be transcoded and modified in format. Because of the complexity of wrapping these C runtime functions, PTC has provided a standard Creo TOOLKIT function equivalent for each. These functions support all the format specifiers and modifiers supported by the C language specification.

Functions Introduced:

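ProTKPrintf(), ProTKFprintf(), ProTKSprintf(), ProTKSnprintf(), ProTKScanf(), ProTKFscanf(), ProTKSscanf(), ProTKSnscanf()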

The function ProTKPrintf() provides the Unicode equivalent to the C runtime function printf(). The formatted characters are sent to stdout, and the function returns the number of characters written. The output data is transcoded to the native encoding format, which may result in out-of-locale characters in the results.

The function ProTKFprintf() provides the Unicode equivalent to the C runtime function fprintf(). The formatted characters are written to the file, and the function returns the number of characters written. The file receives the data in the Unicode-encoded format.

The functions ProTKSprintf() and ProTKSnprintf() provide the Unicode equivalent to the C runtime functions sprintf() and snprintf() respectively. The formatted characters are copied into the output buffer, and the functions return the number of characters copied.

The function ProTKScanf() provides the Unicode equivalent to the C runtime function scanf(). This function parses the contents of the input from stdin. The output data in the string or character format is in Unicode encoding.

The function ProTKFscanf() provides the Unicode equivalent to the C runtime function fscanf(). This function parses the contents of the input from a file.

The functions ProTKSscanf() and ProTKSnscanf() provide the Unicode equivalent to the C runtime functions sscanf() and snscanf() respectively.

The Unicode equivalents of the C runtime functions v*printf() and v*scanf(), which take a variable argument list instead of a variable number of arguments, are also provided in the form of the ProTKV*printf() and ProTKV*scanf() functions.

Special External Interface: Windows-runtime Functions

Win32 functions that take char* inputs are not Unicode compliant, and thus cannot be used with data obtained directly from Pro/ENGINEER Wildfire 4.0 and later. The simplest approach to using Windows runtime functions is to use the functions accepting wchar_t* inputs, since these functions are Unicode compliant (the Windows native encoding for wchar_t* is Unicode). For example, use the function GetMessageW() instead of GetMessage() or GetMessageA().

Special External Interface: Hardcoded Strings

Another example of an external interface is a hardcoded string. You should review all uses of hardcoded strings in your application and ensure that they fit the following categories:

•

They use only 7-bit ASCII characters or wide characters.

•

They use Unicode escape sequences.

8-bit ASCII or non-Unicode escape sequences in hardcoded strings do not work correctly unless you transcode the string into Unicode before sending it to Creo Parametric.
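A review of hardcoded strings can be supported by a check that a string is well-formed UTF-8. The following is a minimal sketch in standard C (C99); the function is_valid_utf8() and the sample string names are hypothetical, and the sketch assumes that "Unicode escape sequences" in a char string means escapes for the UTF-8 bytes of the character.

```c
#include <stddef.h>

/* Hypothetical helper (not a Creo TOOLKIT function).
 * Returns 1 if s is well-formed UTF-8, otherwise 0. Rejects stray
 * 8-bit bytes, truncated sequences, overlong forms, surrogates, and
 * values above U+10FFFF. */
int is_valid_utf8(const char *s)
{
    /* smallest code point allowed for each sequence length */
    static const unsigned long min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        unsigned long cp;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;   /* stray continuation byte or invalid lead byte */
        ++p;

        for (int i = 0; i < extra; ++i, ++p) {
            if ((*p & 0xC0) != 0x80)
                return 0;            /* missing continuation byte */
            cp = (cp << 6) | (unsigned long)(*p & 0x3F);
        }

        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}

/* Sample hardcoded strings (illustrative names). */
const char name_ascii[]  = "BRACKET_01";   /* 7-bit ASCII only: safe       */
const char name_utf8[]   = "caf\303\251";  /* U+00E9 as UTF-8 bytes: safe  */
const char name_latin1[] = "caf\351";      /* 8-bit ASCII: must transcode  */
```

In this sketch, is_valid_utf8() accepts name_ascii and name_utf8 but rejects name_latin1, which is the kind of string that must be transcoded into Unicode before it is sent to Creo Parametric.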