The first step in assessing your systems for Unicode-readiness is understanding the terminology for characters and encodings. Find the information you need to get started on this page.
A character set is a list of characters, and an encoding scheme represents those characters in the system as ones and zeroes (binary data). When storing text as binary data, you must specify the encoding for that text. An encoding scheme is also necessary to transfer data between systems.
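To illustrate why the encoding must be specified, here is a minimal Python sketch. The sample word and the choice of Windows-1252 as the mismatched encoding are assumptions made for illustration only.

```python
text = "café"

# Encoding turns characters into bytes; the encoding is named explicitly.
data = text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
same = data.decode("utf-8")

# Decoding with a different encoding (here Windows-1252) does not fail,
# but it silently produces the wrong characters.
garbled = data.decode("cp1252")

print(data)
print(same)
print(garbled)
```

The bytes themselves carry no label saying which encoding produced them, which is why systems exchanging text need to agree on one.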
Unicode is an international encoding standard created in the early 1990s. Its goal was to include all the characters used in any of the world’s living languages. Since then, it has undergone significant changes.
To process Unicode data, all the system’s data stores need to be configured to store data in Unicode's standard encodings. In B.C. government systems, the standard encoding is UTF-8. Database companies such as Oracle provide utilities for converting non-Unicode databases to Unicode/UTF-8.
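As a rough sketch of what a conversion to UTF-8 involves at the level of a single value, the Python below decodes bytes from a known legacy encoding and re-encodes them as UTF-8. The helper name to_utf8 and the sample data are hypothetical, and this is not a substitute for a database vendor's conversion utility.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding to UTF-8 (hypothetical helper)."""
    # errors="strict" raises an exception instead of silently dropping
    # or replacing bytes that are invalid in the source encoding.
    text = raw.decode(source_encoding, errors="strict")
    return text.encode("utf-8")


# Sample value as it might be stored in a Windows-1252 system (assumed data).
legacy = "Montréal".encode("cp1252")
converted = to_utf8(legacy, "cp1252")

print(legacy)
print(converted)
```

The conversion is only correct when the source encoding is known. Guessing the wrong one produces corrupted text rather than an error in many cases.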
There are many things that impact how IM/IT systems process languages. The following is a data architecture entity relationship diagram showing how terms like "byte", "font", "encoding", "grapheme", "glyph" and "character set" relate to one another:
Character sets such as ASCII, ISO-8859-1 and Unicode each comprise a collection of characters. An encoding mechanism digitally encodes those characters as ones and zeros. UTF-8 and UTF-16 are examples of encodings for Unicode.
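The difference between a character set and an encoding can be sketched in Python. The euro sign is used here only as a sample character: it has a single Unicode code point, but UTF-8 and UTF-16 represent that code point with different bytes.

```python
ch = "€"

# The character set assigns the character a number (its code point).
code_point = ord(ch)

# Each encoding turns that same code point into a different byte sequence.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")  # big-endian, without a byte order mark

print(f"U+{code_point:04X}")
print(utf8_bytes.hex())
print(utf16_bytes.hex())
```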
What we see as a single character could actually be several characters superimposed on one another to create a grapheme. For example, "ç" (c cédille) can be formed by combining the Latin character 'c' with a superimposed cedilla accent.
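A short Python sketch of this, using the standard unicodedata module: the same grapheme can be stored as one precomposed character or as a base letter plus a combining mark. The two forms look identical but do not compare equal unless they are normalized first.

```python
import unicodedata

precomposed = "\u00e7"  # single code point: Latin small letter c with cedilla
decomposed = "c\u0327"  # 'c' followed by a combining cedilla

# Both display as the same grapheme, but their lengths differ.
print(len(precomposed), len(decomposed))

# A plain comparison treats them as different strings.
print(precomposed == decomposed)

# Normalizing to a common form (NFC composes, NFD decomposes) makes them comparable.
composed_again = unicodedata.normalize("NFC", decomposed)
split_again = unicodedata.normalize("NFD", precomposed)
print(composed_again == precomposed, split_again == decomposed)
```

This matters for searching and matching: a system comparing names or addresses may need to normalize text before comparing it.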
How a grapheme appears when displayed on the screen or paper is determined by the font used. BC Sans, for example, is a font influencing the visual representation of graphemes.
Many older IM/IT systems allowed users to type using only the characters available on the US ASCII keyboard. Complex systems, in particular, have not been modernized, because changing them carries the risk of service delivery issues such as data loss, data corruption or security problems. Careful planning and execution are needed to reduce errors and modernize successfully.
Some of our applications run on the z/OS® (mainframe) operating system. Their data is encoded in Extended Binary Coded Decimal Interchange Code (EBCDIC), which predates the widespread use of the American Standard Code for Information Interchange (ASCII).
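To show how EBCDIC differs from ASCII at the byte level, the Python sketch below uses the cp037 codec, which is one EBCDIC code page that ships with Python. This is an assumption for illustration: the code page used by any particular mainframe application may be different.

```python
word = "HELLO"

# The same letters map to different byte values in each encoding.
ascii_bytes = word.encode("ascii")
ebcdic_bytes = word.encode("cp037")  # cp037 is one EBCDIC code page (assumed here)

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# Reading EBCDIC bytes requires decoding with the matching code page.
round_trip = ebcdic_bytes.decode("cp037")
print(round_trip)
```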
Most of our current systems use ASCII or a limited extension of ASCII such as ISO-8859-1 (Latin1) or Windows-1252. These consume one byte of storage for each character when digitally encoded.
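The storage difference can be sketched in Python. The sample word is an assumption chosen for illustration: in ISO-8859-1 every character takes one byte, while in UTF-8 the accented letter takes two, so the byte length no longer equals the character count.

```python
word = "Québec"

latin1_bytes = word.encode("latin-1")  # ISO-8859-1: one byte per character
utf8_bytes = word.encode("utf-8")      # UTF-8: one to four bytes per character

print(len(word))          # number of characters
print(len(latin1_bytes))  # bytes in ISO-8859-1
print(len(utf8_bytes))    # bytes in UTF-8
```

One practical consequence is that a field sized in bytes for a single-byte encoding may be too small for the same text once it is stored as UTF-8.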
Potential issues with existing programs:
EBCDIC and ASCII Limitations:
ASCII and 8-bit extended character sets don’t cover all characters used in Indigenous languages in B.C. Unicode is the only character set that includes characters for Indigenous languages in B.C.
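As a minimal sketch of this limitation, the Python below checks whether a sample word can be represented in several encodings. The helper can_encode is hypothetical, and the word is written with escape sequences so that its code points are unambiguous.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the encoding (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample word containing U+1E35 (k with line below),
# U+0331 (combining macron below) and U+00FA (u with acute).
name = "S\u1e35wx\u0331w\u00fa7mesh"

results = {
    encoding: can_encode(name, encoding)
    for encoding in ("ascii", "latin-1", "cp1252", "cp037", "utf-8")
}

for encoding, ok in results.items():
    print(encoding, ok)
```

In this sketch only UTF-8 succeeds. The single-byte encodings have no byte value assigned to some of the characters, so the text cannot be stored in them without loss.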
Use the terminology you have learned to complete your Unicode-readiness assessment. The next step in your assessment is to review system components.