
Lesson 160: Unicode and wide strings



Up to this point, our strings have been limited to ASCII characters stored in 8-bit chars; however, many applications today that require strings will be used in multilingual settings where it will be necessary to display text in many different languages. By restricting your application to English (or European languages, in general), you are excluding the majority of humans. Based on a calculation at this website:

Script or alphabet    Absolute numbers    Percentage
Latin                 2.6 billion         36
Chinese               1.3 billion         18
Devanagari            1.0 billion         14
Arabic                1.0 billion         14
Cyrillic              0.3 billion         4
Dravidian             0.3 billion         4

In addition, Korean, Japanese, Greek, Hebrew and Amharic, as well as a few other alphabets and scripts, are dominant in specific countries.

The std::string class gives you access to only one third of your possible audience, and for no other reason than the choice of script or alphabet.

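Fortunately, the wide string type std::wstring allows you to store Unicode characters as wide characters of type wchar_t, which are typically 16 bits wide on Windows and 32 bits wide on Linux and macOS. Together with the wide string class, you must also use the wide version of std::cout, which is std::wcout.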

First, your string literal must be prefixed by a capital L:

#include <iostream>
#include <string>

int main() {
	std::wstring msg = L"Hello world!";
	std::wcout << msg << std::endl;

	return 0;
}

If you now want to include a Unicode character, you must write its four-hexadecimal-digit code preceded by \u. You can find these codes on, for example, Wikipedia.

#include <iostream>
#include <string>

int main();

int main() {
	size_t const N{3};   // one entry per string assigned below
	std::wstring wstr[N];

	wstr[0] = L"Latin alphabet (\""
		  L"\\u0041\\u0042...\\u0059\\u005a"
		  L"\"): "
		  L" \u0041\u0042...\u0059\u005a";

	wstr[1] = L"Croatian letters (\""
		  L"\\u01c4\\u01c5\\u01c6\\u01c7\\u01c8\\u01c9\\u01ca\\u01cb\\u01cc"
		  L"\"): "
		  L" \u01c4\u01c5\u01c6\u01c7\u01c8\u01c9\u01ca\u01cb\u01cc";

	wstr[2] = L"Romanian letters (\""
		  L"\\u0218\\u0219\\u021a\\u021b"
		  L"\"): "
		  L" \u0218\u0219\u021a\u021b";

	for ( size_t i = 0; i < N; ++i ) {
		std::wcout << wstr[i] << std::endl;
	}

	return 0;
}

The problem with this is, however, that not all fonts are able to print all of these characters. Consequently, you should pick a font appropriate for the letters you wish to display. An OpenType font is restricted to having only 65536 glyphs, while Unicode allows for 17 planes of 65536 characters each, so no single font can display every Unicode character.

For example, this file unicode.cpp attempts to print out all of the valid Unicode characters from \u0001 to \uffff. The console I am using is not very useful: it can only print out a very small number of the characters, as shown in output.txt.

Questions and practice:

1. You will notice that the default date is midnight on January 1, 1970. This is time 0 for Unix, which uses a signed 32-bit integer counting seconds from that time. Approximately what day is the time 0x3fffffff, and approximately what day is 0xffffffff?

You may wish to read up on the Year 2038 problem. The type date_t is always guaranteed to store as many bits as there are on the current computer. How many bytes does date_t occupy on your computer?

2. If your computer uses a signed 64-bit integer to store a date, will it be able to store the approximate date on which the planet Earth will be vaporized, projected to be approximately five billion years from the present?

