Thayumanasamy Somasundaram's Tamil Pages

Unicode and its importance to Tamil usage in computers

An article by Thayumanasamy Somasundaram

[http://tamil.somasundaram.us/ta-unicode.html]

Original: January 14, 2009 | Updated: Feb 16, 2010

What is Unicode?

I will use the definition given by the Unicode Consortium itself to answer the question:

"Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." [From Unicode Consortium’s webpage]

யூனிக்கோடு என்றால் என்ன?

"யூனிக்கோடு எந்த இயங்குதளம் ஆயினும், எந்த நிரல் ஆயினும், எந்த மொழி ஆயினும் ஒவ்வொரு எழுத்துக்கும் தனித்துவமான எண் ஒன்றை வழங்குகிறது". [யூனிக்கோடு ஒன்றியம் இணையத்திலிருந்து]

Encoding Standards (ASCII)

As we all know, computers store, handle, and transmit information using binary digits (bits). This idea is famously expressed by the saying that computers store information as "0s" and "1s", "On" and "Off", or "Up" and "Down". So, for example, if you type, store, and transmit the English capital letter "A", the computer must represent it in an agreed way. Only then can another system reproduce exactly the same letter when it reads, forwards, or prints the text, even if that system is a different platform (think PC, Mac, iPhone, web server, color printer, etc.), runs a different operating system (think Windows, Mac OS, Linux/UNIX, Palm OS, etc.), and uses a different application (think web browser, word processor, e-mail client, PDF viewer, etc.). Computer experts therefore devised standards to fulfill these requirements.

In the early 1960s, when the English language dominated the computing world, ASCII (American Standard Code for Information Interchange), a 7-bit character encoding, was adopted as the standard for storing and exchanging information in and between computer systems (EBCDIC, an 8-bit character encoding, and several other standards existed before and during that time, but they are beyond the scope of this article). Seven bits give 128 possible values (2⁷=128; the base 2 comes from binary and the power 7 from the 7 bit positions). The values 0 to 127 were assigned to 33 control (non-printable) characters, the space, A-Z, a-z, 0-9, and several punctuation marks. So the capital letter "A" was assigned a decimal value of 65, a binary value of 100 0001, or a hexadecimal value of 41 (hexadecimal uses the digits 0 to 9 and A through F; a total of 16). Any computer could then interpret this value back as the capital letter "A". Soon it became necessary to support European languages like German, French, and Spanish with their acute accents (Á), umlauts (Ü), grave accents (Ò), and tildes (Ñ), as well as the British pound (£), Japanese yen (¥), and copyright (©) signs. So the computer experts designed Extended ASCII encodings (the best known is ISO-8859-1, or Latin-1) that use all 8 bits of a byte (2⁸=256), extending the range to 256 positions (0 to 255) while retaining the original 0-127 values. After all, most computers at that time read data in bytes of 8 bits, so why not use the last bit as well?
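The arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of any standard.

```python
# Capital letter A in decimal, 7-bit binary, and hexadecimal.
code = ord("A")
binary_value = format(code, "07b")   # 7 binary digits, as in 7-bit ASCII
hex_value = format(code, "02X")      # 2 hexadecimal digits

print(code, binary_value, hex_value)   # prints: 65 1000001 41

# Reading the value back gives the same letter.
letter = chr(code)

# Number of distinct values available with 7 bits and with 8 bits.
ascii_values = 2 ** 7
extended_values = 2 ** 8
print(ascii_values, extended_values)   # prints: 128 256
```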

However, when computers became more commonplace in the 1980s, non-US and non-European countries decided to use the extended set to include the characters of their own languages. This meant that positions 128 to 255 (past the first 0 to 127 values) could encode different characters depending on which encoding the sender used (for example, Greek: ISO-8859-7 and Hebrew: ISO-8859-8), and the receiver could not interpret them correctly without knowing the original encoding scheme. To complicate matters, if the information being exchanged involved two or more non-Latin languages, it became impossible to use them in the same document (in addition to the English letters of the 0-127 range).
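The ambiguity described above is easy to reproduce. The sketch below decodes one and the same byte with three of the ISO-8859 encodings named in this article, using Python's built-in codecs; the byte value 0xE1 is simply an example picked for illustration.

```python
# One byte from the upper half (128-255) of an 8-bit encoding.
raw = bytes([0xE1])

# The same byte means a different character in each encoding.
encodings = ["iso-8859-1", "iso-8859-7", "iso-8859-8"]
decoded = {name: raw.decode(name) for name in encodings}
for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Bytes in the ASCII range (0-127) mean the same thing in all three.
ascii_part = {name: b"A".decode(name) for name in encodings}
```

Without knowing which encoding the sender used, the receiver cannot tell whether the byte was meant as a Latin, Greek, or Hebrew letter.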

Tamil Encoding Standards (TSCII)

As the Internet started to become more popular in the 1990s, users of several languages written in non-Latin scripts wanted to post, spread, and share their content online. But they found it difficult to exchange and display their languages in web browsers. Soon these pioneers started to devise their own encoding systems, assigning certain values to certain letters (glyphs) and supplying the encoding information in special files (and fonts) bundled with documents or offered as separate downloads. The early developers and web pioneers for the Tamil language included Dr. K. Kalyanasundaram (Kalyan) of Switzerland (www.tamilelibrary.org), Muthu Nedumaran (Muthu) of Malaysia (www.murasu.com), and their colleagues at the Standard for Tamil Computing (STC) and Tamil Net.

Dr. Kalyan started to develop an 8-bit bilingual character encoding scheme for Tamil in the late 1990s and named it Tamil Standard Code for Information Interchange (TSCII). As discussed before, like other non-Latin language pioneers, Kalyan used positions 128 through 255 to encode or map various Tamil characters: vowels (உயிர்; அ), consonants (மெய்; ப்), combinations (உயிர்மெய்; பூ), and add-on signs like (ி), (ை), and (ா). The TSCII encoding scheme went through several modifications to accommodate or streamline some features, finally settling at version 1.7 (current), which is shown in Fig 1. In 2007 the TSCII encoding was registered with the Internet Assigned Numbers Authority (IANA). Following the Tamil Net 1999 Conference, the Tamil Nadu Government came up with two more encoding schemes, called TAM (Tamil monolingual; Tamil-only glyphs for positions 0-255) and TAB (Tamil bilingual; English for 0-127 and Tamil for 128-255), that were different from TSCII. In addition, it gave suggestions for Tamil typewriter key positions. The TAB encoding is shown in Fig 2, and the Extended ASCII table is shown in Fig 3 for comparison with the two encoding schemes discussed above.

Fig 1. TSCII 1.7 from Kalyan and others www.tamil.net/tscii/tscii.html

Fig 2. TAB encoding from TN Gov www.tamilvu.org/Tamilnet99/annex4.htm

Fig 3. Extended ASCII encoding www.lookuptables.com

Proprietary Encoding/Fonts

The situation became more complicated when leading Tamil magazines and newspapers like Ananda Vikatan, Kalki, Kumudam, Dina Mani, Daily Thanthi, and Dina Malar devised their own proprietary encoding schemes and fonts. So by the early 2000s it became difficult to browse, type, or exchange e-mails unless everyone had the same encoding/font scheme. Many readers will recall that without the proper fonts installed on their computers, reading Tamil magazine and newspaper articles was difficult, and at times the computer would display funny characters instead of Tamil letters. The user would then be prompted to download the proprietary font and restart the browser before continuing.

As the number of publications and applications (web browsers, e-mail clients, mobile phones) increased, the number of fonts and schemes mushroomed. The Tamil Nadu Government and the Tamil Virtual University devised their own font and scheme. Indian Script Code for Information Interchange (ISCII) is another encoding, supported by the Government of India. Whereas TSCII encoded the characters in written (visual) order, ISCII encoded them in logical order (similar to Unicode). No one wanted to adopt anyone else’s scheme, yet the web pioneers wanted to unite Tamil users behind a uniform standard. One such solution is Tamil Unicode. Please note that there are still many active discussions about the advantages and disadvantages of the Tamil encoding in the current Unicode Standard 5.1.0, and by no means is the situation settled. But that is beyond the scope of this article.
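The difference between written order and logical order mentioned above can be seen in a small Unicode example. The sketch below, using Python's standard unicodedata module, looks at the syllable கெ, in which the vowel sign is drawn to the left of the consonant. It shows only the Unicode side; Python has no built-in TSCII codec, so the written-order side is described in a comment rather than demonstrated.

```python
import unicodedata

# The syllable "ke": on screen the vowel sign appears to the LEFT of the consonant.
syllable = "\u0B95\u0BC6"   # கெ

# In Unicode (logical order) the consonant is stored first, the vowel sign second.
names = [unicodedata.name(ch) for ch in syllable]
for ch, name in zip(syllable, names):
    print(f"U+{ord(ch):04X}", name)

# A written-order scheme, as this article describes TSCII, would instead
# store the vowel sign before the consonant, matching what the eye sees.
```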

Tamil Unicode

Every version of the Unicode Standard, from version 1.0.0 (Oct 1991; 24 scripts) to the current version 5.1.0 (Apr 2008; 75 scripts), has included Tamil as one of the scripts it supports (note that Unicode refers to scripts rather than languages, since a script like Latin is used by several languages such as English, Italian, etc.). The consortium assigned 128 positions to Tamil (U+0B80 to U+0BFF, in hexadecimal), with positions for vowels (ஈ; U+0B88), consonants with inherent vowels (த; U+0BA4), dependent vowel signs (ி; U+0BBF), Tamil numerals (௪; U+0BEA), and special Tamil symbols (௹; U+0BF9). Several positions are left open for future inclusions (only 72 of the 128 positions have actually been used). Written with six hexadecimal digits, the Tamil code points run from U+000B80 to U+000BFF, where the leading 00 is the Unicode plane (plane 0, the Basic Multilingual Plane) and 0B80 to 0BFF is the range of the Tamil block within that plane.
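The code points of individual Tamil characters can be looked up with Python's standard unicodedata module. The following is a minimal sketch; the constant names TAMIL_START and TAMIL_END are chosen here for illustration, and the character names printed come from whatever Unicode version ships with the Python interpreter in use.

```python
import unicodedata

# Range of the Tamil block, in hexadecimal.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF
block_size = TAMIL_END - TAMIL_START + 1   # 128 positions

# A vowel, a consonant, a vowel sign, and a numeral.
samples = [chr(0x0B88), chr(0x0BA4), chr(0x0BBF), chr(0x0BEA)]   # ஈ த ி ௪

info = {}
for ch in samples:
    info[ch] = (f"U+{ord(ch):04X}", unicodedata.name(ch))
    print(*info[ch])

# Every sample falls inside the Tamil block.
in_block = all(TAMIL_START <= ord(ch) <= TAMIL_END for ch in samples)
```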

Unicode itself can be stored in more than one encoding form, but this article deals only with the most prevalent one, UTF-8 (8-bit Unicode Transformation Format). UTF-8 is a variable-length character encoding scheme for Unicode: it maps each code point to a sequence of one to four 8-bit bytes. The first byte of a sequence tells how long it is: one-byte sequences use 00-7F (the same values as ASCII), two-byte sequences start with C2-DF, three-byte sequences start with E0-EF, and four-byte sequences start with F0-F4. Each Tamil code point (U+0B80 to U+0BFF) takes three bytes.
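The variable length of UTF-8 can be observed directly by encoding a few characters in Python. This is a minimal sketch; the helper name utf8_hex and the choice of sample characters are made here for illustration, and the last sample (an emoji outside the scope of this article) is included only to show a four-byte sequence.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))

# An English letter, a Latin-1 letter, a Tamil letter, and an emoji.
samples = ["A", "\u00D1", "\u0BA4", "\U0001F600"]   # A, Ñ, த, and U+1F600

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X}", len(encoded), utf8_hex(ch))
```

The Tamil letter த (U+0BA4) comes out as the three bytes E0 AE A4, and its first byte falls in the E0-EF range that marks a three-byte sequence.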

Tamil Unicode Writers and Editors

I will list some Tamil Unicode writers and editors here. There may be more, but I found the following to be very useful. Some writers ask you to type transliterated Tamil in Latin letters and press a button to get the words in Tamil script. For example, type ammaa or ammA to get அம்மா, or en peyar sOmasuwtharam to get என் பெயர் சோமசுந்தரம் (note some quirks like w for ந், sO for சோ, and so for சொ).

http://ezilnila.com/tane/unicode_Writer.htm (Online)
http://www.suratha.com/reader.htm (Online)
http://yesudas.rs.googlepages.com/tamilunicodewriter (Online)
http://yesudas.rs.googlepages.com/WOG_UniPad.zip (Download and use it without an Internet connection | Stand-alone product)
http://www.higopi.com/ucedit/Tamil.html (Online)
http://software.nhm.in (Download software and use it offline | Windows XP/Vista)

Unicode Resources

I will list some Unicode general and Tamil specific resources below. By no means is the list exhaustive, but I found the following to be very helpful.

Unicode Consortium (www.unicode.org)
Wikipedia Unicode pages (en.wikipedia.org/wiki/Unicode)
Tamil Unicode Character Charts (http://unicode.org/charts/PDF/U0B80.pdf)
Alan Wood’s Unicode Resources (http://www.alanwood.net/unicode/tamil.html)
Acharya- IIT, Madras site (http://acharya.iitm.ac.in/tamil/tamil_unicode.php)
Tamil Pad Unicode Editor (http://www.tamilpad.com/)
Tamil Script (http://en.wikipedia.org/wiki/Tamil_script)
Fileformat Info (http://www.fileformat.info/info/unicode/char/0b85/index.htm)
Indic Script at Unicode (http://unicode.org/faq/indic.html)
Unicode Standard for South Asian Scripts (Ch 9)
Project Madurai (Tamil Literature Repository)
Kalyan’s Tamil Electronic Library (http://www.tamilelibrary.org/)
Unicode Tamil Font Gallery (http://www.wazu.jp/gallery/Fonts_Tamil.html)
SALRC at UChicago
Suratha’s Unicode Web Presentation (http://www.suratha.com/tamilunicode.html)
எழில் நிலா (How TO Unicode)
Gopi’s Tools (http://www.higopi.com/tools/)
ISO Language code for Tamil (ISO 639-1: ta)
Penn State’s TLT

ஒருங்குறி (http://ta.wikipedia.org/wiki/%E0%AE%92%E0%AE%B0%E0%AF%81%E0%AE%99%E0%AF%8D%E0%AE%95%E0%AF%81%E0%AE%B1%E0%AE%BF)

Advantages of Tamil Unicode

There are several advantages to using the Tamil Unicode encoding scheme over any other. First, it is non-proprietary and is becoming the worldwide standard. Next, major software companies like Microsoft, Apple, Sun, and others support Unicode, which means the standard is here to stay. Then, if you use Unicode, you can exchange e-mails, participate in web forums, and even search the web in Tamil itself. For example, let us say you want to search for the word “Tamil” using Google. If you use the search term “Tamil”, you are not likely to find pages that use the alternate spellings “Tamizh”, “Thamiz”, and “Thamil” (granted, most people will use Tamil). With Unicode you can search for “தமிழ்” and not worry about alternate spellings. The situation becomes more interesting if you search for names like “Sambandhar”, with alternate spellings such as “Champandhar”, “Sampanthar”, etc.; with Unicode you simply search for “சம்பந்தர்”.