This document and its embedded contents are (C) Copyright 2000 by
David E. Down / Gonksoftware. All trademarks that may be mentioned in
this document, with or without (tm) or (R), are held by their
respective entities.

==============================================================================

For all the computer code that is included in this file:

Permission is granted to copy, distribute and/or modify the computer
code under the terms of the GNU General Public License, Version 2.0 or
any later version published by the Free Software Foundation;

For this entire document:

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.1 or
any later version published by the Free Software Foundation;

If this document or its derivatives are not supplied with the
documentation about these licenses, they are available at
http://www.gnu.org/

If you want to distribute this document or its derivatives, you must
accompany this document with the documentation about the GNU General
Public License version 2.0, and the GNU Free Documentation License 1.1

==============================================================================

This is the rough draft text for my implementation of UTF16-ext. Note
that some areas in this text may not be presented correctly. If you
have questions about this encoding scheme, mail me at
faeries@mpx.com.au

Disclaimer:

This document, and the proposal contained within it, bears no official
status from any entity whatsoever. Also, nothing in this document is
approved, licensed, ratified or endorsed by the Unicode(R) Consortium,
ISO, and their related entities.

Overview:

Since UTF-16 is unable to address characters beyond 0x10FFFF, this
document describes a new surrogate system for UTF-16 which is backward
compatible with UCS-2 and the original UTF-16, and which can address
values up to 0x7FFFFFFF.

Features:

  1. Able to address 31 bits of information.
  2. 100% backward-compatible with UTF-16.
  3. Able to use 0xDC00 - 0xDFFF as characters, instead of reserving
     them solely for the surrogate mechanism.
  4. Text encoded in UTF16-ext can pass through a legacy UCS-2 /
     UTF-16 parser with little or no loss of information.

The following sample code describes the UTF16-ext encoder.

    /* Note! The exploitation of this code is subject to the GNU GPL
     * version 2. Information about the GNU GPL version 2 is at
     * http://www.gnu.org
     *
     * wc    = ISO-10646 31-bit character code (0 .. 0x7FFFFFFF).
     * utf16 = short int array where UTF16-ext information is stored;
     *         it must have room for three words.
     *
     * Output: The number of 16-bit words written to utf16 (1, 2 or 3).
     */
    int convert_to_utf16(int wc, unsigned short *utf16)
    {
        if (wc <= 0xffff && (wc < 0xd800 || wc > 0xdbff)) {
            /* One word: any 16-bit value except 0xD800 - 0xDBFF. */
            utf16[0] = (unsigned short) wc;
            return 1;
        } else if (wc >= 0x10000 && wc <= 0x10ffff) {
            /* Two words: same layout as an original UTF-16 pair. */
            utf16[0] = (unsigned short) (0xd7c0 + (wc >> 10));
            utf16[1] = (unsigned short) (0xdc00 + (wc & 0x3ff));
            return 2;
        } else {
            /* Three words: 0xD800 - 0xDBFF and values above 0x10FFFF. */
            utf16[0] = (unsigned short) (0xd800 + (wc >> 21));
            utf16[1] = (unsigned short) (0xd800 + ((wc >> 11) & 0x3ff));
            utf16[2] = (unsigned short) (0xe000 + (wc & 0x7ff));
            return 3;
        }
    }

Note! Although 0xE000 - 0xE7FF is also used by the surrogate mechanism,
this does not mean that these characters are no longer available for
use. I propose that 0xDC00 - 0xE7FF must be passed through at face
value whenever they do not occur immediately after 0xD800 - 0xDBFF.

=======================
Interpreting UTF16-ext.
=======================

All values between 0x0000 - 0xD7FF and 0xDC00 - 0xFFFF can be
represented at face value. Values of 0x10000 and above must be encoded
as surrogates, see below. Although 0xDC00 - 0xE7FF can be used as
terminators in surrogate sequences, these values must be interpreted
at face value whenever they do not appear immediately after
0xD800 - 0xDBFF.

A value between 0xD800 - 0xDBFF introduces a surrogate sequence. It
must be followed by a value between 0xDC00 - 0xDFFF, or by another
0xD800 - 0xDBFF with 0xE000 - 0xE7FF after that.
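The word ranges described so far can be summarised in a few lines of
C. This is only a minimal sketch: the names u16x_class, u16x_classify
and u16x_length are hypothetical and are not part of the proposal.
u16x_classify reports the role a single 16-bit word may play, and
u16x_length reports how many words an ISO-10646 value would need,
assuming the value lies in 0 .. 0x7FFFFFFF.

```c
/* Hypothetical helpers, not part of the original proposal. */
enum u16x_class {
    U16X_PLAIN,   /* always a character by itself                    */
    U16X_LEAD,    /* 0xD800 - 0xDBFF: introduces a surrogate sequence */
    U16X_TRAIL2,  /* 0xDC00 - 0xDFFF: may end a two-word sequence     */
    U16X_TRAIL3   /* 0xE000 - 0xE7FF: may end a three-word sequence   */
};

/* Classify one 16-bit word by range only. A TRAIL2 or TRAIL3 word is
 * still a plain character when it does not follow a LEAD word. */
enum u16x_class u16x_classify(unsigned int word)
{
    if (word >= 0xD800 && word <= 0xDBFF) return U16X_LEAD;
    if (word >= 0xDC00 && word <= 0xDFFF) return U16X_TRAIL2;
    if (word >= 0xE000 && word <= 0xE7FF) return U16X_TRAIL3;
    return U16X_PLAIN;
}

/* Number of 16-bit words needed for wc (assumed 0 .. 0x7FFFFFFF). */
int u16x_length(unsigned long wc)
{
    if (wc >= 0xD800 && wc <= 0xDBFF) return 3; /* needs the long form */
    if (wc <= 0xFFFF)   return 1;
    if (wc <= 0x10FFFF) return 2;
    return 3;
}
```

A caller could use u16x_length to size an output buffer before
encoding, and u16x_classify to decide how many further words to read.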
The only way to encode characters between 0xD800 - 0xDBFF is to use
the three word (six byte) surrogate sequence. How a decoder handles
illegal surrogate sequences is undefined.

Legend:

  first  = first word in surrogate sequence.
  second = second word in surrogate sequence.
  third  = third word in surrogate sequence.
  value  = the ISO-10646 value.

Surrogate Sequence #1:

  0xD800 - 0xDBFF ... 0xDC00 - 0xDFFF

  value = (((first & 0x3ff) << 10) | (second & 0x3ff)) + 0x10000;

Surrogate Sequence #2:

  0xD800 - 0xDBFF ... 0xD800 - 0xDBFF ... 0xE000 - 0xE7FF

  value = ((first & 0x3ff) << 21) | ((second & 0x3ff) << 11) |
          (third & 0x7ff);

Surrogate Sequence #1 must be used for character values between
0x10000 and 0x10FFFF. Surrogate Sequence #2 must be used for character
values between 0x110000 and 0x7FFFFFFF, and also for characters
between 0xD800 and 0xDBFF.

For example:

  0x00A0 0xDE00 0xD87E 0xDF23 0x01E4 0x1E7E 0xD934 0xD894 0xE505 0xE432

The 0xDE00 after 0x00A0 must be interpreted as 0xDE00 because it does
not appear after 0xD800 - 0xDBFF, but 0xDF23 is part of a surrogate
pair, because it appears after 0xD87E, which is within the range
0xD800 - 0xDBFF. Also, 0xD934 0xD894 0xE505 demonstrates the 31-bit
surrogate, but the character 0xE432 is not part of a surrogate because
it does not appear after two words of 0xD800 - 0xDBFF.

Example Decoder for UTF16-ext:

Here is a decoder that shows one way in which UTF16-ext can be decoded.

    /* Note! The exploitation of this code is subject to the GNU GPL
     * version 2. Information about the GNU GPL version 2 is at
     * http://www.gnu.org
     *
     * word = A UTF16-ext word value.
     *
     * Output: An ISO-10646 character, or
     *         -1 requesting another UTF16-ext word to complete the
     *         sequence
     */
    int decode_utf16ext (int word)
    {
        /* prev1 is the previous word, prev2 the word before that. */
        static int prev2 = 0, prev1 = 0;
        int output;

        if (word < 0xD800 || word >= 0xE800) {
            output = word;
        } else if (word >= 0xD800 && word <= 0xDBFF) {
            /* Part of a surrogate sequence: more input is needed. */
            output = -1;
        } else if (word >= 0xDC00 && word <= 0xDFFF) {
            if (prev1 >= 0xD800 && prev1 <= 0xDBFF) {
                /* Surrogate Sequence #1: prev1 is the first word. */
                output = (((prev1 & 0x3FF) << 10) | (word & 0x3FF))
                         + 0x10000;
            } else {
                output = word;
            }
        } else { /* word is in 0xE000 - 0xE7FF */
            if (prev1 >= 0xD800 && prev1 <= 0xDBFF &&
                prev2 >= 0xD800 && prev2 <= 0xDBFF) {
                /* Surrogate Sequence #2: prev2 is the first word,
                 * prev1 the second. */
                output = ((prev2 & 0x3FF) << 21) |
                         ((prev1 & 0x3FF) << 11) | (word & 0x7ff);
            } else {
                output = word;
            }
        }
        prev2 = prev1;
        prev1 = word;
        return output;
    }

Note! The above code does not handle illegal sequences of
0xD800 - 0xDBFF. One could improve the code so that it reports illegal
sequences, or provides some sort of "fallback".

This concludes the document about the UTF16-ext encoding. If you like
UTF16-ext, I would like to hear from you.
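
As a complement to the word-at-a-time decoder, the following is a
minimal sketch of a decoder that works on a whole buffer and looks
ahead instead of keeping static state. The name u16x_decode_buffer and
its calling convention are assumptions made for illustration only.
Because the proposal leaves illegal sequences undefined, the sketch
simply passes a dangling 0xD800 - 0xDBFF word through at face value,
which is just one possible fallback.

```c
#include <stddef.h>

/* Hypothetical buffer decoder, not part of the original proposal.
 * in, n : UTF16-ext words and their count.
 * out   : receives ISO-10646 values; must have room for n entries.
 * Returns the number of values written to out. */
size_t u16x_decode_buffer(const unsigned short *in, size_t n,
                          unsigned long *out)
{
    size_t i = 0, count = 0;

    while (i < n) {
        unsigned long w = in[i];

        if (w >= 0xD800 && w <= 0xDBFF) {
            /* Surrogate Sequence #1: lead, then 0xDC00 - 0xDFFF. */
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                out[count++] = (((w & 0x3FF) << 10) |
                                (in[i + 1] & 0x3FFul)) + 0x10000;
                i += 2;
                continue;
            }
            /* Surrogate Sequence #2: lead, lead, then 0xE000 - 0xE7FF. */
            if (i + 2 < n &&
                in[i + 1] >= 0xD800 && in[i + 1] <= 0xDBFF &&
                in[i + 2] >= 0xE000 && in[i + 2] <= 0xE7FF) {
                out[count++] = ((w & 0x3FF) << 21) |
                               ((in[i + 1] & 0x3FFul) << 11) |
                               (in[i + 2] & 0x7FFul);
                i += 3;
                continue;
            }
            /* Illegal sequence: fall through and emit the word as is. */
        }
        out[count++] = w;
        i++;
    }
    return count;
}
```

Applied to the ten-word example stream given earlier, this sketch
produces seven values: 0xD87E 0xDF23 decodes to 0x2FB23,
0xD934 0xD894 0xE505 decodes to 0x2684A505, and the remaining words
are returned at face value.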