UTF-16 - CompWisdom
About us  |  Why use us?  |  Press  |  Contact us

 

Topic: UTF-16



  
 ongoing · Characters vs. Bytes
UTF: Along with the characters, Unicode also defines methods for storing them in byte sequences in a computer.
Past the BMP, planes 1 through 16 are sometimes humorously called the “astral planes” and are used for exotic, rare, and historically important characters.
Many people assumed that 16 bits of address space is all you'd ever need, then repeated the error with 32 bits.
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF   (2663 words)

  
 FAQ - UTF and BOM
A: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence.
This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence S to a sequence of bytes and back will produce S again.
http://www.unicode.org/faq/utf_bom.html#22   (4895 words)

  
 unicode.html
When such ASCII strings are encoded in the UTF-32 and UTF-16 formats, they become interspersed with bytes of the form 00, which represent the NULL control character.
There is also the general possibility of UTF-16/32 bytes being interpreted as 7-bit ASCII when this was not the intention, which could cause major problems.
It thus requires 21 binary bits to represent the largest value, and might be called a "21 bit charset." In earlier versions Unicode had a smaller codespace and 16 bits was sufficient.
http://homepage.mac.com/thgewecke/unicode.html   (338 words)
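
The interspersed 00 bytes described in this entry can be seen in a minimal sketch. The helper name `ascii_to_utf16be` is hypothetical, and the sketch assumes 7-bit ASCII input and big-endian serialization without a byte order mark.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes each 7-bit ASCII character of `s` as one
   big-endian UTF-16 code unit (two bytes) into `out`.
   Returns the number of bytes written; `out` must hold 2 * strlen(s) bytes. */
size_t ascii_to_utf16be(const char *s, uint8_t *out)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        out[n++] = 0x00;         /* high-order byte is always 00 for ASCII */
        out[n++] = (uint8_t)*s;  /* low-order byte is the ASCII value */
    }
    return n;
}
```

For the input "Hi" this produces the four bytes 00 48 00 69, so every second byte is the NULL control character when the data is read as plain ASCII.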

  
 UTF : Java Glossary
You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros.
UTF strings are interconverted to ordinary Strings during I/O by readUTF and writeUTF or by using Readers and Writers with an encoding.
The resulting pair of 16-bit characters lies in the so-called high-half zone or high surrogate area (0xD800-0xDBFF) and low-half zone or low surrogate area (0xDC00-0xDFFF).
http://mindprod.com/jgloss/utf.html   (729 words)
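
As a rough illustration of the two surrogate areas, the following sketch classifies a 16-bit code unit using the ranges given by the Unicode Standard (U+D800..U+DBFF for high surrogates, U+DC00..U+DFFF for low surrogates). The function names are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `u` lies in the high surrogate area, 0xD800..0xDBFF. */
bool is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* True if `u` lies in the low surrogate area, 0xDC00..0xDFFF. */
bool is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}
```

A well-formed UTF-16 string only ever contains a high surrogate immediately followed by a low surrogate; any other arrangement of these values is an error in the data.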

  
 [darcs-users] UTF-16 (was: Default binary masks)
Anywhere you process UTF-16, you are dealing with 16-bit code units.
Ruby, C, and Java byte stream accessors all return single bytes (although Java returns the bytes as ints, which are 32 bit, the ints only contain 8 significant bits).
If your underlying file access is on an octet basis (as it would be in most of the systems in this discussion), then you read, write and move 2 octets at a time on that level.
http://abridgegame.org/pipermail/darcs-users/2003/000734.html   (936 words)

  
 Unicode Transformation Formats
Besides the incompatibilities there is also the argument that it is wasteful to have one character occupy 16 or 32 bits instead of 8 bits because that would double or quadruple file sizes and memory images.
A fixed length of 16 bits has the problem that only 2^16 == 65'536 characters can be encoded.
Besides that, the UTF representation of Latin1's accented letters contains the original code prefixed by a pound sign (£), which means that its readability is retained in Latin1 applications.
http://czyborra.com/utf   (5676 words)

  
 PEP 100 -- Python Unicode Integration
The internal format for Unicode objects should use a Python specific fixed format implemented as 'unsigned short' (or another unsigned numeric type having 16 bits).
The Unicode API should provide interface routines to and from the compiler's wchar_t, which can be 16 or 32 bit depending on the compiler/libc/platform being used.
1.6: Changed to since this is the name used in the implementation.
http://www.python.org/peps/pep-0100.html   (4012 words)

  
 UTF-8: What is It and Why is It Important
In other words, UTF-16 or UTF-32 require 16 or 32 bits of storage for most characters instead of a single byte required by the series of ISO-8859 encodings.
When a string of 16 or 32 bit values are processed as a series of byte values, the value
This complicates and confuses existing text processing algorithms, leading to miscalculated string lengths, oddly concatenated strings, and search failures.
http://www.joconner.com/javai18n/articles/UTF8.html   (1579 words)
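
The miscalculated string lengths mentioned here can be reproduced with a small sketch. Both function names are hypothetical, and the input is assumed to be UTF-16 bytes terminated by a 00 00 pair on a code unit boundary.

```c
#include <stddef.h>
#include <string.h>

/* Length as a byte-oriented routine sees it: strlen stops at the first 00 byte. */
size_t naive_byte_length(const unsigned char *bytes)
{
    return strlen((const char *)bytes);
}

/* Length in 16-bit code units: stops only at a 00 00 pair on a unit boundary. */
size_t utf16_unit_length(const unsigned char *bytes)
{
    size_t n = 0;
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0)
        ++n;
    return n;
}
```

For "Hi" stored as little-endian UTF-16 (48 00 69 00 00 00), the byte-oriented routine reports a length of 1 while the code-unit routine reports 2.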

  
 Unicode: Information From Answers.com
UTF-8, which consists of one-, two-, three-, and four-byte codes, is used extensively in World Wide Web applications; UTF-16, which consists of two- and four-byte codes, is used primarily for data storage and text processing; and UTF-32, which consists of four-byte codes, is used where character handling must be as efficient as possible.
When combined with the byte order of the hardware (BE or LE), they are known officially as "character encoding schemes." They are also known by their UTF acronyms, which stand for "Unicode Transformation Format" or "Universal Character Set Transformation Format." See byte order.
The numbers indicate the number of bits in one unit, for UTF encodings, or bytes, for UCS encodings.
http://www.answers.com/unicode   (3393 words)
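
The code sizes listed in this entry can be written down as a short sketch. The function names are invented for illustration, and each function assumes its argument is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for one code point in UTF-8: one to four. */
size_t utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed in UTF-16: two inside the BMP, four (a surrogate pair) beyond it. */
size_t utf16_len(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Bytes needed in UTF-32: always four. */
size_t utf32_len(uint32_t cp)
{
    (void)cp;
    return 4;
}
```

For example, the euro sign U+20AC takes three bytes in UTF-8, two in UTF-16 and four in UTF-32.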

  
 UTN #12: UTF-16 for Processing
Another potential problem is that while conversion between UTFs is lossless, conversion between 8/16/32-bit Unicode strings which are not well-formed UTF-8/16/32 strings is not defined.
Conversion among UTFs is fast and reliable, but still takes some time and code.
Conversion also needs to extend beyond the string representation itself to string indexes, offsets and lengths, which can be visible across a protocol (e.g., SQL) or a software boundary (e.g., Java/JNI).
http://www.unicode.org/notes/tn12   (1711 words)
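
To see why lengths and indexes need converting along with the string itself, here is a sketch that counts code points in a UTF-16 buffer; the result differs from the code unit count as soon as a surrogate pair appears. The function name is hypothetical, and an unpaired surrogate is simply counted as one code point here, which is a simplification.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts code points in `units` 16-bit code units of UTF-16.
   A high surrogate followed by a low surrogate counts once. */
size_t utf16_code_points(const uint16_t *s, size_t units)
{
    size_t count = 0;
    for (size_t i = 0; i < units; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < units && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            ++i;  /* skip the low half of the pair */
        ++count;
    }
    return count;
}
```

The four code units 0041 D834 DD1E 0042 hold three code points, so an offset of 3 means different things depending on whether it counts code units or code points.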

  
 Production First Software Encyclopedia of Typography and Electronic Communication : U
The use of UTF-32 instead of UCS-4 encoding applies three major restrictions:
An encoding transformation form which conforms to Unicode character semantics, extended with surrogate code points, so as to be able to reference the first group of 17 planes (planes 0 through 16) of ISO/IEC 10646.
An encoding transformation form which conforms to Unicode character semantics, able to reference the first group of 17 planes (planes 0 through 16) of ISO/IEC 10646 directly using 32-bit code points instead of surrogate code points.
http://ourworld.compuserve.com/homepages/profirst/u.htm   (2351 words)

  
 UTF-8 and Unicode FAQ
Please do not write UTF-8 in any documentation text in other ways (such as utf8 or UTF_8), unless of course you refer to a variable name and not the encoding itself.
The official name and spelling of this encoding is UTF-8, where UTF stands for UCS Transformation Format.
There is an old UTF locale, but it is incomplete and uses the now obsolete
http://www.cl.cam.ac.uk/~mgk25/unicode.html   (14421 words)

  
 RLG DigiNews Volume , Number 2
the algorithm (or logical description of the process) used to convert 16- and 32-bit code values to a sequence of one or more 8-bit values.
Computer circuitry operates most efficiently when processing bytes that are 8, 16, 32 or 64 bits wide.
A character encoding scheme also controls the order of the 8-bit sequences—important because computers may treat 16-bit numbers as pairs of 8-bit numbers, transmitting data one byte at a time (sometimes the lower half first, sometimes the higher half first).
http://www.rlg.org/en/newsletters/rlgdiginews_extras/v8_n2_glossary.html   (2697 words)
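
The byte order issue can be sketched as follows: a reader first looks for a byte order mark (FE FF for big-endian, FF FE for little-endian) and then assembles each 16-bit value accordingly. The type and function names are assumptions made for this example. Note that FF FE is also the start of the UTF-32 little-endian mark, which this sketch does not try to distinguish.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { ORDER_UNKNOWN, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN } byte_order;

/* Inspects the first two bytes for a UTF-16 byte order mark. */
byte_order detect_utf16_bom(const unsigned char *b, size_t len)
{
    if (len >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF) return ORDER_BIG_ENDIAN;
        if (b[0] == 0xFF && b[1] == 0xFE) return ORDER_LITTLE_ENDIAN;
    }
    return ORDER_UNKNOWN;
}

/* Assembles one 16-bit value from two bytes; any order other than
   little-endian is read as big-endian. */
uint16_t read_unit(const unsigned char *b, byte_order order)
{
    if (order == ORDER_LITTLE_ENDIAN)
        return (uint16_t)(b[0] | (b[1] << 8));
    return (uint16_t)((b[0] << 8) | b[1]);
}
```

The same letter A arrives as 00 41 in a big-endian stream and as 41 00 in a little-endian one; both read back as 0x0041 once the order is known.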

  
 Unicode Data Transfer Formats
A UTF-16 mapping takes valid Unicode code point values and translates them into one or two 16 bit values.
Each 16 bit value is encoded as a pair of octets.
An encoder will simply write the 16 bit values in sequential order, and a decoder will read the 16 bit values one at a time and try to fit them to a reverse mapping.
http://www.azillionmonkeys.com/qed/unicode.html   (1887 words)
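
The mapping from a code point to one or two 16-bit values can be sketched like this. The function name `utf16_encode` and its return convention are assumptions for the example; the arithmetic is the standard surrogate pair calculation.

```c
#include <stdint.h>

/* Encodes one code point into `out`.
   Returns the number of 16-bit values written (1 or 2),
   or 0 if `cp` is a surrogate or lies beyond 0x10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                     /* BMP: value is stored as-is */
        return 1;
    }
    cp -= 0x10000;                                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));      /* top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));    /* bottom 10 bits */
    return 2;
}
```

U+0041 comes out as the single value 0041, while U+1D11E comes out as the pair D834 DD1E.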

  
 jGuru: which utf encoding provide support for multiple language?
Kindly assist me in telling which 'utf encoding' would be better (utf-16 or utf-8 or any other)?
If by 'support' you mean integration with other systems with different character sets, then UTF-16 is the way to go.
These are the following observations I found after struggling on net for utf-16 & utf-8...
http://www.jguru.com/forums/view.jsp?EID=1227908   (580 words)

  
 RFC 2044 (rfc2044)
UTF-8, the object of this memo, has the characteristic of preserving the full US-ASCII range: US-ASCII characters are encoded in one octet having the usual US-ASCII value, and any octet with such a value can only be an US-ASCII character.
Abstract: The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993 jointly define a 16 bit character set which encompasses most of the world's writing systems.
This situation has led to the development of so-called UCS transformation formats (UTF), each with different characteristics.
http://www.cse.ohio-state.edu/cgi-bin/rfc/rfc2044.html   (1426 words)

  
 10065
7) I've converted some large Java text processing apps to C++, and converted the Java 16 bit char's to using UTF-8.
5) 16 bit accesses on Intel CPUs can be pretty slow compared to byte or dword accesses (varies by CPU type).
http://www.digitalmars.com/drn-bin/wwwnews?D/10065   (1083 words)

  
 RFC 2781 - UTF-16, an encoding of ISO 10646. P. Hoffman, F. Yergeau.
The term "network byte order" has been used in many RFCs to indicate big-endian serialization, although that term has yet to be formally defined in a standards-track document.
The following C code fragment demonstrates a way to write 16- bit quantities to a file in big-endian order, irrespective of the hardware's native byte order.
    #include <stdio.h>

    /* Write a 16-bit quantity in big-endian order, whatever the native byte order. */
    void write_be(unsigned short u, FILE *f) /* assume short is 16 bits */
    {
        putc(u >> 8, f);   /* output high-order byte */
        putc(u & 0xFF, f); /* then low-order */
    }
http://rfc.sunsite.dk/rfc/rfc2781.html   (3669 words)

  
 ApacheCon 2002, Las Vegas, NV: XML and I18N by Sander van Zoest
These three forms, formally known as UTF-8, UTF-16 and UTF-32, provide developers with three ways to use Unicode.
The Consortium has defined three encoding forms (mappings from a character set definition to the actual code units used to represent the data) that allow the data to be transmitted in 8, 16 and 32-bits.
Unicode Frequently Asked Questions: UTF and BOM.
http://sander.vanzoest.com/talks/2002/xml_and_i18n   (2204 words)

  
 Extended UCS-2 Encoding Form (UTF-16)
As is clear by the above example, UTF-16 is essentially a variable length encoding technique that supports the characters in the BMP and the 16 planes immediately following the BMP.
These codes are then mapped from/onto the 16 planes (01-10 in hexadecimal, that is, planes 1 through 16) of group 0.
The introduction of a variable length encoding brings up a number of important issues for implementations which are similar to the issues encountered in common double byte character systems, in which single byte and double byte characters are mixed together.
http://www.terena.nl/library/multiling/unicode/utf16.html   (905 words)

  
 Unicode - definition of Unicode in Encyclopedia
The numbers indicate the number of bits in one unit, for UTF encodings, or bytes, for UCS encodings.
Several mechanisms have therefore been suggested to implement Unicode; which one is chosen depends on available storage space, source code compatibility, and interoperability with other systems.
The strongest denunciation of Unicode is at [1] (http://www.hastingsresearch.com/net/04-unicode-limitations.shtml) (also see a response, [2] (http://slashdot.org/features/01/06/06/0132203.shtml)). For example, opponents of Unicode sometimes claim even now that it cannot handle more than 65,535 characters, a limitation that was removed in Unicode 2.0.
http://encyclopedia.laborlawtalk.com/Unicode   (2611 words)

  
 Glossary
The Unicode encoding form which assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and which assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair, according to Table 3-4, UTF-16 Bit Distribution.
Planes are numbered from 0 to 16, with the number being the first code point of the plane divided by 65,536.
(3) “Transformation format for 16 planes of Group 00,” defined in Annex C of ISO/IEC 10646:2003, technically equivalent to the definitions in the Unicode Standard.
http://www.unicode.org/glossary   (7489 words)
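
The plane numbering rule in this entry is a single division, sketched here with hypothetical function names. The input is assumed to be a code point no larger than 0x10FFFF.

```c
#include <stdint.h>

/* Plane number of a code point: the code point divided by 65,536. */
unsigned plane_of(uint32_t cp)
{
    return (unsigned)(cp / 65536u);
}

/* First code point of a plane: the plane number times 65,536. */
uint32_t plane_start(unsigned plane)
{
    return (uint32_t)plane * 65536u;
}
```

Every code point up to U+FFFF falls in plane 0 (the BMP), U+1D11E falls in plane 1, and U+10FFFF, the last code point, falls in plane 16.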

  
 UCS Transformation Format 16 (UTF-16)
When the escape sequences from ISO 2022 are used, the identification of the return from UTF-16 to the coding system of ISO 2022 shall be by the escape sequence ESC 02/05 04/00.
A UCS Transformation Format (UTF-16) is specified in Annex O which can be used to represent characters from 16 planes, additional to the BMP, in a form that is compatible with the two-octet BMP form.
In addition, the coded representation of any character from a single contiguous block of 16 Planes in Group 00 (1,048,576 code positions) is transformed to pairs of two-octet sequences, where each sequence corresponds to a cell in a single contiguous block of 8 Rows in the BMP (2,048 code positions).
http://www.uazone.org/multiling/unicode/wg2n1035.html   (1901 words)
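
The figures in this entry fit together as follows: 8 rows of 256 cells give 2,048 code positions, split into two halves of 1,024 each, and 1,024 x 1,024 pairs give the 1,048,576 positions of the 16 additional planes. A sketch of the reverse transformation, from a pair back to a code position, is below. The function name and the use of UINT32_MAX as an error value are assumptions for this example.

```c
#include <stdint.h>

/* Combines a high and a low surrogate into a code position in planes 1..16.
   Returns UINT32_MAX if either value is outside its surrogate range. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return UINT32_MAX;
    return 0x10000u
         + (((uint32_t)high - 0xD800u) << 10)  /* 1,024 possible high halves */
         + ((uint32_t)low - 0xDC00u);          /* 1,024 possible low halves */
}
```

The smallest pair, D800 DC00, maps to U+10000 and the largest, DBFF DFFF, maps to U+10FFFF.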

  
 Re: UTF-16 inside UTF-8
'cmap' http://www.microsoft.com/typography/otspec/cmap.htm Yes, as stated before, if your previous idea of a UniChar was 16 bits, you have some work to do.
That is just a specification change; it does not include code changes or API changes.
I thought we were talking about apps like MySQL to which Unicode support was being added for the first time.
http://www.mail-archive.com/unicode@unicode.org/msg19817.html   (1096 words)

  
 Tucu's Weblog
UCS-4 or EBCDIC encoding families) the algorithm should be extended here.
#1.3, #1.4, #1.5, #1.6 There is an explicit encoding mismatch.
Given the currently assumed BOMEnc values this case cannot happen.
http://blogs.sun.com/roller/page/tucu/20040917   (1409 words)

  
 The skew.org XML Tutorial
UTF-16 may be more straightforward to implement, but it is difficult to compose UTF-16 encoded documents with most text editing software, and it is wasteful to use 2 bytes per character when most characters fall in a very small range.
Unicode values are the code value sequences produced by the UTF-16 encoding form.
ISO 639 has been updated a number of times since 1988 and is now in 2 parts, ISO 639-1 for the 2-letter codes and ISO 639-2 for 3-letter codes.
http://skew.org/xml/tutorial   (8463 words)

  
 netandmore.de - das internetforum - Internet Glossar U
Unlike ANSI and ASCII, Unicode uses not just 8 bits but 16 bits (= 2 bytes) per character, so that not only 2^8 (= 256) but 2^16 (= 65,536) different characters can be represented.
This makes it possible to accommodate almost all important characters from almost all important languages.
http://netandmore.de/glossar/glo_u.htm   (1123 words)

  
 Supplementary Characters in the Java Platform
Where in the past we could simply talk about "characters" and, in a Unicode based environment such as the Java platform, assume that a character has 16 bits, we now need more terminology.
We'll try to keep it relatively simple -- for a full-blown discussion with all details you can read Chapter 2 of The Unicode Standard or Unicode Technical Report 17 "Character Encoding Model." Unicode experts may skip all but the last definition in this section.
The introduction of supplementary characters unfortunately makes the character model quite a bit more complicated.
http://java.sun.com/developer/technicalArticles/Intl/Supplementary   (4411 words)

  
 Unicode Supplementary Characters (Surrogate code points) support in Microsoft Windows NT, 2000, XP
Convertors between scalar values and surrogate code points, or between UTFs:
The Unicode Consortium also has a Frequently Asked Questions (FAQ) page on UTF and BOM which discusses surrogates.
For definitions of Unicode-related terms, refer to the Unicode Glossary.
http://www.i18nguy.com/surrogates.html   (979 words)


 Copyright © 2006 CompWisdom.com Usage implies agreement with terms.