|
| |
| | Endianness - Wikipedia, the free encyclopedia |
 | | This byte order is used for all numeric values in the packet headers and by many higher level protocols and file formats that are designed for use over IP. |  | | Generally the byte (octet) is considered an atomic unit from the point of view of storage at all but the lowest levels of network protocols and storage formats. |  | | While variable-width text encodings using the byte as their base unit could be considered to have an inbuilt endianness this is (at least in all commonly used ones) fixed by the encoding's design. |
|
http://en.wikipedia.org/wiki/Endian
(2097 words)
|
|
| |
| | FAQ - UTF-8, UTF-16, UTF-32 & BOM |
 | | When data are exchanged in the same byte order as they were in the memory of the originating system, they may appear to be in the wrong byte order on the receiving system. |  | | The BE form uses big-endian byte serialization (most significant byte first), the LE form uses little-endian byte serialization (least significant byte first) and the unmarked form uses big-endian byte serialization by default, but may include a byte order mark at the beginning to indicate the actual byte serialization used. |  | | In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in. |
|
http://www.unicode.org/unicode/faq/utf_bom.html
(4895 words)
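The BE, LE, and unmarked forms described in this entry can be told apart by inspecting the first bytes of a stream. The following is a minimal Python sketch, not code from the linked FAQ: the function name `sniff_bom` and the returned encoding labels are assumptions chosen for illustration, while the `codecs.BOM_*` constants come from the standard library.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no mark is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: decode the payload that follows the mark.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)
```

A stream that begins with FF FE 00 00 is ambiguous in principle (it could be UTF-16 LE text starting with U+0000); this sketch resolves it in favour of UTF-32 LE.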
|
|
| |
| | Unicode - encyclopedia article about Unicode. |
 | | Combining marks, like the complex script-shaping required to properly render Arabic text and many other scripts, usually depend on complex font technologies, like OpenType (by Adobe and Microsoft), Graphite (by SIL International), and AAT (by Apple), by which a font designer includes instructions in a font, telling software how to properly output different character sequences. |  | | The UCS-2 and UTF-16 encodings specify the Unicode byte order mark (BOM) for use at the beginnings of text files. |  | | The numbers in the names of the encodings indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). |
|
http://encyclopedia.thefreedictionary.com/Unicode
(5193 words)
|
|
| |
| | W3C I18N FAQ: Unexpected characters or blank lines |
 | | Each character in the file is represented by 2 or 4 bytes of data and the order in which these bytes are stored in the file is significant; the BOM indicates this order. |  | | In the UTF-8 encoding, the presence of the BOM is not essential because, unlike the UTF-16 or UTF-32 encodings, there is no alternative sequence of bytes in a character. |  | | Some applications insert a particular combination of bytes at the beginning of a file to indicate that the text contained in the file is Unicode. |
|
http://www.w3.org/International/questions/qa-utf8-bom.html
(767 words)
|
|
| |
| | System.Text.UnicodeEncoding Class |
 | | Returns the bytes used at the beginning of a Stream instance to determine which Encoding implementation the stream was created with. |  | | This Encoding implementation can detect a byte order mark automatically and switch byte orders, based on a parameter specified in the constructor. |  | | A Byte array that identifies the Encoding implementation used to create a Stream. |
|
http://www.gnu.org/software/dotgnu/pnetlib-doc/System/Text/UnicodeEncoding.html
(1389 words)
|
|
| |
| | opentag.com - XML FAQ: Encoding |
 | | Little-endian byte order (most significant byte is stored last) is used by processors such as Intel or Vax. |  | | Byte order is important only for encodings using units greater than 8-bits (i.e. |  | | network byte order) is used by processors such as Motorola or RISC (most significant byte is stored first). |
|
http://www.opentag.com/xfaq_enc.htm
(998 words)
|
|
| |
| | Byte-order Mark |
 | | A byte-order mark is not a control character that selects the byte order of the text; it simply informs an application receiving the file that the file is byte ordered. |  | | The preferred place to specify byte order is in a file header, but text files do not have headers. |  | | With only a single set of byte-ordering rules, users of one type of microprocessor would be forced to swap the byte order every time a plain text file is read from or written to, even if the file is never transferred to another system based on a different microprocessor. |
|
http://msdn.microsoft.com/library/en-us/intl/unicode_42jv.asp?frame=true
(550 words)
|
|
| |
| | [No title] |
 | | The use of a mark at the beginning of a file, which contains plain text, to identify the coding format of the characters, is commonly referred to as the Byte Order Mark or BOM for short. |  | | The extension can be by using the column 0 and 1, which redefines the Intermediate Bytes: * I: Any bytes between the ESCAPE and the Final Byte are known as Intermediate Bytes, and are from column 00 to 02 of the code table. |  | | This apart from ISO/IEC 2022 rules, but without conflict, the I bytes having to be from column 02 of the code table, and Esc is from column 1. |
|
http://ietfreport.isoc.org/old-ids/draft-tremblay-bom-00.txt
(704 words)
|
|
| |
| | SP - Character sets |
 | | The bytes representing each character are in the system byte order, unless the byte order mark character is present, in which case the order of its bytes determines the byte order. |  | | The bytes representing the entire storage object may be preceded by a pair of bytes representing the byte order mark character (0xFEFF). |  | | A bit combination with the 0x8000 and 0x80 bits set is encoded by the sequence of bytes with which the SJIS encoding encodes the character whose number in JIS X 0208 added to 0x8080 is equal to the bit combination. |
|
http://www.cs.indiana.edu/l/www/hyplan/asengupt/sgml/jade-1.2.1/doc/charset.htm
(1176 words)
|
|
| |
| | Unicode in XML and other Markup Languages |
 | | When used as a byte order mark the character is placed at the beginning of a file. |  | | Problems with other uses: The use of byte order mark as ZWNBSP is also problematic when used in plain text, and has been deprecated for that purpose in favor of U+2060 word joiner. |  | | Except for Line and Paragraph Separator, or the Byte Order Mark, it is acceptable for browsers and similar user agents to ignore the presence of discouraged characters in HTML or XML. |
|
http://www.w3.org/TR/unicode-xml
(6853 words)
|
|
| |
| | Tucu's Weblog |
 | | But if XMLEnc was read it means that the encoding byte order of the stream was guessed from the first bytes in the stream, note that this is possible only if the document starts with an XML declaration. |  | | They are still marked as Alpha but we consider they are already stable for some serious use, we just want to do some sanity check (mostly classes, interfaces, methods and packages names) before we go with a Beta release (which we hope will be the next one). |  | | There is no BOM, an encoding or encoding family cannot be guessed from the first bytes in the stream, defaulting to UTF-8. |
|
http://blogs.sun.com/roller/page/tucu/20040927
(1480 words)
|
|
| |
| | FIX: UNICODE Byte Order Marks Ignored by Internet Explorer 4.0x |
 | | If the byte sequence FF FE is found at the beginning of a file it indicates that the remaining bytes are not normalized and should be byte swapped before use. |  | | In other words, the Byte Order Mark is UNICODE FE FF, but since Little Endian machines automatically swap their bytes, a binary dump of the mark would be FF FE. |  | | In Little Endian, the Byte Order Mark is swapped like all characters so a binary dump of the Byte Order Mark would actually display as FF FE. |
|
http://support.microsoft.com/support/kb/articles/q190/8/37.asp
(681 words)
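The FE FF versus FF FE relationship described in this entry can be checked directly. This is a small Python sketch using only the standard codecs; the variable names are illustrative assumptions, not taken from the linked article.

```python
BOM = "\ufeff"  # the byte order mark character, U+FEFF

big_endian_bytes = BOM.encode("utf-16-be")
little_endian_bytes = BOM.encode("utf-16-le")

print(big_endian_bytes.hex(" "))     # fe ff
print(little_endian_bytes.hex(" "))  # ff fe

# Reading the little-endian bytes with the wrong byte order yields U+FFFE,
# a noncharacter, which is how a reader can tell the order is swapped.
misread = little_endian_bytes.decode("utf-16-be")
print(f"U+{ord(misread):04X}")       # U+FFFE
```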
|
|
| |
| | Byte Order Mark |
 | | For this reason, the Unicode standard specifies that a file may begin with a BOM, a sequence of reserved bytes that indicate byte order as well as the type of UTF encoding. |  | | It does not indicate byte order; it just serves to indicate that the encoding is UTF-8 rather than something else. |  | | Besides UTF-8, there are other UTFs (Unicode Transformation Format) and in some of them bytes are interpreted differently depending on whether the machine is "big-endian" (Sun, Apple) or "little-endian" (Windows and most Linux machines). |
|
http://www.stanford.edu/~laurik/fsmbook/errata/BOM.html
(284 words)
|
|
| |
| | System.IO.StreamReader |
 | | The System.IO.StreamReader class is designed for character input in a particular System.Text.Encoding, whereas subclasses of System.IO.Stream are designed for byte input and output. |  | | Constructs and initializes a new instance of the System.IO.StreamReader class for the specified file name, with the specified character encoding and byte order mark detection option. |  | | Constructs and initializes a new instance of the System.IO.StreamReader class for the specified stream, with the specified character encoding and byte order mark detection option. |
|
http://taubz.for.net/code/monodocs/corlib/System.IO/StreamReader.html
(2635 words)
|
|
| |
| | Progress 4GL and the Unicode Byte Order Mark (BOM) |
 | | When generating files, if the encoding is UTF-8, there is no need to generate a BOM, unless you are exporting to an application that expects a BOM as a file signature indicating the file is encoded in UTF-8 instead of another code page. |  | | For plain text, which has no protocol or structure, it was considered that the BOM could be the first character in the file. |  | | There is some value in the BOM being used as a file signature, indicating the plain text file is encoded as Unicode UTF-8, as opposed to some other code page. |
|
http://www.xencraft.com/resources/unicodebom.html
(1136 words)
|
|
| |
| | The skew.org XML Tutorial |
 | | In computing and telecommunications, dividing the basic marks of a writing system into graphemes is helpful, but is not sufficient, on its own, to reproduce written text, since there is more to writing than just spewing a stream of graphemes. |  | | XML documents, in order to be stored or transmitted, must manifest in an encoded form as bits and bytes, using a consistent character encoding mechanism such as UTF-16 or UTF-8. |  | | An algorithm for converting code values to a sequence of 8-bit values (bytes or octets) for cross-platform data exchange is a character encoding scheme. |
|
http://skew.org/xml/tutorial
(8463 words)
|
|
| |
| | Charset (Java 2 Platform SE v1.4.2) |
 | | In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character |  | | charsets use sixteen-bit quantities and are therefore sensitive to byte order. |  | | A character-encoding scheme is a mapping between a coded character set and a set of octet (eight-bit byte) sequences. |
|
http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
(2045 words)
|
|
| |
| | Production First Software Encyclopedia of Typography and Electronic Communication : B |
 | | byte A unit of computer data storage which can be thought of as consisting of one character. |  | | byte order mark A zero-width no-break 16-bit character (at <feff> for UTF-16 encodings) or 32-bit character (at <0000feff> for UCS-4 or UTF-32 encodings) which can be used to detect byte order by comparison to the 16-bit code point <fffe> or 32-bit code point <0000fffe>. |  | | byte polarity or byte pair polarity Another term for byte order, describing the physical sequence of low order to high order bytes in a data file. |
|
http://ourworld.compuserve.com/homepages/profirst/b.htm
(3440 words)
|
|
| |
| | Xin: utils.hpp File Reference |
 | | Read the first few bytes of a file (the byte-order mark) to determine the unicode encoding type. |  | | Insert the byte order mark into a file. |  | | Set the file pointer beyond the byte order mark. |
|
http://xined.sourceforge.net/resource/doxygen/utils_8hpp.html
(181 words)
|
|
| |
| | Universal Feed Parser 3.2 [dive into mark] |
 | | The heuristic is actually divided into two parts, because all XML documents are allowed to start with something called a Byte Order Mark (BOM), which is a specific Unicode character (U+FEFF) that looks different depending on the encoding and the byte order used in the document. |  | | (All the non-ASCII characters are encoded in the upper 128 characters of a byte, or in multi-byte sequences.) However, this assumption fails for multi-byte encodings, such as UTF-16 and UTF-32. |  | | Have you ever wanted to parse an ill-formed CDF feed encoded as UTF-32 Little Endian with a Byte Order Mark? |
|
http://diveintomark.org/archives/2004/07/03/feed-parser-32
(534 words)
|
|
| |
| | Re: Byte Order Mark mucks up headers |
 | | Suggested resolution (was Re: Byte Order Mark mucks up headers) |  | | If the character coding for a website has a byte >> order mark (things like utf-16, all that "big endian/little endian" >> stuff) then LWP can't interpret HTML headers in the usual way. |  | | Programming > Perl Libwww > Re: Byte Order... |
|
http://www.talkaboutprogramming.com/group/perl.libwww/messages/1685.html
(194 words)
|
|
| |
| | [jdom-interest] Parsing files starting with UTF-8 Byte Order Mark |
 | | Hi Peter, The UTF-8 byte order mark is supposedly optional, but unfortunately there is a known bug in Sun JVMs which means they do not ignore it; so if it's present, you'll see it in your input stream (Sun JVM bug #4508058, http://developer.java.sun.com/developer/bugParade/bugs/4508058.html). |  | | Is there some way that I can load files > with or without this Byte Order Mark transparently, i.e. |  | | The typical workaround is to do the check yourself when reading the input stream, for example: InputStream in =... |
|
http://www.jdom.org/pipermail/jdom-interest/2003-July/012455.html
(280 words)
|
|
| |
| | B-Index (Eclipse Platform API Specification) |
 | | Bit mask used to indicate that this byte of memory is big endian. |  | | Moves the given part forward in the Z order of this page so as to make it visible, without changing which part has focus. |  | | to a platform specific representation of the byte array and vice versa. |
|
http://help.eclipse.org/help31/topic/org.eclipse.platform.doc.isv/reference/api/index-files/index-2.html
(1980 words)
|
|
| |
| | External Tables Concepts |
 | | With external table loads, the byte-order mark is not written at the beginning of the bad and discard files. |  | | Suppression of byte-order mark checking is only necessary if the beginning of the datafile contains binary data that matches the byte-order mark encoding. |  | | This means that a row that is rejected because a column in the row causes a datatype conversion error will not get rejected in a different query if the query does not reference that column. |
|
http://www.stanford.edu/dept/itss/docs/oracle/9i/server.920/a96652/ch11.htm
(1965 words)
|
|
| |
| | [No title] |
 | | This class may be used directly, in which case it * expects the input byte array to begin with a byte-order mark, or it may be * subclassed in order to preset the byte order and mark behavior. |  | | Whether or not a mark is expected, if a mark that does not match the * established byte order is later discovered then a * |  | | * */ package sun.io; import java.io.*; /** * Convert byte arrays containing Unicode characters into arrays of actual * Unicode characters. |
|
http://www.cs.duke.edu/csed/java/src1.3/sun/io/ByteToCharUnicode.java
(225 words)
|
|
| |
| | Character Encoding Detection [Universal Feed Parser] |
 | | Section F of the XML specification outlines the process for determining the character encoding based on unique properties of the Byte Order Mark in the first two to four bytes of the document. |  | | If no encoding is given, XML supports the use of a Byte Order Mark to identify the document as some flavor of UTF-32, UTF-16, or UTF-8. |  | | the encoding sniffed from the first four bytes of the document (as per Section F) |
|
http://feedparser.org/docs/character-encoding.html
(448 words)
|
|
| |
| | [No title] |
 | | out.txt # # Converts standard input which was redirected from the in.txt UTF-8 # encoded file, in any Unicode normal form, into Unicode normal form D # without a BOM (Byte Order Mark). |  | | out.txt # # Converts standard input which was redirected from the in.txt UTF-8 # encoded file, in any Unicode normal form, into Unicode normal form C # without a BOM (Byte Order Mark). |  | | For example if you # specify -NFC followed by -NFD on the command line, then -NFD # will be used since it occurred last. |
|
http://staff.oclc.org/~houghtoa/repository/perl/utf-nf.pl
(696 words)
|
|
| |
| | WebReference.com - Chapter 3 from Perl & XML, from O'Reilly and Associates (12/12) |
 | | Not knowing which end of a byte carries the significant bit will make reading these documents similar to reading them in a mirror, rendering their content into a garble that your programs will not appreciate. |  | | The UTF-8 encoding doesn't have to worry about any of this endianness business since all its characters are made of strung-together byte sequences that are always read from first to last instead of little boxes holding byte pairs whose order may be questionable. |  | | If, for some reason, you have an XML document from an unknown source and have no idea what its encoding might be, it may behoove you to check for the presence of a byte order mark (BOM) at the start of the document. |
|
http://www.webreference.com/programming/perl/perlxml/chap3/12.html
(994 words)
|
|
| |
| | The recode reference manual |
 | | always outputs the high order byte before the low order byte. |  | | is usable for the subset defined by its first sixty thousand characters (in fact, 31 * 2^11 codes), and uses exactly two bytes per character. |  | | file normally begins with a so called byte order mark, having value |
|
http://www.delorie.com/gnu/docs/recode/recode_23.html
(289 words)
|
|
| |
| | BOM : Java Glossary |
 | | Byte Order Marks are special characters at the beginning of a Unicode file to indicate whether it is big or little endian, in other words does the high or low order byte come first. |  | | You can recognise Unicode files by their starting byte order marks, and by the way Unicode-16 files are half zeroes and Unicode-32 files are three-quarters zeros. |  | | These codes also tell whether the encoding is 8, 16 or 32 bit. |
|
http://mindprod.com/jgloss/bom.html
(239 words)
|
|
| |
| | Unicode for Programmers |
 | | UTF-16 byte-order mark - Since UTF-16 is a 16-bit encoding, but most current filesystems and networking protocols are 8-bit, the implementer is left with the choice of whether to send each 16-bit value with the high byte or the low byte first. |  | | Sorting and searching Unicode data - Sorting ASCII data is easy: the whole alphabet is in byte order. |  | | ; or high byte first) and UTF-16 LE ( |
|
http://www.jorendorff.com/articles/unicode/next.html
(409 words)
|
|
| |
| | BOM characters in 'utxt' clipboard flavor |
 | | A: Cocoa has always put a Byte Order Mark (BOM) character (for indicating endianness) in Unicode text on the clipboard, and as of 10.2, it does so for the 'utxt' scrap flavor as well, as documented by Inside Mac:Text Encoding Conversion Manager. |  | | , it should byte swap the data before using it. |  | | Code that reads the 'utxt' data should look for and strip the BOM character (instead of simply displaying it), and if it is |
|
http://developer.apple.com/qa/qa2001/qa1221.html
(148 words)
|
|
| |
| | Internationalization Encyclopedia: : byte order mark |
 | | Name given to the Unicode character U+FEFF when used at the beginning of a Unicode byte stream. |  | | character generally known as ZERO WIDTH NO-BREAK SPACE (ZWNBSP) serves to identify unambiguously the Unicode transformation form used (and especially the byte order) for the stream. |  | | EF BB BF Historically the ZWNBSP was also used to indicate non-breaking but this use is now deprecated and replaced by the character U+2060 for that purpose. |
|
http://www.i18ngurus.com/encyclopedia/byte_order_mark.html
(129 words)
|
|
| |
| | [No title] |
 | | **/ private void put (ByteBuffer out, char c) { if (byteOrder == BIG_ENDIAN) { out.put ((byte) (c >> 8)); out.put ((byte) (c & 0xFF)); } else { out.put ((byte) (c & 0xFF)); out.put ((byte) (c >> 8)); } } protected void implReset () { byteOrder = originalByteOrder; } } |
|
http://www.ualberta.ca/dept/chemeng/users/barton/gcc/gcc-src/gcc-3.4.0/libjava/gnu/java/nio/charset/UTF_16Decoder.java
(327 words)
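The Java `put` method quoted in this entry writes a 16-bit value high byte first or low byte first depending on the configured byte order. A rough Python equivalent, offered only as a sketch: the name `put_code_unit` and its `big_endian` flag are hypothetical and do not appear in the linked source.

```python
def put_code_unit(out: bytearray, unit: int, big_endian: bool) -> None:
    """Append one 16-bit code unit to `out` in the requested byte order."""
    high = (unit >> 8) & 0xFF
    low = unit & 0xFF
    if big_endian:
        out.extend((high, low))   # most significant byte first
    else:
        out.extend((low, high))   # least significant byte first


# Usage: serialize U+FEFF followed by "A" in big-endian order.
buf = bytearray()
for unit in (0xFEFF, 0x0041):
    put_code_unit(buf, unit, big_endian=True)
```

For characters in the Basic Multilingual Plane this matches what Python's built-in `utf-16-be` and `utf-16-le` codecs produce.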
|
|
| |
| | Re: [xml] UTF-16 byte order mark |
 | | On Tue, Oct 28, 2003 at 05:24:19PM +0100, Kasimier Buchcik wrote: > Hi, > > is it possible to use libxml2's UTF-16 conversion functions and > serialization mechanism without a byte order mark? |  | | am I able to > define somewhere if byte order marks should be expected and if byte > order marks should be returned? |
|
http://mail.gnome.org/archives/xml/2003-October/msg00329.html
(106 words)
|
|
| |
| | [Python-Dev] Unicode byte order mark decoding |
 | | The Unicode standard is unclear about how it should be handled (version 4, section 15.9): > Although there are never any questions of byte order with UTF-8 text, > this sequence can serve as signature for UTF-8 encoded text where the > character set is unmarked. |  | | This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. |  | | [...] Systems that use the byte order mark > must recognize when an initial U+FEFF signals the byte order. |
|
http://mail.python.org/pipermail/python-dev/2005-April/052501.html
(476 words)
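The manual work mentioned in this thread is stripping a leading UTF-8 signature by hand. Current Python 3 ships a `utf-8-sig` codec that handles it; the sketch below contrasts it with the plain `utf-8` codec. The sample bytes are made up for illustration.

```python
import codecs

# A UTF-8 file as often written on Windows: signature EF BB BF, then the text.
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

plain = raw.decode("utf-8")        # keeps U+FEFF as the first character
stripped = raw.decode("utf-8-sig")  # drops a leading signature if present

# utf-8-sig also accepts input that has no signature at all.
no_bom = b"hello".decode("utf-8-sig")

# When encoding, utf-8-sig prepends the signature.
written = "hello".encode("utf-8-sig")
```

The same codec name can be passed as the `encoding` argument of `open()` when reading files that may or may not start with the signature.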
|
|
| |
| | byte_order_mark - OneLook Dictionary Search |
 | | We found 2 dictionaries with English definitions that include the word byte order mark: |  | | Tip: Click on the first link on a line below to go directly to a page where "byte order mark" is defined. |  | | Byte Order Mark : Unicode Glossary [home, info] |
|
http://www.onelook.com/?w=byte_order_mark&loc=resrd
(82 words)
|
|
| |
| | BOM - TheBestLinks.com - Acronym, Abbreviation, Byte Order Mark, Disambig, ... |
 | | BOM - TheBestLinks.com - Acronym, Abbreviation, Byte Order Mark, Disambig,... |  | | BOM might be an acronym or abbreviation for: |  | | BOM, Acronym, Abbreviation, Byte Order Mark, Disambig, Bill of materials |
|
http://www.thebestlinks.com/BOM.html
(117 words)
|
|
| |
| | Byte Order Mark (BOM) Crimson Problem |
 | | The problem is that the first char in the string is like ''. |  | | I assume this first char is the BOM character, that is |  | | Byte Order Mark (BOM) Crimson Problem Pedro Sousa |
|
http://www.mail-archive.com/general@xml.apache.org/msg02059.html
(124 words)
|
|
| |
| | UTF to UTF conversion |
 | | This is really just a special case of Character Set Conversion. |  | | This page lets you convert between various flavors of UTF, including UTF-8, UTF-16, UTF-16 with byte-order marks and entity-encoded UTF-8. |
|
http://www.fileformat.info/convert/text/utf2utf.htm
(31 words)
|
|
|