HTML has been in use since 1991, but HTML 4.0 (December 1997) was the first standardized version where international characters were given reasonably complete treatment. When an HTML document includes special characters outside the range of seven-bit ASCII, two goals are worth considering: the information's integrity, and universal browser display.

The document character encoding

When HTML documents are served, there are three ways to tell the browser what specific character encoding is to be used for display to the reader. First, HTTP headers can be sent by the web server along with each web page (HTML document). A typical HTTP header looks like this:

Content-Type: text/html; charset=ISO-8859-1
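As an illustration only, the charset parameter of such a header can be read with Python's standard library. The helper name charset_from_content_type is made up for this sketch; it relies on email.message.Message, which parses MIME-style header parameters and returns the charset in lowercase.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercase charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

The second call shows the case discussed later in this section, where the server sends no encoding information at all.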

For HTML (not usually XHTML), the second method is for the HTML document to include this information at its top, inside the HEAD element:

<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
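A minimal sketch of how a program might locate this declaration, assuming the document has already been decoded well enough to be scanned as text (the declaration itself is normally plain ASCII). The class and function names are hypothetical; only html.parser.HTMLParser and email.message.Message come from the standard library.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # HTMLParser lowercases attribute names
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content")
        if http_equiv == "content-type" and content:
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()


def meta_charset(html_text):
    """Return the declared charset in lowercase, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


page = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=US-ASCII"></head></html>'
print(meta_charset(page))
```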

XHTML documents have a third option: to express the character encoding in the XML declaration (preamble), for example:

<?xml version="1.0" encoding="ISO-8859-1"?>

These methods each advise the receiver that the file being sent uses the character encoding specified. The character encoding is often referred to as the "character set" and it indeed does limit the characters in the raw source text. However, the HTML standard states that the "charset" is to be treated as an encoding of Unicode characters and provides a way to specify characters that the "charset" does not cover. The term code page is also used similarly.
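To make the last point concrete, here is a small sketch of how characters outside the declared "charset" can still be carried as numeric character references. It uses Python's built-in xmlcharrefreplace error handler; the sample string is invented for illustration.

```python
import html

text = "λ and é"

# Characters the target encoding cannot represent become decimal references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")
as_latin1 = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(as_ascii)   # both non-ASCII characters are replaced by references
print(as_latin1)  # é exists in ISO-8859-1, so only λ is replaced

# Decoding the bytes and resolving the references recovers the original text.
round_trip = html.unescape(as_ascii.decode("ascii"))
print(round_trip == text)
```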

It is a bad idea to send incorrect information about the character encoding used by a document. For example, a server where multiple users may place files created on different machines cannot promise that all the files it sends will conform to the server's specification, since some users may have machines with different character sets. For this reason, many servers simply do not send the information at all, thus avoiding making false promises. However, this may result in the equally bad situation where the user agent displays the document incorrectly because neither sending party has specified a character encoding.

The HTTP header specification supersedes all HTML (or XHTML) meta tag specifications, which can be a problem if the header is incorrect and one does not have the access or the knowledge to change it.

Browsers receiving a file with no character encoding information must make a blind assumption. For Western European languages, it is typical and fairly safe to assume windows-1252 (which is similar to ISO-8859-1 but has printable characters in place of some control codes that are forbidden in HTML anyway), but it is also common for browsers to assume the character set native to the machine on which they are running. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages regularly (in some cases, always) require characters outside that range. In CJK environments where there are several different multi-byte encodings in use, auto-detection is often employed.
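The effect of a wrong guess can be reproduced in a few lines. This is only a sketch with an invented sample word: bytes written as UTF-8 are decoded under the windows-1252 assumption described above.

```python
original = "café"
raw_bytes = original.encode("utf-8")  # what the author's editor saved

# A browser that guesses windows-1252 turns the two-byte sequence for "é"
# into two separate characters.
wrong_guess = raw_bytes.decode("windows-1252")
right_guess = raw_bytes.decode("utf-8")

print(wrong_guess)
print(right_guess)

# Pure ASCII text is unaffected by the same wrong guess.
ascii_only = "cafe".encode("utf-8").decode("windows-1252")
print(ascii_only)
```

The ASCII-only case shows why the problem is easy to overlook on English-language pages.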

It is increasingly common for multilingual websites to use one of the Unicode/ISO 10646 transformation formats, as this allows use of the same encoding for all languages. Generally UTF-8 is used rather than UTF-16 or UTF-32 because it is easier to handle in programming languages that assume a byte-oriented ASCII superset encoding, and it is efficient for ASCII-heavy text (which HTML tends to be).
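The size argument can be checked directly. The following sketch compares encoded lengths of a made-up, ASCII-only markup fragment; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the byte length of text in three Unicode transformation formats."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le variant: no BOM added
        "utf-32": len(text.encode("utf-32-le")),
    }


markup = '<p class="note">Hello</p>'
sizes = encoded_sizes(markup)
print(sizes)
```

For text made up entirely of ASCII characters, UTF-16 takes twice and UTF-32 four times as many bytes as UTF-8. The ratio changes for text dominated by non-ASCII characters.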

Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the page's creator and reader are both assuming some machine-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets will not see the page as intended.

Character references

Main articles: character entity reference and numeric character reference

In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. HTML's usage of character references derives from SGML.

Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, the character 'λ' can be encoded as &lambda; in an HTML 4 document. Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters.

Numeric character references can be in decimal format, &#DD;, where DD is a variable-width string of decimal digits. Similarly there is a hexadecimal format, &#xHHHH;, where HHHH is a variable-width string of hexadecimal digits, though many consider it good practice to never use fewer than four hex digits, and never use an odd number of hex digits (due to the correspondence of two hex digits to one byte). Unlike named entities, hexadecimal character references are case-insensitive in HTML. For example, λ can also be represented as &#955;, &#x03BB; or &#X03bb;.
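As a quick check of the equivalence described above, Python's html.unescape (which follows HTML5 parsing rules rather than HTML 4, an assumption worth keeping in mind) resolves all four spellings of λ to the same character. The to_decimal_ref and to_hex_ref helpers are hypothetical names introduced for this sketch.

```python
import html


def to_decimal_ref(ch):
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch):
    """Build a hexadecimal reference padded to at least four hex digits."""
    return f"&#x{ord(ch):04X};"


forms = ["&lambda;", "&#955;", "&#x03BB;", "&#X03bb;"]
decoded = [html.unescape(form) for form in forms]
print(decoded)

print(to_decimal_ref("λ"), to_hex_ref("λ"))
```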

Numeric references always refer to Universal Character Set code points, regardless of the page's encoding. Using numeric references that refer to UCS control code ranges is forbidden, with the exception of the linefeed, tab, and carriage return characters. That is, characters in the hexadecimal ranges 00–08, 0B–0C, 0E–1F, 7F, and 80–9F cannot be used in an HTML document, not even by reference, so "&#153;", for example, is not allowed. However, for backward compatibility with early HTML authors and browsers that ignored this restriction, raw characters and numeric character references in the 80–9F range are interpreted by some browsers as representing the characters mapped to bytes 80–9F in the Windows-1252 encoding.
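The ranges listed above can be written down as a small check. This is a sketch: FORBIDDEN_RANGES and is_forbidden_code_point are names invented here. The second half shows the backward-compatibility behaviour using Python's html.unescape, which implements the HTML5 rules and remaps 80–9F references through Windows-1252.

```python
import html

# Hexadecimal ranges named in the text (inclusive bounds).
FORBIDDEN_RANGES = [(0x00, 0x08), (0x0B, 0x0C), (0x0E, 0x1F), (0x7F, 0x7F), (0x80, 0x9F)]


def is_forbidden_code_point(code_point):
    """True if a numeric reference to this code point is disallowed."""
    return any(low <= code_point <= high for low, high in FORBIDDEN_RANGES)


print(is_forbidden_code_point(153))   # &#153; falls in 80-9F
print(is_forbidden_code_point(0x0A))  # linefeed is one of the exceptions

# Lenient interpretation: the reference is read as Windows-1252 byte 0x99.
lenient = html.unescape("&#153;")
from_byte = bytes([0x99]).decode("windows-1252")
print(lenient == from_byte)
```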

Unnecessary use of HTML character references may significantly reduce HTML readability. If the character encoding for a web page is chosen appropriately, then HTML character references are usually only required for a few special characters (or not at all if a native Unicode encoding like UTF-8 is used).

XML character entity references

Unlike traditional HTML with its large range of character entity references, XML has only five predefined character entity references. These are used to escape characters that are markup-sensitive in certain contexts:

&amp; → & (ampersand, U+0026)
&lt; → < (less-than sign, U+003C)
&gt; → > (greater-than sign, U+003E)
&quot; → " (quotation mark, U+0022)
&apos; → ' (apostrophe, U+0027)

All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin lower-case E with acute accent, U+00E9 in Unicode) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML 4 entity set and XML's &apos; entity, which does not appear in HTML 4.

However, use of &apos; in XHTML should generally be avoided for compatibility reasons. &#39; or &#x0027; may be used instead.
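These XML rules can be observed with a conforming parser. The sketch below uses Python's xml.etree.ElementTree; the parsed_text helper is a hypothetical wrapper that returns None when the parser rejects the input.

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_source):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


print(parsed_text("<p>&eacute;</p>"))  # undefined entity: rejected
print(parsed_text("<p>&#xE9;</p>"))    # lowercase x: accepted
print(parsed_text("<p>&#XE9;</p>"))    # uppercase X: rejected
print(parsed_text("<p>&apos;</p>"))    # predefined in XML: accepted

# Declaring the entity in an internal DTD subset makes &eacute; usable.
with_dtd = '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>&eacute;</p>'
print(parsed_text(with_dtd))
```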

&amp; has the special problem that it starts with the character to be escaped. A simple Internet search finds thousands of sequences &amp;amp;amp;amp; ... in HTML pages for which the algorithm to replace an ampersand by the corresponding character entity reference was applied too often.
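The runaway sequence is easy to reproduce. In this sketch, Python's html.escape stands in for whatever escaping routine a site might use, and the sample string is invented: each extra pass re-escapes the ampersand that the previous pass produced.

```python
import html

raw = "Fish & Chips"

once = html.escape(raw)     # correct: escaped exactly once
twice = html.escape(once)   # bug: output of escaping fed back in
thrice = html.escape(twice)

print(once)
print(twice)
print(thrice)

# One unescape pass removes only one layer of escaping.
print(html.unescape(twice) == once)
```

The usual remedy is to escape exactly once, at the point where text is written into markup, and to keep unescaped text everywhere else.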

HTML character entity references

For a list of all named HTML character entity references, see List of XML and HTML character entity references (approximately 250 entries).

The above information uses material from Wikipedia and is licensed under the GNU Free Documentation License.
This page was last archived on Sat Jul 11 17:47:31 2009.