html-spec/Text.html at master · dckc/html-spec · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
<!DOCTYPE HTML PUBLIC "-//W3O//DTD WWW HTML 2.0//EN">
<HTML>

<HEAD>
<TITLE>Structured Text in HTML</TITLE></HEAD>

<BODY>

<H1>Structured Text</H1>

<P>An HTML instance is like a text file,
except that some of the characters are interpreted as markup.
The markup gives structure to the document.

<P>The instance represents a hierarchy of elements.
Each element has a <A NAME="z0"
HREF="Text.html#name">name</A> , some <A
NAME="z1" HREF="Text.html#attribute">attributes</A> ,
and some content. Most elements are represented in the document
as a start tag, which gives the name and attributes,
followed by the content, followed by the end tag.
For example:

<PRE>
	&#60;!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN"&#62;
	&#60;HTML&#62;
	  &#60;HEAD&#62;
	    &#60;TITLE&#62;
	      A sample HTML document
	    &#60;/TITLE&#62;
	  &#60;/HEAD&#62;

	  &#60;BODY&#62;
	    &#60;H1&#62;
	      An Example of Structure
	      &#60;br&#62;
	      In HTML
	    &#60;/H1&#62;
	    &#60;P&#62;
	      Here's a typical paragraph.
	    &#60;UL&#62;
	      &#60;LI&#62;
	        Item one has an
	        &#60;A NAME="anchor"&#62;
	          anchor
	        &#60;/A&#62;
	      &#60;LI&#62;
	        Here's item two.
	    &#60;/UL&#62;
	  &#60;/BODY&#62;
	&#60;/HTML&#62;
</PRE>

<P>Some elements (e.g. <CODE>BR</CODE>) are empty.
They have no content. They show up as just a start tag.

<P>For the rest of the elements, the content is a sequence of
data characters and nested elements. Some things such as forms
and anchors cannot be nested, in which case this is mentioned
in the text. Anchors and character highlighting may be put inside
other constructs.

<H2><A NAME="Tags">Tags</A></H2>

<P>Most elements start and end with tags.
Empty elements have no end tag. Start tags are delimited by &#60;
and &#62;, and end tags are delimited by &#60;/ and &#62;.
For example:

<PRE>
	&#60;h1&#62; ... &#60;/H1&#62;   &#60;!-- uppercase = lowercase  --&#62;
	&#60;h1 &#62; ... &#60;/h1 &#62; &#60;!-- spaces OK before &#62; --&#62;
</PRE>

<P>The following are <EM>not</EM> valid tags:

<PRE>
	&#60; h1&#62;             &#60;!-- this is not a tag at all --&#62;
	&#60;H1/&#62; &#60;H=1&#62;       &#60;!-- these are markup errors --&#62;
</PRE>

<H3>NOTE: SHORTTAG</H3>

<P><EM>The SGML declaration for HTML specifies SHORTTAG YES ,
which means that there are some other valid syntaxes for tags,
e.g. NET tags: &#60;em/.../ , empty start tags:
&#60;&#62; , empty end tags: &#60;/&#62; .
Until such time as support for these idioms is widely deployed,
their use is strongly discouraged.</EM>

<P>The start and end tags for the HTML, HEAD,
and BODY elements are omissable. The end tags of some other elements
(e.g. P, LI, DT, DD) can be ommitted (see the DTD for details).
This does not change the document structure -- the following
documents are equivalent:

<PRE>
	&#60;!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN"&#62;
	  &#60;TITLE&#62;Structural Example&#60;/TITLE&#62;
	  &#60;H1&#62;Structural Example&#60;/H1&#62;
	  &#60;P&#62;A paragraph...

	&#60;!DOCTYPE HTML PUBLIC "-//W3O//DTD W3 HTML 2.0//EN"&#62;
	  &#60;HTML&#62;&#60;HEAD&#62;
	  &#60;TITLE&#62;Structural Example&#60;/TITLE&#62;
	  &#60;/HEAD&#62;
	  &#60;BODY&#62;
	  &#60;H1&#62;Structural Example&#60;/H1&#62;
	  &#60;P&#62;A paragraph...&#60;/P&#62;
	  &#60;/BODY&#62;
</PRE>

<H3><A NAME="name">Names</A></H3>

<P>The element name immediately follows the tag open delimiter.
Names consist of a letter followed by up to 72 letters,
digits, periods, or hyphens. Names are not case sensitive.
For example:

<PRE>
	A H1 h1 another.name name-with-hyphens
</PRE>

<H3><A NAME="attribute">Attributes</A></H3>

<P>In a start tag, whitespace and attributes are allowed between
the element name and the closing delimiter.
An attribute consists of a name, an equal sign,
and a value. Whitespace is allowed around the equal sign.

<P>The value is either:

<UL>

<LI>A string literal, delimited by single quotes or double quotes,
or

<LI>A name token; that is, a sequence of letters,
digits, periods, or hyphens.
</UL>

<P>For example:

<PRE>
	&#60;A HREF="http://host/dir/file.html"&#62;
	&#60;A HREF=foo.html &#62;
	&#60;IMG SRC="mrbill.gif" ALT="Mr. Bill says, &#38;#34;Oh Noooo&#38;#34;"&#62;
</PRE>

<P>The length of an attribute value (after replacing entity and
numeric character referencees) is limited to 1024 characters.

<H3>NOTE: Unquoted Attribute Value Literals</H3>

<P><EM>Some implementations allowed any character except space
or '&#62;' in a name token, for example &#60;A HREF=foo/bar.html&#62;
. As a result, there are many documents that contain attribute
values that should be quoted but are not.
While parser implementators are encouraged to support this idiom,
its use in future documents is stictly prohibited.</EM>

<H3>NOTE: <CODE>&#62;</CODE> in Attribute Value Literals</H3>

<P><EM>Some implementations also consider any occurence of the
&#62; character to signal the end of a tag.
For compatibility with such implementations,
it may be necessary to represent &#62; with an entity or numeric
character reference; for example: &#60;IMG SRC="eq1.ps" ALT="a
&#38;#62; b"&#62;</EM>

<P>Attributes with a declared value of <CODE>NAME</CODE> (e.g.
<CODE>ISMAP</CODE>, <CODE>COMPACT</CODE>) may be written using
a minimized syntax. The markup:

<PRE>
	&#60;UL COMPACT="COMPACT"&#62;
</PRE>

<P>can be written as

<PRE>
	&#60;UL COMPACT&#62;
</PRE>

<H2>Undefined tag and attribute names</H2>

<P>It is a principle to be conservative in that which one produces,
and liberal in that which one accepts. HTML parsers should be
liberal except when verifying code. HTML generators should generate
strictly conforming HTML.

<P>The behaviour of WWW applications reading HTML documents and
discovering tag or attribute names which they do not understand
should be to behave as though, in the case of a tag,
the whole tag had not been there but its content had,
or in the case of an attribute, that the attribute had not been
present.

<H2><A NAME="Data">Character Data </A>
</H2>

<P>The characters between the tags represent text in the ISO-Latin-1
character set, which is a superset of ASCII.
Because certain characters will be interpreted as markup,
they should be "escaped"; that is, represented by markup -- entity
or numeric character references. For example:

<PRE>
                When a&#38;#60;b, we can show that...
                Brought to you by AT&#38;amp;T
</PRE>

<P>The HTML DTD includes entities for each of the non-ASCII characters
so that one may reference them by name if it is inconvenient
to enter them directly:

<PRE>           Kurt G&#38;ouml;del was a famous logician and mathematician.

</PRE>

<H3>NOTE: Markup Characters</H3>

<P><EM>To ensure that a string of characters has no markup,
it is sufficient to represent all occurrences of <CODE>&#60;</CODE>,
<CODE>&#62;</CODE>, and <CODE>&#38;</CODE> by character or entity
references.</EM>

<H3>NOTE: CDATA, RCDATA</H3>

<P><EM>There are SGML features ( CDATA ,
RCDATA ) to allow most &#60; , &#62; , and &#38; characters to
be entered without the use of entity or character references.
Because these features tend to be used and implemented inconsistently,
and because they require 8-bit characters to represent non-ASCII
characters, they are not employed in this version of the HTML
DTD.</EM>

<P><EM>An earlier HTML specification included an XMP element
whose syntax is not expressible in SGML.
Inside the XMP , no markup was recognized except the &#60;/XMP&#62;
end tag. While implementations are encouraged to support this
idiom, its use is obsolete.</EM>

<H3>Comments</H3>

<P>To include comments in an HTML document that will be ignored
by the parser, surround them with &#60;!-- and --&#62;.
After the comment delimiter, all text up to the next occurrence
of -- is ignored. Hence comments cannot be nested.
Whitespace is allowed between the closing -- and &#62;.
(But not between the opening &#60;! and --.)

<P>For example:

<PRE>&#60;HEAD&#62;
&#60;TITLE&#62;HTML Guide: Recommended Usage&#60;/TITLE&#62;
&#60;!-- Id: Text.html,v 1.6 1994/04/25 17:33:48 connolly Exp --&#62;
&#60;/HEAD&#62;
</PRE>

<H3>Note: Tags in Comments</H3>

<P><EM>Some historical implementations incorrectly consider a
&#62; sign to terminate a comment.</EM>
</BODY>
</HTML>