Network Working Group                                          K. Davies
Internet-Draft                                                     ICANN
Intended status: Informational                                A. Freytag
Expires: September 6, 2014                                    ASMUS Inc.
                                                           March 5, 2014


            Representing Label Generation Rulesets using XML
                       draft-davies-idntables-06

Abstract

   This document describes a method of representing the domain name
   registration policy for a zone administrator using Extensible Markup
   Language (XML).  These policies, known as "Label Generation Rulesets"
   (LGRs), are particularly used for the implementation of
   Internationalized Domain Names (IDNs).  The rulesets are used to
   implement and share policy defining which labels and specific Unicode
   code points are permitted for registrations, which alternative code
   points are considered variants, and what actions may be performed on
   labels containing those variants.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on September 6, 2014.

Copyright Notice

   Copyright (c) 2014 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Simplified BSD License text
   as described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Simplified BSD License.

Davies & Freytag        Expires September 6, 2014               [Page 1]
Internet-Draft      Label Generation Rulesets in XML          March 2014

Table of Contents

   1.  Introduction
   2.  Design Goals
   3.  Requirements
   4.  LGR Format
     4.1.  Namespace
     4.2.  Basic Structure
     4.3.  Metadata
       4.3.1.  The version Element
       4.3.2.  The date Element
       4.3.3.  The language Element
       4.3.4.  The domain Element
       4.3.5.  The description Element
       4.3.6.  The validity-start and validity-end Elements
       4.3.7.  The unicode-version Element
       4.3.8.  The references Element
   5.  Code Points and Variants
     5.1.  Sequences
     5.2.  Variants
       5.2.1.  Basic Variants
       5.2.2.  Null Variants
       5.2.3.  Dispositions
       5.2.4.  The ref Attribute
       5.2.5.  Variants with Reflexive Mapping
       5.2.6.  Conditional Variants
       5.2.7.  The comment Attribute
     5.3.  Code Point Tagging
   6.  Whole Label and Context Evaluation
     6.1.  Basic Concepts
     6.2.  Character Classes
       6.2.1.  Tag-based Classes
       6.2.2.  Unicode Property-based Classes
       6.2.3.  Explicitly Declared Classes
       6.2.4.  Combined Classes
     6.3.  Whole Label and Context Rules
       6.3.1.  The rule Element
       6.3.2.  The Match Operators
       6.3.3.  The count Attribute
       6.3.4.  The name and byref Attributes
       6.3.5.  The choice Element
       6.3.6.  Literal Code Point Sequences
       6.3.7.  The any Element
       6.3.8.  The start and end Elements
       6.3.9.  Example rule from IDNA2008
     6.4.  Parameterized Context or When Rules
       6.4.1.  The anchor Element
       6.4.2.  The look-behind and look-ahead Elements
       6.4.3.  Omitting the anchor Element
   7.  The action Element
     7.1.  The match and not-match Attributes
     7.2.  Actions matching Variant Dispositions
       7.2.1.  Variant Disposition triggers
       7.2.2.  Example for RFC3743-style Tables
     7.3.  Recommended Disposition Values
     7.4.  Precedence
     7.5.  Implied Actions
     7.6.  Default Actions
   8.  Processing a Label Against an LGR
     8.1.  Determining Eligibility for a Label
     8.2.  Determining Variants for a Label
     8.3.  Determining a Disposition for a Label or variant Label
   9.  Conversion to and from Other Formats
   10. IANA Considerations
   11. Security Considerations
   12. References
   Appendix A.  Example Table
   Appendix B.  How to Translate RFC 3743 based Tables into the XML
                Format
   Appendix C.  Indic Syllable Structure Example
   Appendix D.  RelaxNG Schema
   Appendix E.  Acknowledgements
   Appendix F.  Editorial Notes
     F.1.  Known Issues and Future Work
     F.2.  Change History
   Authors' Addresses

1.  Introduction

   This memo describes a method of using Extensible Markup Language
   (XML) to describe the algorithm used to determine whether a given
   domain label is permitted, and under which conditions, based on the
   code points it contains and their context.  These algorithms are
   comprised of a list of permissible code points, variant code point
   mappings, and a set of rules acting on them.
   These algorithms form part of a zone administrator's policies, and
   can be referred to as Label Generation Rulesets (LGRs), or IDN
   tables.

   Administrators of the zones for top-level domain registries have
   historically published their LGRs using ASCII text or HTML.  The
   formatting of these documents has been loosely based on the format
   used for the Language Variant Table in [RFC3743].  [RFC4290] also
   provides a "model table format" that describes a similar set of
   functionality.  Common to these formats is that the algorithms used
   to evaluate the data therein are implicit or specified elsewhere.

   Through the first decade of IDN deployment, experience has shown that
   LGRs derived from these formats are difficult to consistently
   implement and compare due to their differing formats.  A universal
   format, such as one using a structured XML format, will assist by
   improving machine-readability, consistency, reusability and
   maintainability of LGRs.  It also provides for more complex
   conditional implementation of variants that reflects the known
   requirements of current zone administrator policies.

   Another feature of this format is that it allows many of the
   algorithms to be made explicit and machine implementable.  A
   remaining small set of implicit algorithms is described in this
   document to allow commonality in implementation.

   While the predominant usage of this specification is to represent IDN
   label policy, the format is not limited to IDN usage and may also be
   used for describing ASCII domain name label rulesets.

2.  Design Goals

   The following items are explicit design goals of this format:

   o  The format MUST be implementable in software in a reasonably
      straightforward manner;

   o  The format SHOULD be able to be checked for formatting errors,
      such that common mistakes can be caught;

   o  An LGR MUST be able to express the set of valid code points that
      are allowed for registration under a specific zone administrator's
      policies;

   o  An LGR MUST be able to express computed alternatives to a given
      domain name based on mapping relationships between code points,
      whether one-to-one or many-to-many.  These computed alternatives
      are commonly known as "variants";

   o  Variant code points SHOULD be able to be tagged with specific
      dispositions or categories that can be used to support registry
      policy (such as whether to allocate the computed variant in the
      zone, or to merely block it from registration);

   o  Variants and code points MUST be able to be stipulated based on
      contextual information.  For example, specific variants may only
      be applicable when they follow another specific code point, or
      when the code point is displayed in a specific presentation form;

   o  The data contained within an LGR MUST be able to be interpreted
      unambiguously, such that independent implementations that utilize
      the contents will arrive at the same results;

   o  To the largest extent possible, policy rules SHOULD be able to be
      specified in the XML format without relying on hidden or built-in
      algorithms in implementations;

   o  LGRs SHOULD be suitable for comparison and re-use, such that one
      could easily compare the contents of two or more to see the
      differences, to merge them, and so on;

   o  LGRs SHOULD be able to be merged automatically, at the minimum
      where code points and variant information are concerned;

   o  As many existing IDN tables as practicable SHOULD be able to be
      migrated to the LGR format with all applicable logic retained.

   It is explicitly NOT the goal of this format to stipulate what code
   points should be listed in an LGR by a zone administrator.  Which
   registration policies are used for a particular zone is outside the
   scope of this memo.

3.  Requirements

   To be able to fulfill the known utilization of LGRs, the existing
   corpus of published IDN tables was reviewed to prepare this
   specification.  In addition, the requirements of ICANN's work to
   implement an LGR for the DNS Root Zone [LGR-PROCEDURE] were also
   considered.  In particular, Section B of that document identifies
   five specific requirements for an LGR methodology.  Finally, the
   syntax and rules in [RFC5892] and [RFC3743] were reviewed.

   Altogether these reviews resulted in the following requirements:

   o  The ability to identify a set of code points that are permitted.

   o  The ability to include code points that are permitted only in
      given contexts.

   o  The ability to represent a list of variants, if any, for each code
      point.

   o  The ability to include variants that are defined only in given
      contexts.

   o  The ability to assign a single disposition or categorization for
      each variant.

   o  The ability to assign variants with reflexive mappings.

   o  The ability to assign variants that have a code point sequence as
      target.

   o  The ability to express variant mappings symmetrically.

   o  A method of identifying code points that are related, using one or
      several tags per code point.

   o  The ability to describe rules regarding the possible actions that
      may be performed on the resulting label (such as block,
      allocatable, etc.)

   o  The ability to describe rules that check for ill-formed
      combinations across the whole label.

   o  The ability to describe rules that define contexts in which code
      points are permissible or variants defined.

   o  The ability to preserve normative reference information as well as
      informative comments.

4.  LGR Format

   An LGR is expressed as a well-formed XML Document [XML].

4.1.  Namespace

   The XML Namespace URI is [TBD].

4.2.  Basic Structure

   The basic XML framework of the document is as follows:

       <lgr>
           ...
       </lgr>

   The "lgr" element contains several sub-elements.  First is a "meta"
   element that contains all meta-data associated with the IDN table,
   such as its authorship, what it is used for, implementation notes and
   references.  This is followed by a "data" element that contains the
   substantive code point data.  Finally, an optional "rules" element
   contains information on contextual and whole-label evaluation rules,
   if any, along with any specific action elements providing for the
   disposition of labels and computed variant labels.

       <lgr>
           <meta>
               ...
           </meta>
           <data>
               ...
           </data>
           <rules>
               ...
           </rules>
       </lgr>

   A document MUST contain exactly one "lgr" element.  Each "lgr"
   element MUST contain exactly one "data" element, optionally preceded
   by one "meta" element and optionally followed by one "rules" element.

4.3.  Metadata

   The "meta" element is used to express meta-data associated with the
   LGR.  It can be used to identify the author or relevant contact
   person, explain the intended usage of the LGR, and provide
   implementation notes as well as references.  The data contained
   within is not required by software consuming the LGR in order to
   calculate valid labels, or to calculate variants.
   However, the "unicode-version" element MUST be used by a consumer of
   the table to identify that it has the right Unicode data to perform
   operations on the table.

4.3.1.  The version Element

   The "version" element is used to uniquely identify each version of
   the LGR being represented.  No specific format is required, but it is
   RECOMMENDED that it be a positive integer, which is incremented with
   each revision of the file.

   An example of a typical first edition of a document:

       <version>1</version>

   The version element may have an optional "comment" attribute.

       <version comment="...">1</version>

4.3.2.  The date Element

   The "date" element is used to identify the date the LGR was posted.
   The contents of this element MUST be a valid ISO 8601 date string as
   described in [RFC3339].

   Example of a date:

       <date>2009-11-01</date>

4.3.3.  The language Element

   The "language" element signals that the LGR is associated with a
   specific language or script.  The value of the language element must
   be a valid language tag as described in [RFC5646].  The tag may
   simply refer to a script if the LGR is not referring to a specific
   language.

   Example of an English language LGR:

       <language>en</language>

   If the LGR applies to a specific script, rather than a language, the
   "und" language tag should be used followed by the relevant [RFC5646]
   script subtag.  For example, for a Cyrillic script LGR:

       <language>und-Cyrl</language>

   If the LGR covers a specific set of multiple languages or scripts,
   the language element can be repeated.  However, for cases of a
   script-specific LGR exhibiting insignificant admixture of code points
   from other scripts, it is RECOMMENDED to use a single "language"
   element identifying the predominant script.  In the exceptional case
   of a multi-script LGR where no script is predominant, use Zyyy
   (Common):

       <language>und-Zyyy</language>

   Note that for the particular case of Japanese, a script tag "Japn"
   exists that matches the mixture of scripts used in writing that
   language.
   The preferred language element would be:

       <language>und-Japn</language>

4.3.4.  The domain Element

   This optional element refers to a domain to which this policy is
   applied.  The value must be a valid domain name that represents the
   apex of the zone to which the policy is applied, and in the case of
   the root zone, should be represented as ".".

       <domain>example.com</domain>

   Multiple "domain" elements may be used to reflect a list of domains.

4.3.5.  The description Element

   The "description" element is a free-form element that contains any
   additional relevant description that is useful for the user in its
   interpretation.  Typically, this field contains authorship
   information, as well as additional context on how the LGR was
   formulated (such as citations and references), and how it has been
   applied.

   The element has an optional "type" attribute, which refers to the
   internet media type of the enclosed data.  Typical types would be
   "text/plain" or "text/html".  The attribute SHOULD be a valid MIME
   type.  If supplied, the contents will be assumed to be of that media
   type.  If the description lacks a type attribute, it will be assumed
   to be plain text ("text/plain").

4.3.6.  The validity-start and validity-end Elements

   The "validity-start" and "validity-end" elements are optional
   elements that describe, respectively, the time from which the
   contents of the LGR become valid (i.e. are used in registry policy)
   and the time at which the contents of the LGR cease to be used.

   The times should conform to the format described in section 5.6 of
   [RFC3339].  Each may consist of a date, or a date and time stamp.

4.3.7.  The unicode-version Element

   Whenever an IDN table depends on character properties from a given
   version of the Unicode standard, the version number used in creating
   the LGR MUST be listed.
   If any software processing the table does not have access to
   character property data of the requisite version, it MUST NOT perform
   any operations relating to whole-label evaluation.

   While some Unicode code points may not have been assigned in an
   earlier version, leaving properties for these code points undefined,
   in other cases their properties may have been updated in the Unicode
   standard between versions.  It is RECOMMENDED to only reference
   stable or immutable properties.

   For a given LGR, the property values for the code points in the
   actual repertoire may be unchanged in a later version of Unicode,
   even though other changes were made in that standard.  If that fact
   can be established, it MAY be acceptable to use tools based on a
   later version of Unicode.

   [[TODO: A method of indicating a range of permissible Unicode
   versions should be described.]]

       <unicode-version>6.2</unicode-version>

   It is not necessary to include a "unicode-version" element for files
   that do not make use of Unicode properties.  Because Unicode has been
   strictly additive from Version 1.1, the required minimum version for
   the repertoire can be uniquely determined by checking the code point
   values in any "cp" attributes against the "age" property in [UAX42].

4.3.8.  The references Element

   A Label Generation Ruleset may define a list of references which are
   used to associate various elements in the LGR to one or more
   normative references.  In contrast, global references for the entire
   LGR can simply be part of the "description" element.

   References are specified in an optional "references" element that
   contains any number of "reference" elements, each with a unique "id"
   attribute.  It is RECOMMENDED that the "id" attribute be a zero-based
   integer.  The value of each "reference" element SHOULD be the
   citation of a standard, dictionary or other specification in any
   suitable format.
   In addition to an "id" attribute, a reference element may have a
   "comment" attribute for an optional free-form annotation.

       <references>
           <reference id="0">The Unicode Standard, Version 7.0</reference>
           <reference id="1">Big-5: Computer Chinese Glyph and Character
               Code Mapping Table, Technical Report C-26,
               1984</reference>
           <reference id="2">ISO/IEC 10646:2012 3rd edition</reference>
           ...
       </references>

   A reference can be associated with many types of elements in the
   "data" or "rules" sections of the LGR by using an optional "ref"
   attribute (see Section 5.2.4).  A "ref" attribute may not occur on
   elements that are named references to character classes and rules,
   nor on certain other specific element types.  See the descriptions of
   these elements below.

5.  Code Points and Variants

   The bulk of a label generation ruleset is a description of which set
   of code points are eligible for a given label.  For rulesets that
   perform operations that result in potential variants, the code
   point-level relationships between variants also need to be described.

   The code point data is collected within a "data" element.  Within
   this element, a series of "char" and "range" elements describe
   eligible code points, or ranges of code points, respectively.

   Discrete permissible code points or code point sequences are declared
   with a "char" element.

   Ranges of permissible code points may be stipulated with a "range"
   element.  The range is inclusive of the first and last code points.
   Whether code points are specified individually or as part of a range
   makes no difference in processing the data, and tools reading or
   writing the XML format are not required to retain a distinction.  All
   attributes defined for a range element are treated as if applied to
   each code point within it.

   Code points must be expressed in uppercase hexadecimal, zero padded
   to a minimum of 4 digits; in other words, according to the standard
   Unicode convention but without the prefix "U+".
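
   As a rough illustration of the notation rules above, the fragment
   below sketches how individual code points and a range might be
   written.  The code points are chosen arbitrarily for illustration,
   and the "first-cp" and "last-cp" attribute names on the "range"
   element are an assumption, since this section does not name them.

```xml
<data>
  <!-- Illustrative only; code points were picked arbitrarily. -->
  <!-- U+002D: uppercase hex, padded to four digits, no "U+" prefix -->
  <char cp="002D"/>
  <!-- U+00E9: leading zeros pad the value to four digits -->
  <char cp="00E9"/>
  <!-- U+20000: values above FFFF use as many digits as they need -->
  <char cp="20000"/>
  <!-- Range U+0030 to U+0039, inclusive of both ends.
       The attribute names first-cp and last-cp are assumed here. -->
  <range first-cp="0030" last-cp="0039"/>
</data>
```

   Forms such as "2d", "2D" or "U+002D" would not follow the convention
   described above, because they are lowercase, unpadded, or carry the
   prefix.
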
   The rationale for not allowing other encoding formats, including
   native Unicode encoding in XML, is explored in [UAX42].  The XML
   conventions used in this format, including the element and attribute
   names, mirror that document where practical and reasonable to do so.
   It is RECOMMENDED to list all "char" elements in ascending order of
   their "cp" attribute.

5.1.  Sequences

   A sequence of two or more code points may be specified in an LGR, for
   example, when defining the source for n:m variant mappings.  Another
   use of sequences would be in cases when the exact sequence of code
   points is required to occur in order for the constituent elements to
   be eligible, such as when a specific code point is only eligible when
   preceded or followed by another code point.  The following would
   define the eligibility of the MIDDLE DOT (U+00B7) only when both
   preceded and followed by the LATIN SMALL LETTER L (U+006C):

       <char cp="006C 00B7 006C"/>

   As an alternative to using sequences to define a required context, a
   "char" or "range" element may specify conditional context in a "when"
   attribute as described below in Section 5.2.6.  The latter method is
   more flexible: the conditional context is not limited to a specific
   code point, and both prohibited and required context can be
   specified.

5.2.  Variants

   While most LGRs typically only determine code point eligibility,
   others additionally specify a mapping of code points to other code
   points, known as "variants".  What constitutes a variant code point
   is a matter of policy, and varies for each implementation.  The
   following examples are intended to demonstrate the syntax; they are
   not necessarily typical.

5.2.1.  Basic Variants

   Variant code points are specified using one or more "var" elements as
   children of a "char" element.
   For example, to map LATIN SMALL LETTER V (U+0076) as a variant of
   LATIN SMALL LETTER U (U+0075):

       <char cp="0075">
           <var cp="0076"/>
       </char>

   A sequence of multiple code points can be specified as a variant of a
   single code point.  For example, the sequence of LATIN SMALL LETTER O
   (U+006F) then LATIN SMALL LETTER E (U+0065) might hypothetically be
   specified as a variant for a LATIN SMALL LETTER O WITH DIAERESIS
   (U+00F6) as follows:

       <char cp="00F6">
           <var cp="006F 0065"/>
       </char>

   The "var" element specifies variant mappings in only one direction,
   even though the variant relation is usually considered symmetric,
   that is, if A is a variant of B then B should also be a variant of A.
   The format requires that the inverse of the variant be given
   explicitly to fully specify symmetric variant relations in the IDN
   table.  This has the beneficial side effect of making the symmetry
   explicit:

       <char cp="006F 0065">
           <var cp="00F6"/>
       </char>
       <char cp="00F6">
           <var cp="006F 0065"/>
       </char>

   Both the source and target of a variant mapping may be sequences.  As
   it is not possible to specify variants for ranges, ranges cannot be
   used for characters for which variant relations need to be defined.

   All variants MUST be unique.  For a given "char" element, all
   variants must have a unique combination of "cp", "when" and
   "not-when" attributes.  It is RECOMMENDED to list the "var" elements
   in ascending order of their target code point sequence.

5.2.2.  Null Variants

   To specify a null variant, which is a variant string that maps to no
   code point, use an empty "cp" attribute.  For example, to map a
   string with a ZERO WIDTH NON-JOINER (U+200C) to the same string
   without the ZERO WIDTH NON-JOINER:

       <char cp="200C">
           <var cp=""/>
       </char>

   This is useful in expressing the intent that some code points in a
   label are to be mapped away when generating a canonical variant of
   the label.  However, in tables that are designed to have symmetric
   variant mappings, this could lead to combinatorial explosion if not
   handled carefully.

   The symmetric form of a null variant is expressed as follows:

       <char cp="">
           <var cp="200C"/>
       </char>

   A "char" element with an empty "cp" attribute MUST specify at least
   one variant mapping, or the results are undefined.  It is strongly
   RECOMMENDED to use a disposition of "invalid" or equivalent when
   defining variant mappings from null sequences, so that variant
   mappings from null sequences are removed in variant label generation.

5.2.3.  Dispositions

   Variants may be given dispositions.  These describe the policy state
   for a variant label that was generated using a particular variant.
   The dispositions are the same as described below in Section 7.

   A disposition may be of any non-empty value not starting with an
   underscore and not containing spaces.  Within these restrictions a
   disposition may have any value, but several conventional dispositions
   are predefined below in Section 7 to encourage common conventions in
   their application.  If these values can represent registry policy,
   they SHOULD be used.  (See also Section 7.6.)

   Usually, if a variant label contains any instance of one of the
   variants that are to be blocked, the label would be blocked, but if
   it contained only instances of variants to be allocated it could be
   allocated.  See the discussion about implied actions in Section 7.5.

   Because variants MUST be unique, it is not possible to define the
   same variant for the same "char" element with different dispositions
   (see however Section 5.2.6).

5.2.4.  The ref Attribute

   Reference information may optionally be specified by a "ref"
   attribute, consisting of a space delimited sequence of reference
   identifiers.  This facility is typically used to give source
   information for code points or variant relations.  This information
   is ignored when machine-processing an LGR.  Specifying a "ref"
   attribute on a range element is equivalent to specifying the same
   "ref" attribute on every single code point of the range.
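
   A minimal sketch of how such identifiers might be attached is shown
   below.  It assumes a "references" element that declares identifiers
   0, 1 and 2; the citations are placeholders, the code points are
   arbitrary, and the namespace declaration is left out because the
   namespace URI is still [TBD].

```xml
<lgr>
  <meta>
    <references>
      <!-- Placeholder citations; ids follow the zero-based
           recommendation in Section 4.3.8. -->
      <reference id="0">First source (placeholder)</reference>
      <reference id="1">Second source (placeholder)</reference>
      <reference id="2">Third source (placeholder)</reference>
    </references>
  </meta>
  <data>
    <!-- Two sources are cited for the code point itself and one
         for the variant relation; ids are space delimited and
         listed in ascending order. -->
    <char cp="0075" ref="0 2">
      <var cp="0076" ref="1"/>
    </char>
  </data>
</lgr>
```

   Since the information is ignored in machine processing, removing
   the "ref" attributes from this sketch would not change how a label
   is evaluated.
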
   All reference identifiers MUST be from the set declared in the
   "references" element (see Section 4.3.8).  It is RECOMMENDED that
   they be listed in ascending order.

   In addition to "char", "range" and "var" elements in the data
   section, a "ref" attribute may be present for literals ("char" inside
   a rule) as well as rules and class definitions, but not for named
   references to them.

5.2.5.  Variants with Reflexive Mapping

   At first sight there seems to be no call for adding variant mappings
   for which source and target code points are the same, that is, for
   which the mapping is reflexive, or, in other words, an identity
   mapping.  Yet such reflexive mappings occur frequently in LGRs that
   follow [RFC3743].

   Adding a "var" element allows both a disposition and a reference id
   to be specified for it.  While the reference id is not used in
   processing, the disposition value can be used to trigger actions.  In
   permuting the label to generate all possible variants, the
   disposition value associated with a reflexive variant mapping is
   applied to any of the permuted labels containing the original code
   point.

   In the following example, the code point U+3473 exists both as a
   variant of U+3447 and as a variant of itself (reflexive mapping).
   Assuming an original label of "3473 3447", the permuted variant "3473
   3473" would consist of the reflexive variant of 3473 followed by a
   variant of 3447.  Accordingly, the dispositions for both of the
   variant mappings used to generate that particular permutation would
   have the value "preferred" given the following definitions of variant
   mappings:

   Having established the disposition values in this way, a set of
   actions could be defined that return a disposition of "allocate" or
   "activate" for a label consisting exclusively of variants with
   disposition "preferred", for example.
   (For details on how to define actions based on variant dispositions
   see Section 7.)

   In general, using reflexive variant mappings in this manner makes it
   possible to calculate disposition values using a uniform approach for
   all labels, whether they consist of mapped variant code points,
   original code points, or a mixture of both.  In particular, the
   disposition values for two otherwise identical labels may differ
   based on which variant mappings were executed in order to generate
   each of them.  (For details on how to generate variants and evaluate
   dispositions, see Section 8.)

5.2.6.  Conditional Variants

   Fundamentally, variants are mappings between two sequences of code
   points.  However, in some instances for a variant relationship to
   exist, some context external to the code point sequence must be
   considered.  For example, a positional context may determine whether
   two code point sequences are variants of each other.

   An example is the Arabic code points, which can have different forms
   based on position, with some code points sharing forms, thus making
   them variants in the positions corresponding to those forms.  Such
   positional context cannot be derived solely from the code point by
   itself, as the code point would be the same for the various forms.

   To specify a conditional variant relationship the optional "when"
   attribute is used.  The variant relationship exists when the
   condition in the "when" attribute is satisfied.  A "not-when"
   attribute may be used for conditions that must not be satisfied.  The
   value of each "when" or "not-when" attribute is a parameterized
   context rule as described below in Section 6.4.
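
   The two attribute forms can be sketched side by side as follows.
   This is purely a syntax illustration: "example-context" is a
   hypothetical rule name that would have to be defined in the "rules"
   element, and the code points carry no policy meaning.

```xml
<data>
  <char cp="0075">
    <!-- The mapping exists only where the hypothetical rule
         "example-context" is satisfied. -->
    <var cp="0076" when="example-context"/>
  </char>
  <char cp="006F">
    <!-- The mapping exists only where the hypothetical rule
         "example-context" is NOT satisfied. -->
    <var cp="00F6" not-when="example-context"/>
  </char>
</data>
```

   Each "var" element in the sketch carries a single condition, either
   "when" or "not-when", never both.
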
Assuming the "rules" element contains suitably defined rules for "arabic-isolated" and "arabic-final", the following example shows how to mark ARABIC LETTER ALEF WITH WAVY HAMZA BELOW (U+0673) as a variant of ARABIC LETTER ALEF WITH HAMZA BELOW (U+0625), but only when it appears in its isolated or final forms:

Only a single "when" or "not-when" attribute can be applied to any "var" element; however, multiple "var" elements using the same mapping but different "when" or "not-when" attributes may be specified.

While Arabic is currently the only script known for which such conditional variants are defined, there are other scripts, such as Mongolian, which share the concept of positional forms. By requiring explicit definitions for these rules, this mechanism can easily handle any additional types of conditional variants that are required.

As described in Section 5.1, a "when" or "not-when" attribute may also be specified on any "char" element in the data section to define required or prohibited contextual conditions under which a code point is valid.

5.2.7.  The comment Attribute

Any "char", "range" or "variant" element may contain a "comment" attribute. The contents of a comment attribute are free-form plain text. Comments are ignored in machine processing of the table. Comment attributes may also be placed on certain elements in the "rules" section of the document, such as actions and literals ("char"), as well as definitions of classes and rules, but not on named references to them. Finally, in the metadata the "version" and "reference" elements may have comment attributes to match the syntax in [RFC3743].

5.3.  Code Point Tagging

Typically, LGRs are used to explicitly designate allowable code points, where any label that contains a code point not explicitly listed in the LGR is considered an ineligible label according to the ruleset.
For more complex registry rules, there may be a need to discern one or more subsets of code points. This can be accomplished by applying a "tag" attribute to char or range elements, thereby defining character classes (see Section 6.2.1) which can then be used in whole label evaluation rules (see Section 6.3.2). Tag attributes may be of any value, and multiple values are separated by spaces. Because code point sequences are not proper members of a set of code points, a "tag" attribute MUST NOT be present in a char element defining a code point sequence.

A simple example of tag use would be to label preferred code points (as in [RFC3743]) by adding "preferred" to the tag, and then using a rule such as shown in Section 6.3.1 to single out labels for allocation that consist entirely of such preferred code points. For a variety of reasons, actual tables use a different approach.

6.  Whole Label and Context Evaluation

6.1.  Basic Concepts

The code points in a label sometimes need to satisfy context-based rules, for example for the label to be considered valid, or to satisfy the context for a variant mapping (see the description of the "when" attribute in Section 6.4).

A Whole Label Evaluation rule (WLE) is applied to the whole label. It is used to validate both original labels and variant labels computed from them using a permutation over all applicable variant mappings.

A conditional context rule is a specialized form of WLE specific to the context around a single code point or code point sequence. For example, if a rule is referenced in the "when" attribute of a variant mapping, it is used to describe the conditional context under which the particular variant mapping is defined to exist.

Each rule is defined in a "rule" element.
A rule may contain the following as child elements:

o  literal code points or code point sequences;

o  character classes, which define sets of code points to be used for context comparisons;

o  nested rules; and

o  context operators, which define when character classes and literals may appear.

Collectively, these are called match operators and are listed in Section 6.3.2.

6.2.  Character Classes

Character classes are sets of characters that often share a particular property. While they function like sets in every way, even supporting the usual set operators, they are called character classes here in a nod to the use of that term in regular expression syntax. (This also avoids confusion with the term "character set" in the sense of character encoding.)

Character classes (or sets) can be specified in several ways:

1.  by defining the set via matching a tag in the code point data. All characters with the same tag attribute are part of the same class;

2.  by referencing one of the Unicode character properties defined in the Unicode Character Database [UAX42];

3.  by explicitly listing all the code points in the class; or

4.  by defining the class as a set combination of any number of other classes.

A character class has an optional "name" attribute, consisting of a single identifier not containing spaces. If it is omitted, the class is anonymous and exists only inside the rule or combined class where it is defined. A named character class is defined independently and can be referenced by name from within any rules or as part of other character class definitions.

...

An empty "class" element with a "byref" attribute is a reference to an existing named class. Such an element MUST NOT have either "comment" or "ref" attributes, as those may only be placed on a class definition. A "byref" and a "name" attribute MUST NOT occur in the same element.

6.2.1.  Tag-based Classes

The char element may contain a tag attribute that consists of one or more space-separated identifiers, for example:

This defines two tags for use with code point U+0061, the tag "letter" and the tag "lower". Implicitly, this defines two named character classes, the class "letter" and the class "lower", the first with 0061 and 4E00 as elements and the latter with 0061, but not 4E00, as an element. The document MUST NOT contain an explicitly named class definition of the same name as an implicitly named tag-derived class.

6.2.2.  Unicode Property-based Classes

A class is defined in terms of Unicode properties by giving the Unicode property alias and the property value or property value alias, separated by a colon.

The example above selects all code points for which the Unicode canonical combining class (ccc) value is 9. This value of the ccc is assigned to all code points that encode viramas. The string "ccc" is the short alias for the canonical combining class, as defined in the Unicode Character Database [UAX42].

Unicode properties may, in principle, change between versions of the Unicode Standard. However, the values assigned for a given version are fixed. If Unicode properties are used, a minimum Unicode version MUST be declared in the header. (Note that some Unicode properties are by definition stable across versions and do not change once assigned.)

6.2.3.  Explicitly Declared Classes

A class of code points may also be declared by listing the code points that are members of the class. This is useful when tagging cannot be used because code points are not listed individually as part of the eligible set of code points for the given LGR, for example because they only occur in code point sequences.
To define a class in terms of an explicit list of code points:

This defines a class named "abc" containing the code points for characters "a", "b" and "c". The ordering of the code points is not material, but it is RECOMMENDED to list them in ascending order.

Range operators may also be used to represent any series of consecutive code points. The same declaration can be made as follows:

Range and code point declarations can be freely intermixed.

A shorthand notation exists where code points are directly represented by space-separated hexadecimal values, and ranges are represented by a start and end value separated by a hyphen. The element:

0061 0062-0063

would be a more streamlined expression of the same class using the shorthand notation.

A class element either contains any combination of char and range elements and no other elements, or a text node with the shorthand notation.

6.2.4.  Combined Classes

Classes may be combined using operators for set complement, union, intersection, difference and symmetric difference (exclusive-or). Because classes fundamentally function like sets, the union of several character classes is itself a class, for example.
+-------------------+---------------------------------------------+
| Logical Operation | Example                                     |
+-------------------+---------------------------------------------+
| Complement        |                                             |
+-------------------+---------------------------------------------+
| Union             |                                             |
+-------------------+---------------------------------------------+
| Intersection      |                                             |
+-------------------+---------------------------------------------+
| Difference        |                                             |
+-------------------+---------------------------------------------+
| Symmetric         |                                             |
| Difference        |                                             |
+-------------------+---------------------------------------------+

The elements from this table may be arbitrarily nested inside each other, subject to the following restriction: a "complement" element MUST contain precisely one "class" or one of the operator elements, while an "intersection", "symmetric-difference" or "difference" element MUST contain precisely two, and a "union" element MUST contain two or more of these elements.

An anonymous combined class can be defined directly inside a rule or any of the match operator elements that allow child elements (see Section 6.3.2) by using the set combination as the outer element.

The example shows the definition of an anonymous combined class that represents the union of classes "xxx" and "yyy". There is no need to wrap this union inside another class element, and, in fact, set combination elements MUST NOT be nested inside a "class" element.

Lastly, to create a named combined class that can be referenced in other classes or in rules, add a "name" attribute to the set combination element and place it at the top level below the "rules" element.

. . .
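
Because classes behave like sets, the five operations can be illustrated with Python's built-in set type. This is a minimal sketch, not part of the format: `parse_shorthand`, the class contents, and the choice of `REPERTOIRE` as the universe against which the complement is taken are all assumptions made for illustration.

```python
def parse_shorthand(text):
    """Parse shorthand such as "0061 0062-0063" into a set of code points."""
    members = set()
    for token in text.split():
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first
        members.update(range(first, last + 1))
    return members


# Assumed universe for the complement (hypothetical repertoire a-z).
REPERTOIRE = parse_shorthand("0061-007A")

# Two hypothetical classes standing in for "xxx" and "yyy".
xxx = parse_shorthand("0061 0062-0063")
yyy = parse_shorthand("0063-0065")

complement = REPERTOIRE - xxx            # exactly one operand
union = xxx | yyy                        # two or more operands
intersection = xxx & yyy                 # exactly two operands
difference = xxx - yyy                   # exactly two operands
symmetric_difference = xxx ^ yyy         # exactly two operands

print(sorted(f"{cp:04X}" for cp in symmetric_difference))
```

The comments note the number of operands each operation takes, mirroring the restriction on child elements stated above.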
Because (as for sets) a combination of classes is itself a class, no matter how a class is created, a reference to it always uses the "class" element. That is, a named class is always referenced via an empty "class" element using the "byref" attribute containing the name of the class to be referenced.

6.3.  Whole Label and Context Rules

Each rule consists of a series of matching operators that must be satisfied in order to determine whether a label meets a given condition. Rules may reference other rules or character classes defined elsewhere in the table.

6.3.1.  The rule Element

A matching rule is defined by a "rule" element, the child elements of which are one of the match operators from the table below. In evaluating a rule, each child element is matched in order. Rule elements may be nested.

Rules may optionally be named using a "name" attribute containing a single identifier string with no spaces. A named rule may be incorporated into another rule by reference. If the name attribute is omitted, the rule is anonymous and may not be incorporated by reference into another rule or referenced by an action or "when" attribute.

A simple rule to match a label where all characters are members of the class "preferred":

Rules are paired with explicit and implied actions, triggering these actions when a rule matches a label. For example, a simple explicit action for the rule shown above would be:

which has the effect of setting the policy disposition for a label made up entirely of "preferred" code points to "allocate". Explicit actions are further discussed in Section 7, and the use of rules in conditional contexts for implied actions is discussed in Section 5.2.6 and Section 7.5.

6.3.2.  The Match Operators

The child elements of a rule are a series of match operators, which are listed here by type and name and with a basic example or two.
+------------+-------------+------------------------------------+
| Type       | Operator    | Examples                           |
+------------+-------------+------------------------------------+
| logical    | any         |                                    |
|            +-------------+------------------------------------+
|            | choice      |                                    |
+------------+-------------+------------------------------------+
| location   | start       |                                    |
|            +-------------+------------------------------------+
|            | end         |                                    |
+------------+-------------+------------------------------------+
| literal    | char        |                                    |
+------------+-------------+------------------------------------+
| set        | class       | 0061 0064-0065                     |
+------------+-------------+------------------------------------+
| group      | rule        |                                    |
+------------+-------------+------------------------------------+
| contextual | anchor      |                                    |
|            +-------------+------------------------------------+
|            | look-ahead  |                                    |
|            +-------------+------------------------------------+
|            | look-behind |                                    |
+------------+-------------+------------------------------------+

The set operator may be any expression defining an anonymous class, including any of the set combination operators (see Section 6.2.4), in addition to a reference to a named class.

All match operators shown as empty elements in the Examples column of the table above do not support child elements of their own; otherwise match operators may be nested. In particular, anonymous rule elements can be used for grouping.

6.3.3.  The count Attribute

The count attribute specifies the minimum required or maximum permitted number of times a match operator is used to match input. If the count attribute is

n or n:n  the match operator matches the input exactly n times, where n is 1 or greater.

n+  the match operator matches the input at least n times, where n is 0 or greater.
n:m  the match operator matches the input at least n times, where n is 0 or greater, but matches the input up to m times in total, where m > n.

missing  the match operator matches the input exactly once.

In matching, greedy evaluation is used in the sense defined for regular expressions: beyond the required number of times, the input is matched as many times as possible, but not so often as to prevent a match of the remainder of the rule.

The count attribute MUST NOT be applied to match operators of type "start", "end", "anchor", "look-ahead" and "look-behind". It may be applied to "class" and "rule" elements only if they do not have a "name" attribute, that is, to anonymous rules and classes or any invocation of predefined rules or classes by reference.

6.3.4.  The name and byref Attributes

Rules (and classes) may be named using a "name" attribute and can then be nested inside other match operators only by reference.

To reference a named rule (or class), use a rule or class element with the "byref" attribute containing the name of the referenced element. It is an error to reference a rule or class for which the definition has not been seen, or that is not an implicitly defined tag-based class. A rule or class element with a "byref" attribute does not have child elements, nor any "ref" or "comment" attributes.

Here's an example of a rule requiring that all labels be letters (optionally followed by combining marks) and possibly digits. The example shows rules and classes referenced by name.

6.3.5.  The choice Element

For cases where several alternates could be chosen, the "choice" element can encode a list of choices:

Each child element of a "choice" represents one alternative. The first matching alternative determines the match for the choice element.
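
Since Section 6.3.3 defines matching as greedy "in the sense defined for regular expressions", one way to picture the count attribute is as a regular expression quantifier. The following Python sketch is an illustration only; the function name `count_to_quantifier` and the mapping onto Python's `re` syntax are assumptions, not something the LGR format prescribes.

```python
import re


def count_to_quantifier(count=None):
    """Translate a count attribute value into a greedy regex quantifier."""
    if count is None:
        return ""                          # missing: match exactly once
    if count.endswith("+"):
        return "{%d,}" % int(count[:-1])   # n+  : at least n times
    if ":" in count:
        low, high = (int(part) for part in count.split(":"))
        if high < low:
            raise ValueError("upper bound below lower bound: %r" % count)
        return "{%d,%d}" % (low, high)     # n:m : between n and m times
    return "{%d}" % int(count)             # n   : exactly n times


# A letter class repeated "1+" times followed by an optional digit ("0:1").
pattern = "[a-z]" + count_to_quantifier("1+") + "[0-9]" + count_to_quantifier("0:1")
print(pattern, bool(re.fullmatch(pattern, "abc7")))
```

Python's default quantifiers are greedy and backtrack when the remainder of the pattern would otherwise fail, which corresponds to the behavior described for the count attribute.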
To express a choice where one alternative consists of a sequence of elements, they can be wrapped in an anonymous rule.

6.3.6.  Literal Code Point Sequences

A literal code point sequence matches a single code point or a sequence. It is defined by a "char" element, with the code point or sequence to be matched given by the "cp" attribute. When used as a literal, a "char" element may contain a "count" in addition to the "cp" attribute, comments or references, but no conditional contexts or child elements.

6.3.7.  The any Element

The "any" element matches any single code point. It may have a "count" attribute. For an example, see Section 6.3.9.

The "any" element may have neither a "comment" nor a "ref" attribute.

6.3.8.  The start and end Elements

To match the beginning or end of a label, use the "start" or "end" element.

Whole Label Evaluation Rules in principle always apply to the entire label, but in practice, many rules do not need to cover the entire label. For example, to express a requirement of not starting a label with a digit, the rule needs to describe only the initial part of a label.

Start and end elements do not have a "count" or any other attribute.

6.3.9.  Example rule from IDNA2008

This section shows an example of the whole label evaluation rule from [RFC5892] forbidding the mixture of the Arabic-Indic and extended Arabic-Indic digits in the same label.

The preceding example also demonstrates several instances of the use of anonymous rules for grouping.

6.4.  Parameterized Context or When Rules

A special type of rule provides a context for evaluating the validity of a code point or variant mapping. This rule is invoked by the "when" attribute described in Section 5.2.6.
An action implied by a context rule always has a disposition of "invalid" whenever the rule is not matched (see Section 7.5). Conversely, a "not-when" attribute results in a disposition of "invalid" whenever the rule is matched.

6.4.1.  The anchor Element

Such parameterized context or "When Rules" may contain a special placeholder represented by an "anchor" element. As each When Rule is evaluated, the "anchor" element is replaced by a literal corresponding to the "cp" attribute of the element containing the "when" (or "not-when") attribute. The match to the "anchor" element must be at the same position in the label as the code point or variant mapping triggering the When Rule.

For example, the Greek lower numeral sign is invalid if not immediately preceding a character in the Greek script. This is most naturally addressed with a When Rule using look-ahead:

...

In evaluating this rule, the "anchor" element is treated as if it were replaced by a literal, but only the instance of U+0375 at the given position is evaluated. If a label had two instances of U+0375 with the first one matching the rule and the second not, then evaluating the When Rule MUST succeed for the first and fail for the second instance.

Unlike other rules, When Rules containing an "anchor" element MUST only be invoked via the "when" or "not-when" attributes on code points or variants; otherwise their "anchor" elements cannot be evaluated. However, it is possible to invoke rules not containing an "anchor" element from a "when" or "not-when" attribute. (See Section 6.4.3.)

6.4.2.  The look-behind and look-ahead Elements

Context rules use the "look-behind" and "look-ahead" elements to define context before and after the code point sequence matched by the "anchor" element. If the "anchor" element is omitted, neither the "look-behind" nor the "look-ahead" element may be present.
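
The position-specific evaluation of an anchored When Rule can be sketched in Python for the Greek lower numeral sign (U+0375) case from Section 6.4.1. This is an illustration under stated assumptions: the function names are hypothetical, and `is_greek` approximates the Unicode Script property by checking the character name prefix, which the Python standard library exposes, rather than consulting the actual Script data.

```python
import unicodedata


def is_greek(ch):
    # Approximation only: the character name prefix stands in for the
    # Unicode Script property, which the standard library does not expose.
    return unicodedata.name(ch, "").startswith("GREEK")


def followed_by_greek(label, pos):
    """Evaluate the look-ahead context for the anchor at index pos only."""
    return pos + 1 < len(label) and is_greek(label[pos + 1])


# A label with two instances of U+0375: the first precedes a Greek
# letter, the second is at the end of the label.
label = "\u0375\u03b1\u0375"
results = [
    followed_by_greek(label, pos)
    for pos, ch in enumerate(label)
    if ch == "\u0375"
]
print(results)
```

Each instance of the anchored code point is checked at its own position, so the rule can succeed for one instance and fail for another within the same label.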
Here is an example of a rule that defines an "initial" context for an Arabic code point:

A When Rule contains any combination of "look-behind", "anchor" and "look-ahead" elements, in that order. Each of these elements occurs at most once, except if nested inside a "choice" element in such a way that in matching each alternative at most one occurrence of each is encountered. Otherwise, the result is undefined. None of these elements takes a "count" attribute. If a context rule contains a look-ahead or look-behind element, it MUST contain an "anchor" element.

6.4.3.  Omitting the anchor Element

If the "anchor" element is omitted, the evaluation of the context rule is not tied to the position of the code point or sequence associated with the "when" attribute.

For example, the Katakana middle dot is invalid in any label not containing at least one Japanese character anywhere in the label. Because this requirement is independent of the position of the middle dot, the rule does not require an "anchor" element.

The Katakana middle dot is used only with Han, Katakana or Hiragana. The corresponding When Rule requires that at least one code point in the label is in one of these scripts. (Note that the Katakana middle dot itself is of script Common.)

7.  The action Element

The purpose of a rule is to trigger a specific action. Often, the action simply results in blocking or invalidating a label that does not match a rule. An example of an action invalidating a label because it does not match a rule named "leading-letter" is as follows:

If an action is to be triggered on matching a rule, a "match" attribute is used instead.
Actions are evaluated in the order that they appear in the XML file. Once an action is triggered by a label, the disposition defined in the "disp" attribute is assigned to the label and no other actions are evaluated for that label.

7.1.  The match and not-match Attributes

A "match" or "not-match" attribute specifies a rule that must be matched or not matched as a condition for triggering an action. Only a single rule may be named as the value of a "match" or "not-match" attribute. Because rules may be composed of other rules, this restriction to a single attribute value does not impose any limitation on the contexts that can trigger an action.

An action may contain a "match" or a "not-match" attribute, but not both. An action without any attributes is triggered by all labels unconditionally. For a very simple LGR, the following action would allocate all labels that match the repertoire:

Since rules are evaluated for all labels, whether they are the original label or computed by permuting the defined and valid variant mappings for the label's code points, actions based on matching or not matching a rule may be triggered for both original and variant labels, but the rules are not affected by the disposition attributes of the variant mappings. Triggering any actions based on these dispositions requires the use of additional optional attributes for actions, described next.

7.2.  Actions matching Variant Dispositions

7.2.1.  Variant Disposition triggers

An action may contain one of the optional attributes "any-variant", "all-variants" or "only-variants" defining triggers based on variant dispositions. The permitted value for these attributes consists of one or more variant disposition values, separated by space.
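
The ordered, first-trigger-wins evaluation described at the start of Section 7, together with the "match" and "not-match" triggers of Section 7.1, can be sketched in Python as follows. This is a simplified illustration: actions are modeled as dictionaries, rules as plain predicates, and the body of the "leading-letter" rule is an assumption; variant disposition triggers are deliberately left out.

```python
def evaluate(label, actions, rules):
    """Return the "disp" value of the first action triggered by label."""
    for action in actions:                      # document order = precedence
        if "match" in action:
            triggered = rules[action["match"]](label)
        elif "not-match" in action:
            triggered = not rules[action["not-match"]](label)
        else:
            triggered = True                    # no attributes: catch-all
        if triggered:
            return action["disp"]               # later actions are skipped
    return None


# Hypothetical rule body for the "leading-letter" rule named in the text.
rules = {"leading-letter": lambda label: label[:1].isalpha()}

actions = [
    {"disp": "invalid", "not-match": "leading-letter"},
    {"disp": "allocate"},
]

print(evaluate("1abc", actions, rules), evaluate("abc", actions, rules))
```

Reordering the two actions in this sketch would make the catch-all trigger first for every label, which shows why the order of appearance defines precedence.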
When a variant label is generated, these disposition values are compared to the disposition values on the variant mappings used to generate the particular variant label. Any single match may trigger an action that contains an "any-variant" attribute, while for an "all-variants" or "only-variants" attribute, the dispositions for all variant code points must match one or several of the dispositions specified in the attribute value to trigger the action. An "only-variants" attribute will trigger the action only if the variant label contains no original code points other than those with a reflexive mapping (see Section 5.2.5).

One of these variant disposition triggers may be used by itself or in conjunction with an attribute matching or not matching a rule. If variant triggers and rule-matching triggers are used together, the label MUST "match" or respectively "not-match" the specified rule, AND satisfy the conditions on the disposition values given by the "any-variant", "all-variants", or "only-variants" attribute.

7.2.2.  Example for RFC3743-style Tables

This section gives an example of using variant disposition triggers, combined with variants with reflexive mappings (Section 5.2.5), to achieve LGRs that implement tables like those defined according to [RFC3743], where the goal is to allow only variants that consist entirely of simplified or traditional variants, in addition to the original label.

Assume an LGR where all variants have been given suitable "disp" attributes of "block", "simplified", "traditional", or "both", similar to the one in Appendix B. Given such an LGR, the following example actions evaluate the disposition for the variant label:

The first action matches any variant label for which at least one of the code point variants carries the disposition "block". The second matches any variant label for which all of the code point variants carry the disposition "simplified" or "both", in other words an all-simplified label.
The third matches any label for which all variants carry the disposition "traditional" or "both", in other words an all-traditional label. These two actions are not triggered by any variant labels containing some original code points, unless the code point has a variant defined with a reflexive mapping (Section 5.2.5).

The final two actions rely on the fact that actions are evaluated in sequence, and that the first action triggered also defines the final disposition for a variant label (see Section 7.4). They further rely on the assumption that the only variants with disposition "both" are also identity variants.

Given these assumptions, any remaining simplified or traditional variants must then be part of a mixed label, and so are blocked; all labels surviving to the last action consist of original code points only (that is, the original label).

The assumption on identity mapping made above does not necessarily hold, so this scheme needs some refinements to cover tables where it is violated. For a more complete example, see Appendix B.

7.3.  Recommended Disposition Values

The precise nature of the policy action taken in response to a disposition and the name of the corresponding "disp" attributes are only partially defined here. It is strongly RECOMMENDED to use the following dispositions only with their conventional sense.

invalid  The resulting string is not a valid label. This disposition may be assigned implicitly, see Section 7.5. No variant labels should be generated from a variant mapping with this disposition.

block  The resulting string is a valid label, but should be blocked from registration. This would typically apply to a derived variant that is undesirable because it has no practical use or is confusingly similar to some other label.
allocate  The resulting string should be reserved for use by the same operator of the origin string, but not automatically allocated for use.

activate  The resulting string should be activated for use. (This is the typical default action if no dispositions are defined and is known as a "preferred" variant in [RFC3743].)

7.4.  Precedence

Actions are applied in the order of their appearance in the file. This defines their relative precedence. The first action triggered by a label defines the disposition for that label. To define a specific order of precedence, list the actions in the desired order.

The conventional order of precedence for the actions defined in Section 7.3 is "invalid", "block", "allocate", "activate". This default precedence is used for the default actions defined in Section 7.6.

7.5.  Implied Actions

The context rules on code points ("not-when" or "when" rules) carry an implied action with a disposition of "invalid" (not eligible). These rules are evaluated at the time the code points for a label or its variant labels are checked for validity (see Section 8). In other words, they are evaluated before any of the whole-label evaluation rules and with higher precedence.

The context rules for variant mappings are evaluated when variants are generated and/or when variant tables are made symmetric and transitive. They have an implied action with a disposition of "invalid" (undefined), which means a putative variant mapping does not exist whenever the given context matches a "not-when" rule or fails to match a "when" rule specified for that mapping.

Note that such a non-existing variant mapping is different from a blocked variant, which is a variant code point mapping that exists but results in a label that may not be allocated.

7.6.  Default Actions

As described in Section 7, any variant mapping may be given a "disp" attribute defining a disposition.
An action containing an "any-variant" or "all-variants" attribute relates these disposition values to a resulting disposition for the entire variant label.

If no actions are defined for the standard disposition values of "invalid", "block", "allocate" and "activate", then the following default actions exist, shown below in their default order of precedence (see Section 7.4). This default order for evaluating dispositions applies only to labels that triggered no explicitly defined actions and which are therefore handled by default actions. Default actions have a lower order of precedence than explicit actions (see Section 8.3).

The default actions for variant labels are defined as follows:

A final default action sets the disposition to "allocate" for any label matching the repertoire for which no other action has been triggered (catch-all).

8.  Processing a Label Against an LGR

8.1.  Determining Eligibility for a Label

In order to use a table to test a specific domain label for membership in the LGR, a consumer of the LGR must iterate through each code point within a given U-label, and test that each code point is a member of the LGR. If any code point is not a member of the LGR, the label shall be deemed not eligible in accordance with the table.

A code point is deemed a member of the table when it is listed with the "char" element, and all necessary conditions listed in "when" or "not-when" attributes are correctly satisfied.

A label must also not trigger any action that results in a disposition of "invalid" or equivalent; otherwise it is deemed not eligible.
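
The code point iteration described in Section 8.1 can be sketched in Python. This is a simplified illustration with hypothetical names: the repertoire is modeled as a dictionary from code point to an optional "when" predicate, the hyphen context is an invented example condition, and code point sequences, ranges, "not-when" conditions and action-based invalidation are left out.

```python
def is_eligible(label, repertoire):
    """Check that every code point is listed and its context is satisfied."""
    for pos, cp in enumerate(label):
        if cp not in repertoire:
            return False                 # code point not listed in the LGR
        when = repertoire[cp]
        if when is not None and not when(label, pos):
            return False                 # "when" condition not satisfied
    return True


def not_at_edge(label, pos):
    # Hypothetical context: valid only between two other code points.
    return 0 < pos < len(label) - 1


# Hypothetical repertoire: "a" and "b" unconditionally, "-" with a context.
repertoire = {0x61: None, 0x62: None, 0x2D: not_at_edge}

print(is_eligible([0x61, 0x2D, 0x62], repertoire))
```

A full implementation would additionally evaluate the actions of Section 7 before declaring a label eligible.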
   (The check for actions that result in "invalid" may be deferred
   until dispositions are determined.)

   For LGRs that contain reflexive variant mappings (defined in
   Section 5.2.5), the evaluation of dispositions must be deferred
   until variants are generated.  In essence, tables that use this
   feature treat the original as the (identity) variant of itself.  For
   such tables, the ordinary iteration over code points can at best be
   used to exclude a subset of invalid labels, effectively a
   pre-screening.

8.2.  Determining Variants for a Label

   For a given eligible label, the set of variant labels is deemed to
   consist of each possible permutation of original code points and
   "var" elements, whereby all "when" and "not-when" attributes are
   correctly satisfied for each code point or "var" element in the
   given permutation and all applicable whole-label evaluation rules
   are satisfied, as follows:

   o  Create each possible permutation of a label by substituting each
      code point or code point sequence in turn by any defined variant
      mapping (including any reflexive mappings).

   o  Apply variant mappings with "when" or "not-when" attributes only
      if the conditions are satisfied.

   o  Record each of the "disp" values on the variant mappings used in
      creating a given variant label; for any unmapped code point,
      record the "disp" value of any variant with a reflexive mapping
      (see Section 5.2.5).

   o  Determine the disposition for each variant label per Section 8.3.

   o  If the disposition is "invalid", remove the label from the set.

   o  If final evaluation of the disposition for the original label per
      Section 8.3 results in a disposition of "invalid" or equivalent,
      remove all associated variant labels from the set.

8.3.  Determining a Disposition for a Label or Variant Label

   For a given label (variant or original), its disposition is
   determined by evaluating, in order of their appearance, all actions
   for which the label or variant label satisfies the conditions.

   o  For any label, the disposition is given by the value of the
      "disp" attribute for the first action triggered by the label.  An
      action is triggered if

      *  the label matches or doesn't match the whole-label evaluation
         rule given in the "match" or "not-match" attribute,
         respectively, for that action;

      *  any or all of the recorded variant dispositions for a variant
         label match the dispositions specified in an "any-variant",
         "all-variants", or "only-variants" attribute, respectively,
         for that action, and in case of "only-variants" the label
         contains only code points that are the target of applied
         variant mappings;

      *  the label matches or doesn't match the whole-label evaluation
         rule given in the "match" or "not-match" attribute,
         respectively, for that action, and any or all of the recorded
         variant dispositions for a variant label match the
         dispositions specified in an "any-variant", "all-variants", or
         "only-variants" attribute, respectively, for that action, and
         in case of "only-variants" the label contains only code points
         that are the target of applied variant mappings; or

      *  the action does not contain any "match", "not-match",
         "any-variant" or "all-variants" attributes (catch-all).

   o  For any remaining variant label, assign the variant label the
      disposition using the default actions defined in Section 7.6.
      For this step, variant dispositions outside the predefined
      recommended set (see Section 7.3) are ignored.

   o  For any remaining label, set the disposition to "allocate".

9.  Conversion to and from Other Formats

   Both [RFC3743] and [RFC4290] provide different grammars for IDN
   tables.
   These formats are unable to fully cater for the increased
   requirements of contemporary IDN variant policies.

   This specification is a superset of functionality provided by these
   IDN table formats, thus any table expressed in those formats can be
   expressed in this format.  Automated conversion can be conducted
   between tables conformant with the grammar specified in each
   document.

   For notes on how to translate an RFC 3743-style table, see
   Appendix B.

10.  IANA Considerations

   This document does not specify any IANA actions.

11.  Security Considerations

   There are no security considerations for this memo.

12.  References

   [ASIA-TABLE]
              DotAsia Organisation, ".ASIA ZH IDN Language Table".

   [LGR-PROCEDURE]
              Internet Corporation for Assigned Names and Numbers,
              "Procedure to Develop and Maintain the Label Generation
              Rules for the Root Zone in Respect of IDNA Labels".

   [RFC3339]  Klyne, G., Ed. and C. Newman, "Date and Time on the
              Internet: Timestamps", RFC 3339, July 2002.

   [RFC3743]  Konishi, K., Huang, K., Qian, H., and Y. Ko, "Joint
              Engineering Team (JET) Guidelines for Internationalized
              Domain Names (IDN) Registration and Administration for
              Chinese, Japanese, and Korean", RFC 3743, April 2004.

   [RFC4290]  Klensin, J., "Suggested Practices for Registration of
              Internationalized Domain Names (IDN)", RFC 4290,
              December 2005.

   [RFC5564]  El-Sherbiny, A., Farah, M., Oueichek, I., and A. Al-Zoman,
              "Linguistic Guidelines for the Use of the Arabic Language
              in Internet Domains", RFC 5564, February 2010.

   [RFC5646]  Phillips, A. and M. Davis, "Tags for Identifying
              Languages", BCP 47, RFC 5646, September 2009.
   [RFC5892]  Faltstrom, P., "The Unicode Code Points and
              Internationalized Domain Names for Applications (IDNA)",
              RFC 5892, August 2010.

   [TDIL-HINDI]
              Technology Development for Indian Languages (TDIL)
              Programme, "Devanagari Script Behaviour for Hindi".

   [UAX42]    Unicode Consortium, "Unicode Character Database in XML".

   [XML]      World Wide Web Consortium, "Extensible Markup Language
              (XML) 1.0".

Appendix A.  Example Table

   The following presents a sample XML LGR showing a nearly complete
   collection of the elements and attributes defined in this
   specification in a somewhat typical context.

   1 2010-01-01 sv example Swedish examples institute. ]]> The Unicode
   Standard 6.3 RFC 5892 Big-5: Computer Chinese Glyph and Character
   Code Mapping Table, Technical Report C-26, 1984

   006E 0070-0078

Appendix B.  How to Translate RFC 3743 based Tables into the XML Format

   As background, the [RFC3743] rules work as follows:

   1.  The Original (requested) label is checked to make sure that all
       the code points are a subset of the repertoire.

   2.  If it passes the check, the Original label is allocatable.

   3.  Generate the all-simplified and all-traditional variant labels
       (union of all the labels generated using all the simplified
       variants of the code points) for allocation.

   To illustrate by example, here is one of the more complicated sets
   of variants:

      U+4E7E
      U+4E81
      U+5E72
      U+5E79
      U+69A6
      U+6F27

   The following shows the relevant section of the Chinese language
   table published by the .ASIA registry [ASIA-TABLE].
   Its entries read:

      ;;;

   These are the lines corresponding to the set of variants listed
   above:

      U+4E7E;U+4E7E,U+5E72;U+4E7E;U+4E81,U+5E72,U+6F27,U+5E79,U+69A6
      U+4E81;U+5E72;U+4E7E;U+5E72,U+6F27,U+5E79,U+69A6
      U+5E72;U+5E72;U+5E72,U+4E7E,U+5E79;U+4E7E,U+4E81,U+69A6,U+6F27
      U+5E79;U+5E72;U+5E79;U+69A6,U+4E7E,U+4E81,U+6F27
      U+69A6;U+5E72;U+69A6;U+5E79,U+4E7E,U+4E81,U+6F27
      U+6F27;U+4E7E;U+6F27;U+4E81,U+5E72,U+5E79,U+69A6

   The corresponding data section in the XML format would look like
   this:

   Here the simplified variants have been given a disposition of
   "simp", the traditional variants one of "trad", and all other ones
   are given "block".

   Note that some variant mappings map to themselves (identity); that
   is, the mapping is reflexive (see Section 5.2.5).  In creating the
   permutation of all variant labels, these mappings have no effect,
   other than adding a value to the variant disposition list for the
   variant label containing them.

   Because some variant mappings appear in more than one column, while
   the XML format allows only a single disposition value, they have
   been given the disposition of "both".  In the example so far, all of
   these are also mappings where source and target are identical, that
   is, reflexive mappings as defined in Section 5.2.5.

   Given a label "U+4E7E U+4E81", the following labels would be ruled
   allocatable under [RFC3743] based on how it is commonly implemented
   in domain registries:

      Original label:     U+4E7E U+4E81
      Simplified label 1: U+4E7E U+5E72
      Simplified label 2: U+5E72 U+5E72
      Traditional label:  U+4E7E U+4E7E

   However, if we generated allocatable labels without regard to the
   simplified-to-traditional variants, we would end up with an extra
   allocatable label: "U+5E72 U+4E7E".
   That label is composed of a Simplified Chinese (SC) character and a
   Traditional Chinese (TC) character and shouldn't be allocatable, but
   it would be the result of a straight permutation of all variants
   with a disposition other than disp="block".

   Resolving the dispositions more fully requires several actions to be
   defined, as described in Section 7.2.2.  After blocking all labels
   that contain a variant with disposition "block", these actions will
   first allocate all labels that consist entirely of variants
   (including variants with reflexive mappings) that are "simp" or
   "both", then do likewise for labels that are entirely "trad" or
   "both".  All surviving labels containing any one of the dispositions
   "simp" or "trad" are now known to be part of an undesirable mixed
   simplified/traditional label and are blocked.  Finally, the
   remaining labels must consist of code points without variants or
   with reflexive variants of type "both", in other words, the original
   label.

   In the example above, variants with the disposition "both" occur
   only as part of identity mappings (as pointed out in the comments).
   The scheme described so far relies on the assumption that this is
   always the case.  However, consider the following set of variants:

      U+62E0;U+636E;U+636E;U+64DA
      U+636E;U+636E;U+64DA;U+62E0
      U+64DA;U+636E;U+64DA;U+62E0

   for which the corresponding XML would be:

   What is needed to make such variant sets work is a way to capture
   when a disposition is associated with an identity or reflexive
   mapping, and when it is associated with an ordinary variant mapping.

   This can be done by adding a prefix "i-" in front of the disposition
   whenever the mapping is an identity mapping; for example, the last
   "trad" in the preceding figure would become "i-trad".
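   The re-tagging step can be illustrated with a short Python sketch
   that derives the variant mappings from the three table rows quoted
   above.  Several things here are assumptions made for illustration
   only: the column order (code point; simplified variants; traditional
   variants; other variants) is inferred from the surrounding
   discussion, and the names "ROWS", "Mapping" and "convert_row" as
   well as the tuple layout are hypothetical.  It is not the conversion
   procedure of any particular registry.

```python
from typing import List, Tuple

# The three table rows quoted above.  Assumed column order:
# code point; simplified variants; traditional variants; other variants.
ROWS = [
    "U+62E0;U+636E;U+636E;U+64DA",
    "U+636E;U+636E;U+64DA;U+62E0",
    "U+64DA;U+636E;U+64DA;U+62E0",
]

Mapping = Tuple[str, str, str]  # (source, target, disposition)


def convert_row(row: str) -> List[Mapping]:
    """Turn one row into mappings, prefixing identity mappings with i-."""
    fields = row.split(";")
    source = fields[0]
    simp, trad, other = (
        set(field.split(",")) if field else set() for field in fields[1:4]
    )
    mappings: List[Mapping] = []
    for target in sorted(simp | trad | other):
        if target in simp and target in trad:
            disp = "both"   # appears in both preferred columns
        elif target in simp:
            disp = "simp"
        elif target in trad:
            disp = "trad"
        else:
            disp = "block"  # listed only among the other variants
        if target == source:
            disp = "i-" + disp  # identity (reflexive) mapping
        mappings.append((source, target, disp))
    return mappings
```

   For the last row, the mapping of U+64DA to itself receives "i-trad",
   whereas the mapping from U+62E0 to U+636E in the first row keeps the
   plain disposition "both" because its source and target differ.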
   With all the dispositions prepared in this way, only a slight
   modification to the actions is needed to yield the correct set of
   allocatable labels:

   The first three actions get triggered by the same labels as before.

   The fourth action blocks any label that combines an original code
   point with any of the variant mappings, yet lets through all labels
   that are a combination of only original code points (everything
   having either no variant mapping or one of the identity mappings).
   These are the original labels, and they are allocated in the last
   action.

   With this modification, all RFC 3743-style tables can be converted
   to XML and, by using the above set of actions, the result will be
   that all variant labels consisting completely of variants preferred
   for simplified or traditional, respectively, will be allocated, as
   will be the original label.  All other variant labels will be
   blocked.

Appendix C.  Indic Syllable Structure Example

   In LGRs for Indic scripts it may be desirable to restrict valid
   labels to sequences of valid Indic syllables, or aksharas.  This
   appendix gives a sample set of rules designed to enforce this
   restriction.

   We start with the following BNF form for an akshara, which has been
   published in "Devanagari Script Behaviour for Hindi" [TDIL-HINDI]
   and which, if not directly valid for other languages and scripts
   used in India, is at least similar to equivalent definitions used
   for them.
      V[m]|{C[N]H}C[N](H|[v][m])

   Where:

      V    (upper case) is any independent vowel
      m    is any vowel modifier (Devanagari Anusvara, Visarga, and
           Candrabindu)
      C    is any consonant (with inherent vowel)
      N    is Nukta
      H    is a Halant (or Virama)
      v    (lower case) is any dependent vowel sign (matra)
      {}   encloses items which may be repeated one or more times
      [ ]  encloses items which may or may not be present
      |    separates items, out of which only one can be present

   By using the Unicode property "InSC" or "Indic_Syllable_Category",
   which corresponds rather directly to the classification of
   characters in the BNF above, we can directly translate the BNF into
   a set of whole-label evaluation (WLE) rules matching the definition
   of an akshara.

   With the rules and classes as defined above, the final action
   assigns a disposition of "invalid" to all labels that are not
   composed of a sequence of well-formed aksharas, optionally
   interspersed with other characters, perhaps digits, for example.

   The relevant Unicode property is, as of this writing, still
   considered provisional; however, it could be replicated by tagging
   repertoire values directly in the LGR, which would remove the
   dependency on the Unicode Standard altogether.

Appendix D.  RelaxNG Schema

      [0-9A-F]{4,6}
      [0-9A-F]{4,6}( [0-9A-F]{4,6})+
      \d{4}-\d\d-\d\d
      \d+(\+|:\d+)?
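   The four lexical patterns above can be exercised with a few lines of
   Python.  The labels attached to the patterns in this sketch
   ("code-point", "code-point-sequence", "date", "count") are
   assumptions inferred from the shape of each pattern, and the helper
   name "matches" is hypothetical.  Each pattern is applied to the
   whole string, which mirrors how pattern facets of XML Schema
   datatypes are matched.

```python
import re

# Patterns copied from the schema above; the dictionary keys are
# illustrative labels only and are not taken from the schema itself.
PATTERNS = {
    "code-point": r"[0-9A-F]{4,6}",
    "code-point-sequence": r"[0-9A-F]{4,6}( [0-9A-F]{4,6})+",
    "date": r"\d{4}-\d\d-\d\d",
    "count": r"\d+(\+|:\d+)?",
}


def matches(kind: str, value: str) -> bool:
    """Return True if the entire value matches the pattern for kind."""
    return re.fullmatch(PATTERNS[kind], value) is not None
```

   For instance, matches("code-point", "006E") holds, whereas the
   lower-case and unpadded "6e" is rejected, and a single code point
   does not satisfy the sequence pattern.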
Appendix E.  Acknowledgements

   This format builds upon the work on documenting IDN tables by many
   different registry operators.  Notably, a comprehensive language
   table for Chinese, Japanese and Korean was developed by the "Joint
   Engineering Team" [RFC3743] that is the basis of many registry
   policies; and a set of guidelines for Arabic script registrations
   [RFC5564] was published by the Arabic-language community.

   Contributions that have shaped this document have been provided by
   Francisco Arias, Mark Davis, Nicholas Ostler, Thomas Roessler, Steve
   Sheng, Michel Suignard, Andrew Sullivan, Wil Tan and John Yunker.
Appendix F.  Editorial Notes

   This appendix is to be removed prior to final publication.

F.1.  Known Issues and Future Work

   o  A method of specifying the origin URI for a table, and an
      expiration or refresh policy, as metadata may be a useful way to
      declare how the table will be updated.

   o  The "domain" element should be specified as absolute, so that the
      Root can be identified as needed for the Root Zone LGR.

   o  The recommended names for dispositions ("block" and "allocate")
      deviate from the names in the Root Zone LGR Procedure ("blocked"
      and "allocatable").  The latter were chosen to highlight that the
      machine processing of the LGR table is just the first step;
      actual allocation requires additional actions, hence
      "allocatable".  This should be resolved.

F.2.  Change History

   -00  Initial draft.

   -01  Add an XML Namespace, and fix other XML nits.  Add support for
      sequences of code points.  Improve on consistently using Unicode
      nomenclature.

   -02  Add support for validity periods.

   -03  Incorporate requirements from the Label Generation Ruleset
      Procedure for the DNS Root Zone.  These requirements include a
      detailed grammar for specifying whole-label variants, and the
      ability to explicitly declare the actions associated with a
      specific variant.  The document also consistently applies the
      term "Label Generation Ruleset", rather than "IDN table", to
      reflect the policy term now being used to describe these.

   -04  Support reference information per [RFC3743].  Update
      description in response to feedback.  Extend the context rules to
      "char" elements and allow for inverse matching ("not-when").
      Extend the description of label processing and implied actions,
      and allow for actions that reference disposition attributes on
      any or all variant mappings used in the generation of a variant
      label.
   -05  Change the name of the "disposition" attribute to "disp".  Add
      comment attribute on version and reference elements.  Allow empty
      "cp" attributes in char elements to support expressing symmetric
      mapping of null variants.  Describe use of variants that map
      identically.  Clarify how actions are triggered, in particular
      based on variant dispositions, as well as description of default
      actions.  Revise description of processing a label and its
      variants.  Move example table to the head of the appendices.  Add
      "only-variants" attribute.  Change "name" attribute to "byref"
      attribute for referencing named classes and rules.  Change "not"
      to "complement".  Remove "match" attribute on rules as redundant
      if "start" and "end" are supported.  Rename "match" element to
      "anchor" as better fitting its function and removing confusion
      with both the "match" attribute on actions as well as the generic
      term Match Operator.  Augment the examples relevant to [RFC3743].

   -06  Extend the discussion of reflexive variants and their use;
      includes update of the appendix on converting tables in the style
      of [RFC3743].  Improve description of tagging and clarify that it
      doesn't apply to sequences.  Specify that root zone uses ".".
      Add an appendix with an Indic Syllable Structure example.  Extend
      count attribute to allow maximal counts.

Authors' Addresses

   Kim Davies
   Internet Corporation for Assigned Names and Numbers
   12025 Waterfront Drive
   Los Angeles, CA  90094
   US

   Phone: +1 310 301 5800
   Email: kim.davies@icann.org
   URI:   http://www.icann.org/

   Asmus Freytag
   ASMUS Inc.

   Email: asmus@unicode.org