Improve character name escapes

author Paul Eggert <eggert@cs.ucla.edu>

Fri, 22 Apr 2016 02:26:34 +0000 (19:26 -0700)

committer Paul Eggert <eggert@cs.ucla.edu>

Fri, 22 Apr 2016 02:29:41 +0000 (19:29 -0700)
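
The validity check that the lread.c change applies to `\N{U+X}` escapes can be approximated in standalone C. This is only an illustrative sketch, not the Emacs implementation: `parse_u_plus` is a hypothetical helper, `strtol` stands in for Emacs's internal `string_to_number`, and the sketch rejects signs by requiring hex digits only, whereas the patch hands the leading `+` to `string_to_number`. The limits it checks (no surrogates, nothing above U+10FFFF) are the ones stated in the diff.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Largest Unicode code point, matching the U+10FFFF limit in the docs.  */
#define MAX_UNICODE_CHAR 0x10FFFF

/* Return true if C is a surrogate (same range test as the patch).  */
static bool
char_surrogate_p (int c)
{
  return 0xD800 <= c && c <= 0xDFFF;
}

/* Hypothetical helper: parse NAME of the form "U+XXXX" and return the
   scalar value, or -1 if NAME is malformed, out of range, or a
   surrogate.  */
static long
parse_u_plus (const char *name)
{
  assert (name != NULL);
  if (! (name[0] == 'U' && name[1] == '+'))
    return -1;

  /* Require one or more hex digits and nothing else, so that inputs
     such as "U+", "U+-0" and "U+0x41" are rejected.  */
  const char *digits = name + 2;
  if (digits[0] == '\0')
    return -1;
  for (const char *p = digits; *p != '\0'; p++)
    if (! isxdigit ((unsigned char) *p))
      return -1;

  /* On overflow strtol returns LONG_MAX, which fails the range test.  */
  long code = strtol (digits, NULL, 16);
  if (code < 0 || code > MAX_UNICODE_CHAR || char_surrogate_p ((int) code))
    return -1;
  return code;
}
```

Under these assumptions, `parse_u_plus ("U+A817")` yields 0xA817, while `"U+D800"`, `"U+-0"` and `"0.5"` are all rejected, which matches the cases the new tests exercise.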
diff --git a/doc/lispref/nonascii.texi b/doc/lispref/nonascii.texi

index 66ad9aca71e55abce0dc121b1db4546d850a00bd..0e4aa86e48b60b52166cff60bbac7da2ad376df2 100644 (file)
--- a/doc/lispref/nonascii.texi
+++ b/doc/lispref/nonascii.texi
@@ -622,18 +622,21 @@ This function returns the value of @var{char}'s @var{propname} property.
       @result{} Nd
  @end group
  @group
-;; U+2084 SUBSCRIPT FOUR
-(get-char-code-property ?\u2084 'digit-value)
+;; U+2084
+(get-char-code-property ?\N@{SUBSCRIPT FOUR@}
+                        'digit-value)
       @result{} 4
  @end group
  @group
-;; U+2155 VULGAR FRACTION ONE FIFTH
-(get-char-code-property ?\u2155 'numeric-value)
+;; U+2155
+(get-char-code-property ?\N@{VULGAR FRACTION ONE FIFTH@}
+                        'numeric-value)
       @result{} 0.2
  @end group
  @group
-;; U+2163 ROMAN NUMERAL FOUR
-(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@} 'numeric-value)
+;; U+2163
+(get-char-code-property ?\N@{ROMAN NUMERAL FOUR@}
+                        'numeric-value)
       @result{} 4
  @end group
  @group
diff --git a/doc/lispref/objects.texi b/doc/lispref/objects.texi

index 96b334d2b81abd4616ec18c9f70040f6b17cedb8..54894b8e24e52b0a5f5e81a7379870d1e9dfa25a 100644 (file)
--- a/doc/lispref/objects.texi
+++ b/doc/lispref/objects.texi
@@ -353,25 +353,32 @@ following text.)
  control characters, Emacs provides several types of escape syntax that
  you can use to specify non-@acronym{ASCII} text characters.
  
+@enumerate
+@item
  @cindex @samp{\} in character constant
  @cindex backslash in character constants
  @cindex unicode character escape
-  Firstly, you can specify characters by their Unicode values.
-@code{?\u@var{nnnn}} represents a character with Unicode code point
-@samp{U+@var{nnnn}}, where @var{nnnn} is (by convention) a hexadecimal
-number with exactly four digits.  The backslash indicates that the
-subsequent characters form an escape sequence, and the @samp{u}
-specifies a Unicode escape sequence.
-
-  There is a slightly different syntax for specifying Unicode
-characters with code points higher than @code{U+@var{ffff}}:
-@code{?\U00@var{nnnnnn}} represents the character with code point
-@samp{U+@var{nnnnnn}}, where @var{nnnnnn} is a six-digit hexadecimal
-number.  The Unicode Standard only defines code points up to
-@samp{U+@var{10ffff}}, so if you specify a code point higher than
-that, Emacs signals an error.
-
-  Secondly, you can specify characters by their hexadecimal character
+You can specify characters by their Unicode names, if any.
+@code{?\N@{@var{NAME}@}} represents the Unicode character named
+@var{NAME}.  Thus, @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}} is
+equivalent to @code{?à} and denotes the Unicode character U+00E0.  To
+simplify entering multi-line strings, you can replace spaces in the
+names by non-empty sequences of whitespace (e.g., newlines).
+
+@item
+You can specify characters by their Unicode values.
+@code{?\N@{U+@var{X}@}} represents a character with Unicode code point
+@var{X}, where @var{X} is a hexadecimal number.  Also,
+@code{?\u@var{xxxx}} and @code{?\U@var{xxxxxxxx}} represent code
+points @var{xxxx} and @var{xxxxxxxx}, respectively, where each @var{x}
+is a single hexadecimal digit.  For example, @code{?\N@{U+E0@}},
+@code{?\u00e0} and @code{?\U000000E0} are all equivalent to @code{?à}
+and to @samp{?\N@{LATIN SMALL LETTER A WITH GRAVE@}}.  The Unicode
+Standard defines code points only up to @samp{U+@var{10ffff}}, so if
+you specify a code point higher than that, Emacs signals an error.
+
+@item
+You can specify characters by their hexadecimal character
  codes.  A hexadecimal escape sequence consists of a backslash,
  @samp{x}, and the hexadecimal character code.  Thus, @samp{?\x41} is
  the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
@@ -379,23 +386,16 @@ the character @kbd{A}, @samp{?\x1} is the character @kbd{C-a}, and
  You can use any number of hex digits, so you can represent any
  character code in this way.
  
+@item
  @cindex octal character code
-  Thirdly, you can specify characters by their character code in
+You can specify characters by their character code in
  octal.  An octal escape sequence consists of a backslash followed by
  up to three octal digits; thus, @samp{?\101} for the character
  @kbd{A}, @samp{?\001} for the character @kbd{C-a}, and @code{?\002}
  for the character @kbd{C-b}.  Only characters up to octal code 777 can
  be specified this way.
  
-  Fourthly, you can specify characters by their name.  A character
-name escape sequence consists of a backslash, @samp{N@{}, the Unicode
-character name, and @samp{@}}.  Alternatively, you can also put the
-numeric code point value between the braces, using the syntax
-@samp{\N@{U+nnnn@}}, where @samp{nnnn} denotes between one and eight
-hexadecimal digits.  Thus, @samp{?\N@{LATIN CAPITAL LETTER A@}} and
-@samp{?\N@{U+41@}} both denote the character @kbd{A}.  To simplify
-entering multi-line strings, you can replace spaces in the character
-names by arbitrary non-empty sequence of whitespace (e.g., newlines).
+@end enumerate
  
    These escape sequences may also be used in strings.  @xref{Non-ASCII
  in Strings}.
diff --git a/src/character.h b/src/character.h

index bc3e15578440cbf408be33b2b9052541b7722527..586f330fba9fe3dfbe5c47db9715924a9fe1d482 100644 (file)
--- a/src/character.h
+++ b/src/character.h
@@ -612,14 +612,13 @@ sanitize_char_width (EMACS_INT width)
     : (c) <= 0xE01EF ? (c) - 0xE0100 + 17       \
     : 0)
  
-/* If C is a high surrogate, return 1.  If C is a low surrogate,
-   return 2.  Otherwise, return 0.  */
+/* Return true if C is a surrogate.  */
  
-#define CHAR_SURROGATE_PAIR_P(c)       \
-  ((c) < 0xD800 ? 0                    \
-   : (c) <= 0xDBFF ? 1                 \
-   : (c) <= 0xDFFF ? 2                 \
-   : 0)
+INLINE bool
+char_surrogate_p (int c)
+{
+  return 0xD800 <= c && c <= 0xDFFF;
+}
  
  /* Data type for Unicode general category.
  
diff --git a/src/lread.c b/src/lread.c

index c3b6bd79e42f08e3f193971dee49386adac52485..a42c1f60c9555a2b0c7327ec0d9606af57e9f573 100644 (file)
--- a/src/lread.c
+++ b/src/lread.c
@@ -44,7 +44,6 @@ along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.  */
  #include "termhooks.h"
  #include "blockinput.h"
  #include <c-ctype.h>
-#include <string.h>
  
  #ifdef MSDOS
  #include "msdos.h"
@@ -2151,88 +2150,42 @@ grow_read_buffer (void)
                          MAX_MULTIBYTE_LENGTH, -1, 1);
  }
  
-/* Signal an invalid-read-syntax error indicating that the character
-   name in an \N{…} literal is invalid.  */
-static _Noreturn void
-invalid_character_name (Lisp_Object name)
-{
-  AUTO_STRING (format, "\\N{%s}");
-  xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, name));
-}
-
-/* Check that CODE is a valid Unicode scalar value, and return its
-   value.  CODE should be parsed from the character name given by
-   NAME.  NAME is used for error messages.  */
+/* Return the scalar value that has the Unicode character name NAME.
+   Raise 'invalid-read-syntax' if there is no such character.  */
  static int
-check_scalar_value (Lisp_Object code, Lisp_Object name)
+character_name_to_code (char const *name, ptrdiff_t name_len)
  {
-  if (! NUMBERP (code))
-    invalid_character_name (name);
-  EMACS_INT i = XINT (code);
-  if (! (0 <= i && i <= MAX_UNICODE_CHAR)
-      /* Don't allow surrogates.  */
-      || (0xD800 <= code && code <= 0xDFFF))
-    invalid_character_name (name);
-  return i;
-}
+  Lisp_Object code;
  
-/* If NAME starts with PREFIX, interpret the rest as a hexadecimal
-   number and return its value.  Raise invalid-read-syntax if the
-   number is not a valid scalar value.  Return −1 if NAME doesn’t
-   start with PREFIX.  */
-static int
-parse_code_after_prefix (Lisp_Object name, const char *prefix)
-{
-  ptrdiff_t name_len = SBYTES (name);
-  ptrdiff_t prefix_len = strlen (prefix);
-  /* Allow between one and eight hexadecimal digits after the
-     prefix.  */
-  if (prefix_len < name_len && name_len <= prefix_len + 8
-      && memcmp (SDATA (name), prefix, prefix_len) == 0)
+  /* Code point as U+XXXX....  */
+  if (name[0] == 'U' && name[1] == '+')
      {
-      Lisp_Object code = string_to_number (SDATA (name) + prefix_len, 16, false);
-      if (NUMBERP (code))
-        return check_scalar_value (code, name);
+      /* Pass the leading '+' to string_to_number, so that it
+        rejects monstrosities such as negative values.  */
+      code = string_to_number (name + 1, 16, false);
+    }
+  else
+    {
+      /* Look up the name in the table returned by 'ucs-names'.  */
+      AUTO_STRING_WITH_LEN (namestr, name, name_len);
+      Lisp_Object names = call0 (Qucs_names);
+      code = CDR (Fassoc (namestr, names));
      }
-  return -1;
-}
  
-/* Returns the scalar value that has the Unicode character name NAME.
-   Raises `invalid-read-syntax' if there is no such character.  */
-static int
-character_name_to_code (Lisp_Object name)
-{
-  /* Code point as U+N, where N is between 1 and 8 hexadecimal
-     digits.  */
-  int code = parse_code_after_prefix (name, "U+");
-  if (code >= 0)
-    return code;
-
-  /* CJK ideographs are not contained in the association list returned
-     by `ucs-names'.  But they follow a predictable naming pattern: a
-     fixed prefix plus the hexadecimal codepoint value.  */
-  code = parse_code_after_prefix (name, "CJK IDEOGRAPH-");
-  if (code >= 0)
+  if (! (INTEGERP (code)
+        && 0 <= XINT (code) && XINT (code) <= MAX_UNICODE_CHAR
+        && ! char_surrogate_p (XINT (code))))
      {
-      /* Various ranges of CJK characters; see UnicodeData.txt.  */
-      if ((0x3400 <= code && code <= 0x4DB5)
-          || (0x4E00 <= code && code <= 0x9FD5)
-          || (0x20000 <= code && code <= 0x2A6D6)
-          || (0x2A700 <= code && code <= 0x2B734)
-          || (0x2B740 <= code && code <= 0x2B81D)
-          || (0x2B820 <= code && code <= 0x2CEA1))
-        return code;
-      else
-        invalid_character_name (name);
+      AUTO_STRING (format, "\\N{%s}");
+      AUTO_STRING_WITH_LEN (namestr, name, name_len);
+      xsignal1 (Qinvalid_read_syntax, CALLN (Fformat, format, namestr));
      }
  
-  /* Look up the name in the table returned by `ucs-names'.  */
-  Lisp_Object names = call0 (Qucs_names);
-  return check_scalar_value (CDR (Fassoc (name, names)), name);
+  return XINT (code);
  }
  
  /* Bound on the length of a Unicode character name.  As of
-   Unicode 9.0.0 the maximum is 83, so this should be safe. */
+   Unicode 9.0.0 the maximum is 83, so this should be safe.  */
  enum { UNICODE_CHARACTER_NAME_LENGTH_BOUND = 200 };
  
  /* Read a \-escape sequence, assuming we already read the `\'.
@@ -2458,14 +2411,14 @@ read_escape (Lisp_Object readcharfun, bool stringp)
                end_of_file_error ();
              if (c == '}')
                break;
-            if (! c_isascii (c))
+            if (! (0 < c && c < 0x80))
                {
                  AUTO_STRING (format,
-                             "Non-ASCII character U+%04X in character name");
+                             "Invalid character U+%04X in character name");
                  xsignal1 (Qinvalid_read_syntax,
                            CALLN (Fformat, format, make_natnum (c)));
                }
-            /* We treat multiple adjacent whitespace characters as a
+            /* Treat multiple adjacent whitespace characters as a
                 single space character.  This makes it easier to use
                 character names in e.g. multi-line strings.  */
              if (c_isspace (c))
@@ -2483,7 +2436,8 @@ read_escape (Lisp_Object readcharfun, bool stringp)
            }
          if (length == 0)
            invalid_syntax ("Empty character name");
-        return character_name_to_code (make_unibyte_string (name, length));
+       name[length] = '\0';
+       return character_name_to_code (name, length);
        }
  
      default:
diff --git a/test/src/lread-tests.el b/test/src/lread-tests.el

index ff5d0f655f3ffe4d538568229b6da24b08922d89..2ebaf491120ff12aa744cade636dd7d27065f941 100644 (file)
--- a/test/src/lread-tests.el
+++ b/test/src/lread-tests.el
@@ -1,6 +1,6 @@
  ;;; lread-tests.el --- tests for lread.c -*- lexical-binding: t; -*-
  
-;; Copyright (C) 2016  Google Inc.
+;; Copyright (C) 2016 Free Software Foundation, Inc.
  
  ;; Author: Philipp Stephani <phst@google.com>
  
@@ -26,11 +26,10 @@
  ;;; Code:
  
  (ert-deftest lread-char-number ()
-  (should (equal ?\N{U+A817} #xA817)))
+  (should (equal (read "?\\N{U+A817}") #xA817)))
  
  (ert-deftest lread-char-name ()
-  (should (equal ?\N{SYLOTI  NAGRI LETTER
-                 DHO}
+  (should (equal (read "?\\N{SYLOTI  NAGRI LETTER \n DHO}")
                   #xA817)))
  
  (ert-deftest lread-char-invalid-number ()
@@ -46,16 +45,23 @@
  (ert-deftest lread-char-empty-name ()
    (should-error (read "?\\N{}") :type 'invalid-read-syntax))
  
-(ert-deftest lread-char-cjk-name ()
-  (should (equal ?\N{CJK IDEOGRAPH-2B734} #x2B734)))
-
-(ert-deftest lread-char-invalid-cjk-name ()
-  (should-error (read "?\\N{CJK IDEOGRAPH-2B735}") :type 'invalid-read-syntax))
-
-(ert-deftest lread-string-char-number ()
-  (should (equal "a\N{U+A817}b" "a\uA817b")))
+(ert-deftest lread-char-surrogate-1 ()
+  (should-error (read "?\\N{U+D800}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-2 ()
+  (should-error (read "?\\N{U+D801}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-3 ()
+  (should-error (read "?\\N{U+Dffe}") :type 'invalid-read-syntax))
+(ert-deftest lread-char-surrogate-4 ()
+  (should-error (read "?\\N{U+DFFF}") :type 'invalid-read-syntax))
+
+(ert-deftest lread-string-char-number-1 ()
+  (should (equal (read "a\\N{U+A817}b") "a\uA817bx")))
+(ert-deftest lread-string-char-number-2 ()
+  (should-error (read "?\\N{0.5}") :type 'invalid-read-syntax))
+(ert-deftest lread-string-char-number-3 ()
+  (should-error (read "?\\N{U+-0}") :type 'invalid-read-syntax))
  
  (ert-deftest lread-string-char-name ()
-  (should (equal "a\N{SYLOTI NAGRI  LETTER DHO}b" "a\uA817b")))
+  (should (equal (read "a\\N{SYLOTI NAGRI  LETTER DHO}b") "a\uA817b")))
  
  ;;; lread-tests.el ends here
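
The comment in read_escape about treating adjacent whitespace as a single space can be illustrated with a small standalone routine. This is a sketch under stated assumptions rather than the reader's actual loop: `collapse_whitespace` is a hypothetical name, it works on a NUL-terminated string instead of a character stream, and it uses the locale-dependent `isspace` where Emacs uses `c_isspace`.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: copy SRC into DST (capacity DST_SIZE bytes),
   replacing each run of whitespace with one space.  Return the
   length written, or -1 if DST is too small.  */
static ptrdiff_t
collapse_whitespace (const char *src, char *dst, size_t dst_size)
{
  assert (src != NULL && dst != NULL);
  if (dst_size == 0)
    return -1;

  size_t length = 0;
  bool in_space = false;
  for (; *src != '\0'; src++)
    {
      unsigned char c = (unsigned char) *src;
      if (isspace (c))
        {
          /* Skip every whitespace character after the first in a run.  */
          if (in_space)
            continue;
          in_space = true;
          c = ' ';
        }
      else
        in_space = false;

      /* Keep one byte free for the terminating NUL.  */
      if (length + 1 >= dst_size)
        return -1;
      dst[length++] = (char) c;
    }
  dst[length] = '\0';
  return (ptrdiff_t) length;
}
```

With this sketch, the multi-line name used in the lread-char-name test, `SYLOTI  NAGRI LETTER \n DHO`, collapses to `SYLOTI NAGRI LETTER DHO`, which is the form that would then be looked up by name.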