2018-01-12 07:12:21

by Gabriel Krisman Bertazi

Subject: [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4

Hi,

In the past few months, I've been working on support for case-insensitive
lookups of utf8-encoded strings, primarily for EXT4, and later for other
filesystems. This RFC builds on the excellent UTF-8 normalization
implementation done at SGI in 2014 by Olaf Weber and Ben Myers, which
unfortunately never went upstream. That SGI effort went through three
versions of an RFC submitted to this list, and the last version is
archived here:

https://www.spinics.net/lists/xfs/msg30069.html

For normalization support, I basically rebased those patches and
addressed the issues that were raised on the list at the time. I also
implemented a test module that exercises the exported functions in
kernel space, so we can catch regressions early. Obviously, more tests
are needed, particularly for Hangul algorithmic decomposition.

Like Ben's original submission, I excluded the commit that includes the
generated header file and the Unicode data files, because they are too
big and would bounce on the list. Instead, instructions for fetching
and generating the files are documented in the commit message.

An important difference from the original SGI patches is that I have
introduced a midlayer API between the low-level normalization code and
the filesystem code. The goal is to hide implementation details behind
a simpler interface of strncmp()/strcasecmp()-like functions, plus a
more specific casefold() operation that implements the behavior defined
by the Unicode spec. This reduces the filesystem changes to a minimum.
As a quick example, the filesystem code can load a struct charset,
selected via the encoding mount parameter or superblock information,
and then call the helpers charset_strncmp() or charset_strncasecmp()
when matching names, as in the sketch below.
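
A rough sketch of the filesystem side, just to illustrate the flow (the
example_* names are made up; the actual ext4 wiring is in the later
patches of this series):

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/charsets.h>

/* Loaded once, e.g. while parsing the encoding= mount option. */
static const struct charset *example_charset;

static int example_set_encoding(char *encoding_arg)
{
	example_charset = charset_load(encoding_arg);
	return example_charset ? 0 : -EINVAL;
}

/* Name matching during lookup, with or without 'ignorecase'. */
static bool example_match(const char *a, const char *b, int len, bool icase)
{
	if (icase)
		return !charset_strncasecmp(example_charset, a, b, len);
	return !charset_strncmp(example_charset, a, b, len);
}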

This implementation has an obvious overlap with the NLS code already in
the kernel. There are a few differences, though: it provides some
higher-level functions instead of toupper/tolower, which are not enough
for a full caseless comparison (a single character may case-fold to a
multi-character sequence), and it supports versioning of the encoding,
which is required to keep case-folding operations stable over time. If
the community feels these changes should be merged back into the NLS
code, I can work on that, but it would require some reworking of how
the NLS system is implemented.

The charsets code doesn't yet do any locking or refcounting of the
registered encoding modules. Since I expected to be asked to merge it
into NLS, I'd rather discuss that direction first than polish those
details in advance.

The ext4 case-insensitive lookup doesn't require any on-disk changes.
There is a performance hit for huge directories: if the lookup doesn't
match the exact case, we fall back to a linear search. That cost feels
acceptable for now.

Right now, with the RFC applied, you can mount an existing ext4
filesystem with:

mount -o encoding=utf8-7.0.0 /dev/sdaX /mnt

and then perform lookups of compatible sequences (NFKD); the filesystem
should complete the lookup successfully. If you add 'ignorecase' as a
mount option, casefolding will be performed and caseless matching of
compatible sequences should work as well.

Finally, thank you Olaf and Ben for your work on the normalization
patches. I am really looking forward to having your contributions
merged, so I'd love to hear people's thoughts and suggestions on what
is needed for upstream acceptance.

Gabriel Krisman Bertazi (9):
charsets: Introduce middle-layer for character encoding
charsets: ascii: Wrap ascii functions to charsets library
charsets: utf8: Hook-up utf-8 code to charsets library
charsets: utf8: Introduce test module for kernel UTF-8 implementation
ext4: Add ignorecase mount option
ext4: Include encoding information on the superblock
fscrypt: Introduce charset-based matching functions
ext4: Support charset name matching
ext4: Implement ext4 dcache hooks for custom charsets

Olaf Weber (4):
charsets: utf8: Add unicode character database files
scripts: add trie generator for UTF-8
charsets: utf8: Introduce code for UTF-8 normalization
charsets: utf8: reduce the size of utf8data[]

fs/ext4/dir.c | 63 +
fs/ext4/ext4.h | 6 +
fs/ext4/namei.c | 27 +-
fs/ext4/super.c | 35 +
include/linux/charsets.h | 73 +
include/linux/fscrypt.h | 1 +
include/linux/fscrypt_notsupp.h | 16 +
include/linux/fscrypt_supp.h | 27 +
include/linux/utf8norm.h | 116 ++
lib/Kconfig | 16 +
lib/Makefile | 2 +
lib/charsets/Makefile | 24 +
lib/charsets/ascii.c | 98 ++
lib/charsets/core.c | 68 +
lib/charsets/test_ucd.c | 186 +++
lib/charsets/ucd/README | 33 +
lib/charsets/utf8_core.c | 178 ++
lib/charsets/utf8norm.c | 794 +++++++++
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3464 +++++++++++++++++++++++++++++++++++++++
20 files changed, 5219 insertions(+), 9 deletions(-)
create mode 100644 include/linux/charsets.h
create mode 100644 include/linux/utf8norm.h
create mode 100644 lib/charsets/Makefile
create mode 100644 lib/charsets/ascii.c
create mode 100644 lib/charsets/core.c
create mode 100644 lib/charsets/test_ucd.c
create mode 100644 lib/charsets/ucd/README
create mode 100644 lib/charsets/utf8_core.c
create mode 100644 lib/charsets/utf8norm.c
create mode 100644 scripts/mkutf8data.c

--
2.15.1


2018-01-12 07:12:22

by Gabriel Krisman Bertazi

Subject: [PATCH RFC 01/13] charsets: Introduce middle-layer for character encoding

This implements an abstraction layer for high-level, encoding-aware
string manipulation. It defines the hooks that encoding modules must
implement, which filesystem code will use to support lookups that take
normalization and case-folding into account.
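
For illustration, a minimal sketch of a caller using the casefold hook
(the example_* name is made up). Judging by the ascii backend added in
a following patch, the hook allocates the folded string with GFP_NOFS
and returns its length, or -ENOMEM on failure, so the caller frees it:

#include <linux/charsets.h>
#include <linux/slab.h>
#include <linux/string.h>

/*
 * Caseless match against a target that is already casefolded.
 * Returns 1 on match, 0 on mismatch, or a negative error from the hook.
 */
static int example_caseless_match(const struct charset *cs,
				  const char *name, int len,
				  const char *folded_target, int tlen)
{
	char *folded;
	int flen, ret;

	flen = charset_casefold(cs, name, len, &folded);
	if (flen < 0)
		return flen;

	ret = (flen == tlen && !memcmp(folded, folded_target, flen));
	kfree(folded);
	return ret;
}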

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/charsets.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 2 ++
lib/Makefile | 2 ++
lib/charsets/Makefile | 3 ++
lib/charsets/core.c | 68 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 148 insertions(+)
create mode 100644 include/linux/charsets.h
create mode 100644 lib/charsets/Makefile
create mode 100644 lib/charsets/core.c

diff --git a/include/linux/charsets.h b/include/linux/charsets.h
new file mode 100644
index 000000000000..8465e93c9d9f
--- /dev/null
+++ b/include/linux/charsets.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _CHARSET_H
+#define _CHARSET_H
+
+struct charset_info;
+struct charset;
+
+struct charset_ops {
+ int (*strncmp)(const struct charset *charset, const char *str1,
+ const char *str2, int len);
+ int (*strncasecmp)(const struct charset *charset, const char *str1,
+ const char *str2, int len);
+ int (*casefold)(const struct charset *charset, const char *str,
+ int len, char **folded_str);
+ int (*normalize)(const struct charset *charset, const char *str,
+ int len, char **normalization);
+};
+
+struct charset {
+ const struct charset_info *info;
+ unsigned int version;
+ const struct charset_ops *ops;
+};
+
+struct charset_info {
+ char *name;
+ char *match_token;
+ struct charset* (*load_charset)(void *args);
+};
+
+static inline int charset_strncmp(const struct charset *charset,
+ const char *str1, const char *str2,
+ int len)
+{
+ return charset->ops->strncmp(charset, str1, str2, len);
+}
+
+static inline int charset_strncasecmp(const struct charset *charset,
+ const char *str1, const char *str2,
+ int len)
+{
+ return charset->ops->strncasecmp(charset, str1, str2, len);
+}
+
+static inline int charset_casefold(const struct charset *charset,
+ const char *str, int len, char **folded_str)
+{
+ return charset->ops->casefold(charset, str, len, folded_str);
+}
+
+static inline int charset_normalize(const struct charset *charset,
+ const char *str, int len,
+ char **normalization)
+{
+ return charset->ops->normalize(charset, str, len, normalization);
+}
+
+int charset_register(struct charset_info *charset);
+const struct charset *charset_load(char *charset);
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fbcb30b..bf5c751cfb8a 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -582,6 +582,8 @@ config PRIME_NUMBERS
config STRING_SELFTEST
tristate "Test string functions"

+config CHARSETS
+ tristate "Character encoding sets"
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/Makefile b/lib/Makefile
index d11c48ec8ffd..f6b2360fedfa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -258,3 +258,5 @@ obj-$(CONFIG_GENERIC_LSHRDI3) += lshrdi3.o
obj-$(CONFIG_GENERIC_MULDI3) += muldi3.o
obj-$(CONFIG_GENERIC_CMPDI2) += cmpdi2.o
obj-$(CONFIG_GENERIC_UCMPDI2) += ucmpdi2.o
+
+obj-$(CONFIG_CHARSETS) += charsets/
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
new file mode 100644
index 000000000000..01ff9fd09f98
--- /dev/null
+++ b/lib/charsets/Makefile
@@ -0,0 +1,3 @@
+charsets-y += core.o
+
+obj-$(CONFIG_CHARSETS) += charsets.o
diff --git a/lib/charsets/core.c b/lib/charsets/core.c
new file mode 100644
index 000000000000..fb7b297b978e
--- /dev/null
+++ b/lib/charsets/core.c
@@ -0,0 +1,68 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/charsets.h>
+#include <linux/parser.h>
+
+#define MAX_ENCODINGS 10
+
+static struct match_token encoding_tokens[MAX_ENCODINGS + 1];
+static struct charset_info *charsets[MAX_ENCODINGS];
+static int n_encodings;
+
+const struct charset *charset_load(char *charset)
+{
+ substring_t args[MAX_OPT_ARGS];
+ int token;
+
+ args[0].to = args[0].from = NULL;
+ token = match_token(charset, encoding_tokens, args);
+
+ if (!encoding_tokens[token].pattern)
+ return NULL;
+
+ return charsets[token]->load_charset(args);
+}
+
+int charset_register(struct charset_info *charset)
+{
+ encoding_tokens[n_encodings].token = n_encodings;
+ encoding_tokens[n_encodings].pattern = charset->match_token;
+
+ charsets[n_encodings] = charset;
+ n_encodings += 1;
+ return 0;
+}
+EXPORT_SYMBOL(charset_register);
+
+static int __init init_charset(void)
+{
+ encoding_tokens[0].token = 0;
+ encoding_tokens[0].pattern = NULL;
+ n_encodings = 0;
+ return 0;
+}
+
+static void __exit exit_charset(void)
+{
+}
+
+module_init(init_charset);
+module_exit(exit_charset);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("charset abstraction for filesystems");
+MODULE_LICENSE("GPL");
--
2.15.1

2018-01-12 07:13:21

by Gabriel Krisman Bertazi

Subject: [PATCH RFC 02/13] charsets: ascii: Wrap ascii functions to charsets library

This allows filesystems to always go through the charsets interface,
even when they don't care about the encoding.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/charsets/Makefile | 2 ++
lib/charsets/ascii.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)
create mode 100644 lib/charsets/ascii.c

diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 01ff9fd09f98..2184e7ff25de 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -1,3 +1,5 @@
charsets-y += core.o

obj-$(CONFIG_CHARSETS) += charsets.o
+
+obj-$(CONFIG_CHARSETS) += ascii.o
diff --git a/lib/charsets/ascii.c b/lib/charsets/ascii.c
new file mode 100644
index 000000000000..d45b7f01ebd4
--- /dev/null
+++ b/lib/charsets/ascii.c
@@ -0,0 +1,98 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/charsets.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/parser.h>
+
+static struct charset_info ascii_info;
+
+int ascii_strncmp(const struct charset *charset, const char *str1,
+ const char *str2, int len)
+{
+ return strncmp(str1, str2, len);
+}
+
+int ascii_strncasecmp(const struct charset *charset, const char *str1,
+ const char *str2, int len)
+{
+ return strncasecmp(str1, str2, len);
+}
+
+int ascii_normalize(const struct charset *charset, const char *str,
+ int len, char **normalization)
+{
+ *normalization = kstrdup(str, GFP_NOFS);
+ return (*normalization) ? len : -ENOMEM;
+}
+
+int ascii_casefold(const struct charset *charset, const char *str,
+ int len, char **folded_str)
+{
+ int i;
+ char *fold;
+
+ fold = kstrdup(str, GFP_NOFS);
+ if (!fold)
+ return -ENOMEM;
+
+ for (i = 0; i < len; i++)
+ fold[i] = tolower(fold[i]);
+
+ *folded_str = fold;
+ return len;
+}
+
+static const struct charset_ops ascii_ops = {
+ .strncmp = ascii_strncmp,
+ .strncasecmp = ascii_strncasecmp,
+ .casefold = ascii_casefold,
+ .normalize = ascii_normalize,
+};
+
+static struct charset ascii_charset = {
+ .version = 0,
+ .info = &ascii_info,
+ .ops = &ascii_ops
+};
+
+static struct charset *ascii_load_charset(void *pargs)
+{
+ return &ascii_charset;
+}
+
+static struct charset_info ascii_info = {
+ .name = "ascii",
+ .match_token = "ascii",
+ .load_charset = ascii_load_charset,
+};
+
+static int __init init_ascii(void)
+{
+ charset_register(&ascii_info);
+ return 0;
+}
+
+static void __exit exit_ascii(void)
+{
+}
+
+module_init(init_ascii);
+module_exit(exit_ascii);
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("ASCII charset for filesystems");
+MODULE_LICENSE("GPL");
+
--
2.15.1

2018-01-12 07:13:26

by Gabriel Krisman Bertazi

Subject: [PATCH RFC 03/13] charsets: utf8: Add unicode character database files

From: Olaf Weber <[email protected]>

Add files from the Unicode Character Database, version 7.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Move ucd directory to lib/charsets]
---
lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 lib/charsets/ucd/README

diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
new file mode 100644
index 000000000000..d713e663cdf9
--- /dev/null
+++ b/lib/charsets/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 7.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/7.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt
+ 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt
+ 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt
+ c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt
+ 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt
+ 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt
+ c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt
--
2.15.1

2018-01-12 07:13:33

by Gabriel Krisman Bertazi

Subject: [PATCH RFC 04/13] scripts: add trie generator for UTF-8

From: Olaf Weber <[email protected]>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 7.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebased to mainline]
[Fix out-of-tree-build]
[Fix checkpatch warnings]
[Merge back robustness fixes from original patch. Requested by Dave Chinner]
---
lib/Kconfig | 9 +
lib/charsets/Makefile | 14 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3239 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 3263 insertions(+)
create mode 100644 scripts/mkutf8data.c

diff --git a/lib/Kconfig b/lib/Kconfig
index bf5c751cfb8a..9ca3bda7b2be 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -584,6 +584,15 @@ config STRING_SELFTEST

config CHARSETS
tristate "Character encoding sets"
+
+#
+# utf8 normalization module
+#
+config UTF8_NORMALIZATION
+ tristate "UTF-8 normalization support"
+ depends on CHARSETS
+ help
+ Say Y here to enable utf8 normalization support.
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 2184e7ff25de..d486d9437f5c 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -3,3 +3,17 @@ charsets-y += core.o
obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o
+
+$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
+ $(call cmd,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+ cmd_mkutf8data = $(objtree)/scripts/mkutf8data \
+ -a $(srctree)/$(src)/ucd/DerivedAge-7.0.0.txt \
+ -c $(srctree)/$(src)/ucd/DerivedCombiningClass-7.0.0.txt \
+ -p $(srctree)/$(src)/ucd/DerivedCoreProperties-7.0.0.txt \
+ -d $(srctree)/$(src)/ucd/UnicodeData-7.0.0.txt \
+ -f $(srctree)/$(src)/ucd/CaseFolding-7.0.0.txt \
+ -n $(srctree)/$(src)/ucd/NormalizationCorrections-7.0.0.txt \
+ -t $(srctree)/$(src)/ucd/NormalizationTest-7.0.0.txt \
+ -o $@
+
diff --git a/scripts/Makefile b/scripts/Makefile
index 25ab143cbe14..f681601aa15c 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -19,6 +19,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler
hostprogs-$(CONFIG_MODULE_SIG) += sign-file
hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_UTF8_NORMALIZATION) += mkutf8data

HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
new file mode 100644
index 000000000000..700b41c0cb66
--- /dev/null
+++ b/scripts/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int
+utf8encode(char *str, unsigned int val)
+{
+ int len;
+
+ if (val < 0x80) {
+ str[0] = val;
+ len = 1;
+ } else if (val < 0x800) {
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_2_BITS;
+ len = 2;
+ } else if (val < 0x10000) {
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_3_BITS;
+ len = 3;
+ } else if (val < 0x110000) {
+ str[3] = val & UTF8_V_MASK;
+ str[3] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_4_BITS;
+ len = 4;
+ } else {
+ printf("%#x: illegal val\n", val);
+ len = 0;
+ }
+ return len;
+}
+
+static unsigned int
+utf8decode(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align to a cache line (or half a cache line?). */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfkdi;
+ unsigned int *utf32nfkdicf;
+ char *utf8nfkdi;
+ char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdicf && right->utf8nfkdicf &&
+ strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+ return 1;
+ if (left->utf8nfkdicf && right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdicf || right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdicf)
+ printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+ return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ if (leaf->utf8nfkdicf)
+ return 1;
+ return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdicf)
+ size += strlen(leaf->utf8nfkdicf) + 1;
+ else if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfkdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ data->utf8nfkdi = strdup(utf);
+ }
+ u = utf;
+ um = data->utf32nfkdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf))
+ data->utf8nfkdicf = strdup(utf);
+ }
+}
+
+static void
+utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfkdi and nfkdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfkdi_mark;
+ trees[trees_count-2].leaf_mark = nfkdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfkdicf";
+ trees[i].leaf_equal = nfkdicf_equal;
+ trees[i].leaf_print = nfkdicf_print;
+ trees[i].leaf_size = nfkdicf_size;
+ trees[i].leaf_index = nfkdicf_index;
+ trees[i].leaf_emit = nfkdicf_emit;
+
+ trees[i+1].type = "nfkdi";
+ trees[i+1].leaf_equal = nfkdi_equal;
+ trees[i+1].leaf_print = nfkdi_print;
+ trees[i+1].leaf_size = nfkdi_size;
+ trees[i+1].leaf_index = nfkdi_index;
+ trees[i+1].leaf_emit = nfkdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8encode(keyval, unichar);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void
+trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfkdi = utf8data + trees[trees_count-1].index;
+ nfkdicf = utf8data + trees[trees_count-2].index;
+
+ nfkdi_tree = &trees[trees_count-1];
+ nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfkdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8encode(key,unichar);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfkdi) {
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfkdicf &&
+ !data->utf8nfkdi) {
+ report++;
+ } else if (data->utf8nfkdicf) {
+ if (strcmp(data->utf8nfkdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfkdi);
+ if (leaf) {
+ printf(" gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void
+trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates an a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfkdi:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfkdicf:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFKD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 7.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void
+usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+ int i;
+
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdi);
+ printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfkdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
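+ * For example, U+AC01 has SIndex = 1, hence LIndex = 0, VIndex = 0 and
+ * TIndex = 1, so its full decomposition is <U+1100, U+1161, U+11A8>.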
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ assert(!unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+ }
+ /* Add this decomposition to nfkdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfkdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!tree)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ trie = utf8data + tree->index;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ return NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ return NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
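+ * For example, a lead byte of 0xE2 gives a length of 3.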
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8decode(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[strlen(s) + 1] = -1;
+ t = buf3;
+ if (utf8ncursor(&u8c, tree, s, strlen(s)))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void
+normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfkdi && !*data->utf8nfkdi)
+ ignorables = 1;
+ else
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfkdi_tree) < 0) {
+ printf("Line %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfkdi_init();
+ nfkdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfkdi_decompose();
+ nfkdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfkdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfkdi_tree);
+ if (verbose > 2)
+ tree_walk(nfkdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
--
2.15.1

2018-01-12 07:13:39

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 05/13] charsets: utf8: Introduce code for UTF-8 normalization

From: Olaf Weber <[email protected]>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

Compared with the original SGI patch, and for conformity with the kernel
coding standards, the utf8data_t typedef was dropped, since it was just
masking the struct keyword. In other cases, namely utf8leaf_t and
utf8trie_t, I decided to keep the typedefs, since they are simple
pointers to memory buffers, and using plain unsigned chars there
wouldn't convey any more meaningful information.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to Mainline]
[Fix up checkpatch.pl warnings]
[Drop typedefs]
[Merge with charsets subsystem]
---
include/linux/utf8norm.h | 112 +++++++++
lib/charsets/Makefile | 4 +
lib/charsets/utf8norm.c | 643 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 759 insertions(+)
create mode 100644 include/linux/utf8norm.h
create mode 100644 lib/charsets/utf8norm.c

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
new file mode 100644
index 000000000000..9c0c4898d169
--- /dev/null
+++ b/include/linux/utf8norm.h
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ, MIN, REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
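+
+/* For example, UNICODE_AGE(7, 0, 0) evaluates to 0x070000. */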
+
+/* Highest unicode version supported by the data tables. */
+extern int utf8version_is_supported(unsigned int sb_utf8version);
+
+/*
+ * Look for the correct const struct utf8data for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern const struct utf8data *utf8nfkdi(unsigned int maxage);
+extern const struct utf8data *utf8nfkdicf(unsigned int maxage);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(const struct utf8data *data, const char *s);
+extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(const struct utf8data *data, const char *s);
+extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(const struct utf8data *data, const char *s);
+extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ const struct utf8data *data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s);
+extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *u8c);
+
+#endif /* UTF8NORM_H */
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index d486d9437f5c..95389c4193b0 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -4,6 +4,10 @@ obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o

+utf8-y += utf8norm.o
+obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o
+
+$(obj)/utf8norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
$(call cmd,mkutf8data)
quiet_cmd_mkutf8data = MKUTF8DATA $@
diff --git a/lib/charsets/utf8norm.c b/lib/charsets/utf8norm.c
new file mode 100644
index 000000000000..6ca150ee268d
--- /dev/null
+++ b/lib/charsets/utf8norm.c
@@ -0,0 +1,643 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/utf8norm.h>
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int utf8version_is_supported(unsigned int sb_utf8version)
+{
+ int i = ARRAY_SIZE(utf8agetab) - 1;
+
+ while (i >= 0 && utf8agetab[i] != 0) {
+ if (sb_utf8version == utf8agetab[i])
+ return 1;
+ i--;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(utf8version_is_supported);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode no larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
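+ *
+ * For example (illustrative only), a node byte of 0x35 decodes as:
+ * BITNUM = 5, NEXTBYTE clear, OFFLEN = 3 (a branching node whose
+ * right-hand child is reached via a 3-byte offset), with LEFTNODE
+ * and RIGHTNODE clear, so both children are leaves.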
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie itself, and so
+ * share the same underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the begin or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
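+ *
+ * As an illustration (not an actual table entry), a decomposable
+ * character such as U+00E9 would be represented by a leaf of the
+ * form { gen, 255, 0x65, 0xCC, 0x81, 0x00 }: the generation number,
+ * the DECOMPOSE marker, and the NUL-terminated UTF-8 decomposition
+ * U+0065 U+0301.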
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
+ size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!data)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ trie = utf8data + data->offset;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ ccc = LEAF_CCC(leaf);
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if (ccc == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *utf8nfkdi(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdidata) - 1;
+
+ while (maxage < utf8nfkdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdidata[i].maxage)
+ return NULL;
+ return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *utf8nfkdicf(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdicfdata) - 1;
+
+ while (maxage < utf8nfkdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
+
+MODULE_AUTHOR("SGI");
+MODULE_DESCRIPTION("utf8 normalization");
+MODULE_LICENSE("GPL");
--
2.15.1
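
A minimal usage sketch of the cursor API exported above (utf8cursor()/utf8ncursor() plus utf8byte()): drain the cursor to obtain the normalized byte stream. The helper name normalize_into(), its buffer-size checks, and its return conventions are assumptions made for illustration; only the declarations from <linux/utf8norm.h> introduced by this patch are taken as given.

#include <linux/utf8norm.h>

/*
 * Sketch: write the normalization of the NUL-terminated string s into
 * buf (at most bufsize bytes, including the trailing NUL).  Returns the
 * normalized length, or -1 if s is not valid UTF-8 or buf is too small.
 */
static ssize_t normalize_into(const struct utf8data *data, const char *s,
			      char *buf, size_t bufsize)
{
	struct utf8cursor cur;
	ssize_t n = 0;
	int c;

	if (!bufsize || utf8cursor(&cur, data, s))
		return -1;

	while ((c = utf8byte(&cur)) > 0) {
		if (n + 1 >= bufsize)
			return -1;	/* no room for this byte plus the NUL */
		buf[n++] = c;
	}
	if (c < 0)
		return -1;		/* invalid UTF-8 in the input */
	buf[n] = '\0';
	return n;
}

utf8len(data, s) can be used beforehand to size buf exactly, since it reports the length of the normalization without emitting it.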

2018-01-12 07:12:28

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/charsets/Makefile | 2 +-
lib/charsets/utf8_core.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 179 insertions(+), 1 deletion(-)
create mode 100644 lib/charsets/utf8_core.c

diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 95389c4193b0..5e2fa7c20a47 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o

-utf8-y += utf8norm.o
+utf8-y += utf8_core.o utf8norm.o
obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o

$(obj)/utf8norm.o: $(obj)/utf8data.h
diff --git a/lib/charsets/utf8_core.c b/lib/charsets/utf8_core.c
new file mode 100644
index 000000000000..94427670e96e
--- /dev/null
+++ b/lib/charsets/utf8_core.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/charsets.h>
+#include <linux/utf8norm.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/string.h>
+
+static int utf8_strncmp(const struct charset *charset, const char *str1,
+ const char *str2, int len)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur1, cur2;
+ unsigned char c1, c2;
+ int r, i;
+
+ r = utf8cursor(&cur1, data, str1);
+ if (r < 0)
+ return -EIO;
+ r = utf8cursor(&cur2, data, str2);
+ if (r < 0)
+ return -EIO;
+
+ for (i = 0 ; i < len ; i++) {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (!c1 || !c2 || c1 != c2)
+ return 1;
+
+ }
+
+ return 0;
+}
+
+static int utf8_strncasecmp(const struct charset *charset, const char *str1,
+ const char *str2, int len)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur1, cur2;
+ unsigned char c1, c2;
+ int r, i;
+
+ r = utf8cursor(&cur1, data, str1);
+ if (r < 0)
+ return -EIO;
+
+ r = utf8cursor(&cur2, data, str2);
+ if (r < 0)
+ return -EIO;
+
+ for (i = 0 ; i < len ; i++) {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (!c1 || !c2 || c1 != c2)
+ return 1;
+ }
+
+ return 0;
+}
+
+int utf8_casefold(const struct charset *charset, const char *str, int len,
+ char **folded_str)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur;
+ int i;
+ char buffer[1024];
+
+ if (utf8cursor(&cur, data, str))
+ return -EIO;
+
+ for (i = 0; i < (1024-1); i++) {
+ buffer[i] = utf8byte(&cur);
+ if (!buffer[i])
+ break;
+ }
+ buffer[i] = '\0';
+ *folded_str = kstrdup(buffer, GFP_NOFS);
+ if (!*folded_str)
+ return -ENOMEM;
+
+ return i;
+}
+
+int utf8_normalize(const struct charset *charset, const char *str, int len,
+ char **normalization)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur;
+ int i;
+ char buffer[1024];
+
+ if (utf8cursor(&cur, data, str))
+ return -EIO;
+
+ for (i = 0; i < (1024-1); i++) {
+ buffer[i] = utf8byte(&cur);
+ if (!buffer[i])
+ break;
+ }
+ buffer[i] = '\0';
+ *normalization = kstrdup(buffer, GFP_NOFS);
+ if (!*normalization)
+ return -ENOMEM;
+
+ return i;
+}
+
+static const struct charset_ops utf8_ops = {
+ .strncmp = utf8_strncmp,
+ .strncasecmp = utf8_strncasecmp,
+ .casefold = utf8_casefold,
+ .normalize = utf8_normalize,
+};
+
+static struct charset *utf8_load_charset(void *pargs)
+{
+ int maj, min, rev;
+ unsigned int age;
+ struct charset *charset;
+ substring_t *args = pargs;
+
+ if (match_int(&args[0], &maj) || match_int(&args[1], &min) ||
+ match_int(&args[2], &rev))
+ return NULL;
+
+ age = UNICODE_AGE(maj, min, rev);
+
+ if (!utf8version_is_supported(age))
+ return NULL;
+
+ charset = kmalloc(sizeof(struct charset), GFP_KERNEL);
+ if (!charset)
+ return NULL;
+
+ charset->info = NULL;
+ charset->version = age;
+ charset->ops = &utf8_ops;
+
+ return charset;
+}
+
+static struct charset_info utf8_info = {
+ .name = "utf8",
+ .match_token = "utf8-%d.%d.%d",
+ .load_charset = utf8_load_charset,
+};
+
+static int __init init_utf8(void)
+{
+ charset_register(&utf8_info);
+ return 0;
+}
+
+static void __exit exit_utf8(void)
+{
+}
+
+module_init(init_utf8);
+module_exit(exit_utf8);
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("UTF-8 charset operations for filesystems");
+MODULE_LICENSE("GPL");
+
--
2.15.1
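
A sketch of how a consumer is expected to drive the ops wired up in this patch, using the charset_load()/charset_strncasecmp() wrappers that the ext4 and fscrypt patches later in the series call; the helper below and its error convention are hypothetical, and the wrapper prototypes are assumed from those call sites.

#include <linux/charsets.h>
#include <linux/errno.h>

/* Sketch: 0 means the casefolded normalizations of the two names match. */
static int names_match_ci(const char *name_a, const char *name_b, int len)
{
	/* "utf8-7.0.0" is parsed by utf8_load_charset() via "utf8-%d.%d.%d". */
	const struct charset *cs = charset_load("utf8-7.0.0");

	if (!cs)
		return -EINVAL;
	return charset_strncasecmp(cs, name_a, name_b, len);
}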

2018-01-12 07:13:44

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 06/13] charsets: utf8: reduce the size of utf8data[]

From: Olaf Weber <[email protected]>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition, the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix checkpatch errors]
[Extract robustness fixes and merge back to original mkutf8data.c
patch]
---
include/linux/utf8norm.h | 4 +
lib/charsets/utf8norm.c | 191 ++++++++++++++++++++++++++----
scripts/mkutf8data.c | 295 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 435 insertions(+), 55 deletions(-)

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
index 9c0c4898d169..2413b060f18d 100644
--- a/include/linux/utf8norm.h
+++ b/include/linux/utf8norm.h
@@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
extern ssize_t utf8len(const struct utf8data *data, const char *s);
extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);

+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF (12)
+
/*
* Cursor structure used by the normalizer.
*/
@@ -89,6 +92,7 @@ struct utf8cursor {
unsigned int slen;
short int ccc;
short int nccc;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
diff --git a/lib/charsets/utf8norm.c b/lib/charsets/utf8norm.c
index 6ca150ee268d..7af2c9861cb9 100644
--- a/lib/charsets/utf8norm.c
+++ b/lib/charsets/utf8norm.c
@@ -97,6 +97,38 @@ static inline int utf8clen(const char *s)
return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

+/*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+ unsigned int uc;
+
+ uc = *str++ & 0x0F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+
+ return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+ str[2] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[1] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[0] = val | 0xE0;
+
+ return 3;
+}
+
/*
* utf8trie_t
*
@@ -158,7 +190,8 @@ typedef const unsigned char utf8trie_t;
* characters with the Default_Ignorable_Code_Point property.
* These do affect normalization, as they all have CCC 0.
*
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
*
* Casefolding, if applicable, is also done using decompositions.
*
@@ -178,6 +211,105 @@ typedef const unsigned char utf8leaf_t;
#define STOPPER (0)
#define DECOMPOSE (255)

+/* Marker for hangul syllable decomposition. */
+#define HANGUL ((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF (12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, TPart, VPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode3(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode3((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -186,8 +318,8 @@ typedef const unsigned char utf8leaf_t;
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
- size_t len)
+static utf8leaf_t *utf8nlookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + data->offset;
int offlen;
@@ -225,8 +357,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie++;
} else {
/* No right node. */
- node = 0;
- trie = NULL;
+ return NULL;
}
} else {
/* Left leg */
@@ -236,8 +367,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie += offlen + 1;
} else if (*trie & RIGHTPATH) {
/* No left node. */
- node = 0;
- trie = NULL;
+ return NULL;
} else {
/* Left node after this node */
node = (*trie & TRIENODE);
@@ -245,6 +375,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -254,9 +392,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
*
* Forwards to utf8nlookup().
*/
-static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+static utf8leaf_t *utf8lookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s)
{
- return utf8nlookup(data, s, (size_t)-1);
+ return utf8nlookup(data, hangul, s, (size_t)-1);
}

/*
@@ -269,11 +408,13 @@ int utf8agemax(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;

@@ -296,12 +437,13 @@ int utf8agemin(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -322,11 +464,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -348,12 +492,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -376,11 +521,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -403,11 +549,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -530,10 +677,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->data, u8c->s);
- else
- leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->data, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -554,7 +703,9 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->data, u8c->s);
+
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ ccc = LEAF_CCC(leaf);
}

/*
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
index 700b41c0cb66..69f3be92ba71 100644
--- a/scripts/mkutf8data.c
+++ b/scripts/mkutf8data.c
@@ -180,10 +180,14 @@ typedef unsigned char utf8leaf_t;
#define MAXCCC (254)
#define STOPPER (0)
#define DECOMPOSE (255)
+#define HANGUL ((char)(255))
+
+#define UTF8HANGULLEAF (12)

struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+ const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);

unsigned char *utf8data;
size_t utf8data_size;
@@ -334,6 +338,8 @@ utf32valid(unsigned int unichar)
return unichar < 0x110000;
}

+#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3)
+
#define NODE 1
#define LEAF 0

@@ -466,7 +472,7 @@ tree_walk(struct tree *tree)
indent+1);
leaves += 1;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -864,7 +870,7 @@ mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
continue;
}
@@ -916,7 +922,7 @@ mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
if (!node->mark && node->parent->mark &&
!node->parent->left) {
@@ -1000,7 +1006,7 @@ index_nodes(struct tree *tree, int index)
index += tree->leaf_size(node->right);
count++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1021,6 +1027,26 @@ index_nodes(struct tree *tree, int index)
return index;
}

+/*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int
+mark_subtree(struct node *node)
+{
+ int changed;
+
+ if (!node || node->mark)
+ return 0;
+ node->mark = 1;
+ node->index = node->parent->index;
+ changed = 1;
+ if (node->leftnode == NODE)
+ changed += mark_subtree(node->left);
+ if (node->rightnode == NODE)
+ changed += mark_subtree(node->right);
+ return changed;
+}
+
/*
* Compute the size of nodes and leaves. We start by assuming that
* each node needs to store a three-byte offset. The indexes of the
@@ -1040,6 +1066,7 @@ size_nodes(struct tree *tree)
unsigned int bitmask;
unsigned int pathbits;
unsigned int pathmask;
+ unsigned int nbit;
int changed;
int offset;
int size;
@@ -1067,22 +1094,40 @@ size_nodes(struct tree *tree)
size = 1;
} else {
if (node->rightnode == NODE) {
+ /*
+ * If the right node is not marked,
+ * look for a corresponding node in
+ * the next tree. Such a node need
+ * not exist.
+ */
right = node->right;
next = tree->next;
while (!right->mark) {
assert(next);
n = next->root;
while (n->bitnum != node->bitnum) {
- if (pathbits & (1<<n->bitnum))
+ nbit = 1 << n->bitnum;
+ if (!(pathmask & nbit))
+ break;
+ if (pathbits & nbit) {
+ if (n->rightnode == LEAF)
+ break;
n = n->right;
- else
+ } else {
+ if (n->leftnode == LEAF)
+ break;
n = n->left;
+ }
}
+ if (n->bitnum != node->bitnum)
+ break;
n = n->right;
- assert(right->bitnum == n->bitnum);
right = n;
next = next->next;
}
+ /* Make sure the right node is marked. */
+ if (!right->mark)
+ changed += mark_subtree(right);
offset = right->index - node->index;
} else {
offset = *tree->leaf_index(tree, node->right);
@@ -1124,7 +1169,7 @@ size_nodes(struct tree *tree)
if (node->rightnode == LEAF) {
assert(node->right);
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1158,8 +1203,15 @@ emit(struct tree *tree, unsigned char *data)
int offset;
int index;
int indent;
+ int size;
+ int bytes;
+ int leaves;
+ int nodes[4];
unsigned char byte;

+ nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+ leaves = 0;
+ bytes = 0;
index = tree->index;
data += index;
indent = 1;
@@ -1168,7 +1220,10 @@ emit(struct tree *tree, unsigned char *data)
if (tree->childnode == LEAF) {
assert(tree->root);
tree->leaf_emit(tree->root, data);
- return;
+ size = tree->leaf_size(tree->root);
+ index += size;
+ leaves++;
+ goto done;
}

assert(tree->childnode == NODE);
@@ -1195,6 +1250,7 @@ emit(struct tree *tree, unsigned char *data)
offlen = 2;
else
offlen = 3;
+ nodes[offlen]++;
offset = node->offset;
byte |= offlen << OFFLEN_SHIFT;
*data++ = byte;
@@ -1207,12 +1263,14 @@ emit(struct tree *tree, unsigned char *data)
} else if (node->left) {
if (node->leftnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else if (node->right) {
byte |= RIGHTNODE;
if (node->rightnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else {
@@ -1227,7 +1285,10 @@ emit(struct tree *tree, unsigned char *data)
assert(node->left);
data = tree->leaf_emit(node->left,
data);
- index += tree->leaf_size(node->left);
+ size = tree->leaf_size(node->left);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->left) {
assert(node->leftnode == NODE);
indent += 1;
@@ -1241,9 +1302,12 @@ emit(struct tree *tree, unsigned char *data)
assert(node->right);
data = tree->leaf_emit(node->right,
data);
- index += tree->leaf_size(node->right);
+ size = tree->leaf_size(node->right);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1255,6 +1319,15 @@ emit(struct tree *tree, unsigned char *data)
indent -= 1;
}
}
+done:
+ if (verbose > 0) {
+ printf("Emitted %d (%d) leaves",
+ leaves, bytes);
+ printf(" %d (%d+%d+%d+%d) nodes",
+ nodes[0] + nodes[1] + nodes[2] + nodes[3],
+ nodes[0], nodes[1], nodes[2], nodes[3]);
+ printf(" %d total\n", index - tree->index);
+ }
}

/* ------------------------------------------------------------------ */
@@ -1360,7 +1433,9 @@ nfkdi_print(void *l, int indent)

printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
leaf->code, leaf->ccc, leaf->gen);
- if (leaf->utf8nfkdi)
+ if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
+ else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
}
@@ -1374,6 +1449,8 @@ nfkdicf_print(void *l, int indent)
leaf->code, leaf->ccc, leaf->gen);
if (leaf->utf8nfkdicf)
printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
@@ -1409,7 +1486,9 @@ nfkdi_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdi)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
return size;
}
@@ -1420,7 +1499,9 @@ nfkdicf_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdicf)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdicf)
size += strlen(leaf->utf8nfkdicf) + 1;
else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
@@ -1450,7 +1531,10 @@ nfkdi_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdi) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdi) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdi;
while ((*data++ = *s++) != 0)
@@ -1468,7 +1552,10 @@ nfkdicf_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdicf) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdicf) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdicf;
while ((*data++ = *s++) != 0)
@@ -1492,6 +1579,11 @@ utf8_create(struct unicode_data *data)
unsigned int *um;
int i;

+ if (data->utf8nfkdi) {
+ assert(data->utf8nfkdi[0] == HANGUL);
+ return;
+ }
+
u = utf;
um = data->utf32nfkdi;
if (um) {
@@ -1682,6 +1774,7 @@ verify(struct tree *tree)
utf8leaf_t *leaf;
unsigned int unichar;
char key[4];
+ unsigned char hangul[UTF8HANGULLEAF];
int report;
int nocf;

@@ -1695,7 +1788,8 @@ verify(struct tree *tree)
if (data->correction <= tree->maxage)
data = &unicode_data[unichar];
utf8encode(key,unichar);
- leaf = utf8lookup(tree, key);
+ leaf = utf8lookup(tree, hangul, key);
+
if (!leaf) {
if (data->gen != -1)
report++;
@@ -1709,7 +1803,10 @@ verify(struct tree *tree)
if (data->gen != LEAF_GEN(leaf))
report++;
if (LEAF_CCC(leaf) == DECOMPOSE) {
- if (nocf) {
+ if (HANGUL_SYLLABLE(data->code)) {
+ if (data->utf8nfkdi[0] != HANGUL)
+ report++;
+ } else if (nocf) {
if (!data->utf8nfkdi) {
report++;
} else if (strcmp(data->utf8nfkdi,
@@ -2394,6 +2491,15 @@ hangul_decompose(void)
memcpy(um, mapping, i * sizeof(unsigned int));
unicode_data[unichar].utf32nfkdicf = um;

+ /*
+ * Add a cookie as a reminder that the hangul syllable
+ * decompositions must not be stored in the generated
+ * trie.
+ */
+ unicode_data[unichar].utf8nfkdi = malloc(2);
+ unicode_data[unichar].utf8nfkdi[0] = HANGUL;
+ unicode_data[unichar].utf8nfkdi[1] = '\0';
+
if (verbose > 1)
print_utf32nfkdi(unichar);

@@ -2521,6 +2627,100 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *);
int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
int utf8byte(struct utf8cursor *);

+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -2530,7 +2730,7 @@ int utf8byte(struct utf8cursor *);
* shorthand for this will be "is valid UTF-8 unicode".
*/
static utf8leaf_t *
-utf8nlookup(struct tree *tree, const char *s, size_t len)
+utf8nlookup(struct tree *tree, unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + tree->index;
int offlen;
@@ -2586,6 +2786,14 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -2596,9 +2804,9 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
* Forwards to trie_nlookup().
*/
static utf8leaf_t *
-utf8lookup(struct tree *tree, const char *s)
+utf8lookup(struct tree *tree, unsigned char *hangul, const char *s)
{
- return utf8nlookup(tree, s, (size_t)-1);
+ return utf8nlookup(tree, hangul, s, (size_t)-1);
}

/*
@@ -2624,11 +2832,14 @@ utf8agemax(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2649,12 +2860,14 @@ utf8agemin(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2674,11 +2887,14 @@ utf8nagemax(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2699,12 +2915,14 @@ utf8nagemin(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2726,11 +2944,13 @@ utf8len(struct tree *tree, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2752,11 +2972,13 @@ utf8nlen(struct tree *tree, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2784,6 +3006,7 @@ struct utf8cursor {
short int ccc;
short int nccc;
unsigned int unichar;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
@@ -2900,10 +3123,12 @@ utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->tree, u8c->s);
- else
- leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->tree, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -2923,7 +3148,7 @@ utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->tree, u8c->s);
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
ccc = LEAF_CCC(leaf);
}
u8c->unichar = utf8decode(u8c->s);
--
2.15.1
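
To make the algorithmic decomposition concrete, here is the arithmetic from the comment above worked through for one syllable, U+AC01; this is a standalone userspace check written for illustration, not code from the patch.

#include <assert.h>

int main(void)
{
	unsigned int si = 0xAC01 - 0xAC00;	/* SIndex = 1 */
	unsigned int li = si / 588;		/* LIndex = 0 -> LPart = U+1100 */
	unsigned int vi = (si % 588) / 28;	/* VIndex = 0 -> VPart = U+1161 */
	unsigned int ti = si % 28;		/* TIndex = 1 -> TPart = U+11A8 */

	assert(0x1100 + li == 0x1100);
	assert(0x1161 + vi == 0x1161);
	assert(0x11A7 + ti == 0x11A8);
	return 0;
}

Each of the three parts is a 3-byte UTF-8 sequence, so the synthesized leaf needs 2 bytes of header (generation and CCC), at most 9 bytes of decomposition, and a NUL terminator, which is where the 12-byte UTF8HANGULLEAF buffer size comes from.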

2018-01-12 07:12:29

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/Kconfig | 5 ++
lib/charsets/Makefile | 1 +
lib/charsets/test_ucd.c | 186 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 192 insertions(+)
create mode 100644 lib/charsets/test_ucd.c

diff --git a/lib/Kconfig b/lib/Kconfig
index 9ca3bda7b2be..dd4552dae48b 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -593,6 +593,11 @@ config UTF8_NORMALIZATION
depends on CHARSETS
help
Say Y here to enable utf8 normalization support.
+
+config UTF8_SELFTEST
+ tristate "Test UTF-8 normalization support"
+ default n
+ depends on UTF8_NORMALIZATION
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 5e2fa7c20a47..73b16e54ab35 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_CHARSETS) += ascii.o

utf8-y += utf8_core.o utf8norm.o
obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o
+obj-$(CONFIG_UTF8_SELFTEST) += test_ucd.o

$(obj)/utf8norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
diff --git a/lib/charsets/test_ucd.c b/lib/charsets/test_ucd.c
new file mode 100644
index 000000000000..51112ce488ac
--- /dev/null
+++ b/lib/charsets/test_ucd.c
@@ -0,0 +1,186 @@
+/*
+ * Kernel module for testing utf-8 support.
+ *
+ * Copyright 2017 Collabora Ltd. All Rights Reserved
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/utf8norm.h>
+
+unsigned int failed_tests;
+unsigned int total_tests;
+
+#define _test(cond, func, line, fmt, ...) do { \
+ total_tests++; \
+ if (!cond) { \
+ failed_tests++; \
+ pr_err("test %s:%d Failed: %s%s", \
+ func, line, #cond, (fmt?":":".")); \
+ if (fmt) \
+ pr_err(fmt, ##__VA_ARGS__); \
+ } \
+ } while (0)
+#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define test(cond) _test(cond, __func__, __LINE__, "")
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[10];
+ unsigned char dec[10];
+} nfkdi_test_data[] = {
+ /* Trivial sequence */
+ {
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00}, /* "ABba" decomposes to itself */
+ .dec = {0x41, 0x42, 0x62, 0x61, 0x00}
+ },
+ /* Simple equivalent sequences */
+ {
+ .str = {0xc2, 0xbc, 0x00}, /* 'VULGAR FRACTION ONE QUARTER' decomposes to */
+ .dec = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00}, /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */
+ },
+ {
+ .str = {0xc3, 0xa4, 0x00}, /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to */
+ .dec = {0x61, 0xcc, 0x88, 0x00}, /* 'LETTER A' + 'COMBINING DIAERESIS' */
+ },
+ {
+ .str = {0xC7, 0x89, 0x00}, /* 'LATIN SMALL LETTER LJ' decomposes to */
+ .dec = {0x6c, 0x6a, 0x00}, /* 'LETTER L' + 'LETTER J' */
+ },
+ /* Canonical ordering */
+ {
+ .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0}, /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes to */
+ .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0}, /* A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */
+ }
+};
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[30];
+ unsigned char ncf[30];
+} nfkdicf_test_data[] = {
+ /* Trivial sequences */
+ {
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00}, /* "ABba" folds to lowercase */
+ .ncf = {0x61, 0x62, 0x62, 0x61, 0x00},
+ },
+ {
+ .str = "ABCDEFGHIJKLMNOPRSTUVWXYZ0.1", /* All ASCII folds to lower-case */
+ .ncf = "abcdefghijklmnoprstuvwxyz0.1",
+ },
+ {
+ .str = {0xc3, 0x9f, 0x00}, /* LATIN SMALL LETTER SHARP S folds to */
+ .ncf = {0x73, 0x73, 0x00}, /* LATIN SMALL LETTER S + LATIN SMALL LETTER S */
+ },
+ {
+ .str = {0xC3, 0x85, 0x0}, /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to */
+ .ncf = {0x61, 0xcc, 0x8a, 0x0}, /* LATIN SMALL LETTER A + COMBINING RING ABOVE */
+ }
+};
+
+static void check_utf8_nfkdi(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data = utf8nfkdi(UNICODE_AGE(7, 0, 0));
+
+ for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) {
+ int len = strlen(nfkdi_test_data[i].str);
+ int nlen = strlen(nfkdi_test_data[i].dec);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfkdi_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdi_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdi_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdi_test_data[i].dec[j]),
+ "%x %x\n", c, nfkdi_test_data[i].dec[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_nfkdicf(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data = utf8nfkdicf(UNICODE_AGE(7, 0, 0));
+
+ for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) {
+ int len = strlen(nfkdicf_test_data[i].str);
+ int nlen = strlen(nfkdicf_test_data[i].ncf);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfkdicf_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdicf_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdicf_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdicf_test_data[i].ncf[j]),
+ "0x%x 0x%x\n", c, nfkdicf_test_data[i].ncf[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_supported_versions(void)
+{
+ /* Unicode 7.0.0 should be supported. */
+ test(utf8version_is_supported(UNICODE_AGE(7, 0, 0)));
+
+ /* Unicode 8.0.0 should not be supported. */
+ test(!utf8version_is_supported(UNICODE_AGE(8, 0, 0)));
+
+ /* The following versions do not exist. */
+ test(!utf8version_is_supported(UNICODE_AGE(0, 0, 0)));
+ test(!utf8version_is_supported(UNICODE_AGE(-1, -1, -1)));
+}
+
+static int __init init_test_ucd(void)
+{
+ failed_tests = 0;
+ total_tests = 0;
+
+ check_supported_versions();
+ check_utf8_nfkdi();
+ check_utf8_nfkdicf();
+
+ if (!failed_tests)
+ pr_info("All %u tests passed\n", total_tests);
+ else
+ pr_err("%u out of %u tests failed\n", failed_tests,
+ total_tests);
+ return 0;
+}
+
+static void __exit exit_test_ucd(void)
+{
+}
+
+module_init(init_test_ucd);
+module_exit(exit_test_ucd);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi <[email protected]>");
+MODULE_LICENSE("GPL");
--
2.15.1
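
Usage note: with CONFIG_UTF8_SELFTEST=m the checks build as test_ucd.ko; loading the module (for example with insmod or modprobe) runs them from init_test_ucd(), and the summary goes to the kernel log as either "All <n> tests passed" or "<failed> out of <n> tests failed". The module does nothing else, so it can be unloaded right afterwards.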

2018-01-12 07:12:30

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 09/13] ext4: Add ignorecase mount option

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/super.c | 5 +++++
2 files changed, 8 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4e091eae38b1..09fb2426c352 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1130,6 +1130,9 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */

+#define EXT4_MOUNT2_CASE_INSENSITIVE 0x00000010 /* Perform case-insensitive
+ lookups */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7c46693a14d7..f737e90a3b4e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1350,6 +1350,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
+ Opt_ignore_case,
};

static const match_table_t tokens = {
@@ -1432,6 +1433,7 @@ static const match_table_t tokens = {
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
+ {Opt_ignore_case, "ignorecase"},
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
@@ -1687,6 +1689,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
case Opt_nolazytime:
sb->s_flags &= ~SB_LAZYTIME;
return 1;
+ case Opt_ignore_case:
+ set_opt2(sb, CASE_INSENSITIVE);
+ return 1;
}

for (m = ext4_mount_opts; m->token != Opt_err; m++)
--
2.15.1

2018-01-12 07:14:12

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 11/13] fscrypt: Introduce charset-based matching functions

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/fscrypt.h | 1 +
include/linux/fscrypt_notsupp.h | 16 ++++++++++++++++
include/linux/fscrypt_supp.h | 27 +++++++++++++++++++++++++++
3 files changed, 44 insertions(+)

diff --git a/include/linux/fscrypt.h b/include/linux/fscrypt.h
index 08b4b40c5aa8..8ca010ef2871 100644
--- a/include/linux/fscrypt.h
+++ b/include/linux/fscrypt.h
@@ -21,6 +21,7 @@
#include <linux/dcache.h>
#include <crypto/skcipher.h>
#include <uapi/linux/fs.h>
+#include <linux/charsets.h>

#define FS_CRYPTO_BLOCK_SIZE 16

diff --git a/include/linux/fscrypt_notsupp.h b/include/linux/fscrypt_notsupp.h
index 63e58808519a..2227db69d6ba 100644
--- a/include/linux/fscrypt_notsupp.h
+++ b/include/linux/fscrypt_notsupp.h
@@ -160,6 +160,22 @@ static inline bool fscrypt_match_name(const struct fscrypt_name *fname,
return !memcmp(de_name, fname->disk_name.name, fname->disk_name.len);
}

+static inline bool fscrypt_charset_match_name(const struct fscrypt_name *fname,
+ const struct charset *charset,
+ const u8 *de_name,
+ u32 de_name_len, bool ignorecase)
+{
+ if (!ignorecase) {
+ return !charset_strncmp(charset, (char *) de_name,
+ fname->disk_name.name,
+ fname->disk_name.len);
+ }
+
+ return !charset_strncasecmp(charset, (char *) de_name,
+ fname->disk_name.name,
+ fname->disk_name.len);
+}
+
/* bio.c */
static inline void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *ctx,
struct bio *bio)
diff --git a/include/linux/fscrypt_supp.h b/include/linux/fscrypt_supp.h
index cf9e9fc02f0a..4e6f25521b99 100644
--- a/include/linux/fscrypt_supp.h
+++ b/include/linux/fscrypt_supp.h
@@ -138,6 +138,33 @@ static inline bool fscrypt_match_name(const struct fscrypt_name *fname,
return !memcmp(de_name, fname->disk_name.name, fname->disk_name.len);
}

+static inline bool fscrypt_charset_match_name(const struct fscrypt_name *fname,
+ const struct charset *charset,
+ const u8 *de_name,
+ u32 de_name_len, bool ignorecase)
+{
+ if (unlikely(!fname->disk_name.name)) {
+ const struct fscrypt_digested_name *n =
+ (const void *)fname->crypto_buf.name;
+ if (WARN_ON_ONCE(fname->usr_fname->name[0] != '_'))
+ return false;
+ if (de_name_len <= FSCRYPT_FNAME_MAX_UNDIGESTED_SIZE)
+ return false;
+ return !memcmp(FSCRYPT_FNAME_DIGEST(de_name, de_name_len),
+ n->digest, FSCRYPT_FNAME_DIGEST_SIZE);
+ }
+
+ if (!ignorecase) {
+ return !charset_strncmp(charset, (char *) de_name,
+ fname->disk_name.name,
+ fname->disk_name.len);
+ }
+
+ return !charset_strncasecmp(charset, (char *) de_name,
+ fname->disk_name.name,
+ fname->disk_name.len);
+}
+
/* bio.c */
extern void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *, struct bio *);
extern void fscrypt_pullback_bio_page(struct page **, bool);
--
2.15.1

2018-01-12 07:12:31

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 10/13] ext4: Include encoding information on the superblock

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 27 ++++++++++++++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 09fb2426c352..6c8aaf2dd322 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1369,6 +1369,7 @@ struct ext4_sb_info {
struct kobject s_kobj;
struct completion s_kobj_unregister;
struct super_block *s_sb;
+ const struct charset *encoding;

/* Journaling */
struct journal_s *s_journal;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f737e90a3b4e..022bf5f274b2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -40,6 +40,7 @@
#include <linux/dax.h>
#include <linux/cleancache.h>
#include <linux/uaccess.h>
+#include <linux/charsets.h>

#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1350,7 +1351,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_ignore_case,
+ Opt_ignore_case, Opt_encoding,
};

static const match_table_t tokens = {
@@ -1434,6 +1435,7 @@ static const match_table_t tokens = {
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_ignore_case, "ignorecase"},
+ {Opt_encoding, "encoding=%s"},
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
@@ -1644,6 +1646,7 @@ static const struct mount_opts {
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+ {Opt_encoding, 0, MOPT_EXT4_ONLY | MOPT_STRING},
{Opt_err, 0, 0}
};

@@ -1882,6 +1885,16 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_mount_opt |= m->mount_opt;
} else if (token == Opt_data_err_ignore) {
sbi->s_mount_opt &= ~m->mount_opt;
+ } else if (token == Opt_encoding) {
+ char *charset = match_strdup(&args[0]);
+
+ sbi->encoding = charset_load(charset);
+ if (!sbi->encoding) {
+ ext4_msg(sb, KERN_ERR, "Encoding %s not supported\n",
+ charset);
+ return -1;
+ }
+ kfree(charset);
} else {
if (!args->from)
arg = 1;
@@ -1968,6 +1981,12 @@ static int parse_options(char *options, struct super_block *sb,
return 0;
}
}
+
+ if (test_opt2(sb, CASE_INSENSITIVE) && !sbi->encoding) {
+ ext4_msg(sb, KERN_WARNING, "Encoding not explicitly specified "
+ "or identified on the superblock. ASCII is assumed.");
+ }
+
return 1;
}

@@ -3598,6 +3617,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
&journal_ioprio, 0))
goto failed_mount;

+ if (!sbi->encoding) {
+ sbi->encoding = charset_load("ascii");
+ if (!sbi->encoding)
+ goto failed_mount;
+ }
+
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
"with data=journal disables delayed "
--
2.15.1

2018-01-12 07:12:33

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 12/13] ext4: Support charset name matching

Implement generic caseless lookup based on the mount flags using the
charset library.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/namei.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index e750d68fbcb5..24056172e225 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1255,8 +1255,10 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block)
*
* Return: %true if the directory entry matches, otherwise %false.
*/
-static inline bool ext4_match(const struct ext4_filename *fname,
- const struct ext4_dir_entry_2 *de)
+static inline bool ext4_match(const struct charset *charset,
+ const struct ext4_filename *fname,
+ const struct ext4_dir_entry_2 *de,
+ bool ignorecase)
{
struct fscrypt_name f;

@@ -1268,7 +1270,8 @@ static inline bool ext4_match(const struct ext4_filename *fname,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
f.crypto_buf = fname->crypto_buf;
#endif
- return fscrypt_match_name(&f, de->name, de->name_len);
+ return fscrypt_charset_match_name(&f, charset, de->name, de->name_len,
+ ignorecase);
}

/*
@@ -1281,6 +1284,8 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
struct ext4_dir_entry_2 * de;
char * dlimit;
int de_len;
+ struct super_block *sb = dir->i_sb;
+ bool ignorecase = test_opt2(sb, CASE_INSENSITIVE);

de = (struct ext4_dir_entry_2 *)search_buf;
dlimit = search_buf + buf_size;
@@ -1288,7 +1293,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
if ((char *) de + de->name_len <= dlimit &&
- ext4_match(fname, de)) {
+ ext4_match(EXT4_SB(sb)->encoding, fname, de, ignorecase)) {
/* found a match - just to be sure, do
* a full check */
if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
@@ -1389,11 +1394,13 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
if (is_dx(dir)) {
ret = ext4_dx_find_entry(dir, &fname, res_dir);
/*
- * On success, or if the error was file not found,
- * return. Otherwise, fall back to doing a search the
- * old fashioned way.
+ * On success, or if the name was not found with case-
+ * insensitive lookups disabled, return. On a corrupted
+ * dx dir, or on a miss when case-insensitive lookups are
+ * enabled, fall back to the linear search.
*/
- if (!IS_ERR(ret) || PTR_ERR(ret) != ERR_BAD_DX_DIR)
+ if ((!IS_ERR(ret) && (ret || !test_opt2(sb, CASE_INSENSITIVE))) ||
+ (IS_ERR(ret) && PTR_ERR(ret) != ERR_BAD_DX_DIR))
goto cleanup_and_exit;
dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, "
"falling back\n"));
@@ -1788,6 +1795,8 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
int nlen, rlen;
unsigned int offset = 0;
char *top;
+ struct super_block *sb = dir->i_sb;
+ bool ignorecase = test_opt2(sb, CASE_INSENSITIVE);

de = (struct ext4_dir_entry_2 *)buf;
top = buf + buf_size - reclen;
@@ -1795,7 +1804,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
if (ext4_check_dir_entry(dir, NULL, de, bh,
buf, buf_size, offset))
return -EFSCORRUPTED;
- if (ext4_match(fname, de))
+ if (ext4_match(EXT4_SB(sb)->encoding, fname, de, ignorecase))
return -EEXIST;
nlen = EXT4_DIR_REC_LEN(de->name_len);
rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
--
2.15.1

2018-01-12 07:14:24

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 2 ++
fs/ext4/super.c | 5 +++++
3 files changed, 70 insertions(+)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..8b73a68930d2 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -25,6 +25,7 @@
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/slab.h>
+#include <linux/charsets.h>
#include "ext4.h"
#include "xattr.h"

@@ -661,3 +662,65 @@ const struct file_operations ext4_dir_operations = {
.open = ext4_dir_open,
.release = ext4_release_dir,
};
+
+static int ext4_d_hash(const struct dentry *dentry, struct qstr *q)
+{
+ unsigned long hash;
+ int i, len;
+ char *str;
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ len = charset_normalize(charset, q->name, q->len, &str);
+
+ hash = init_name_hash(dentry);
+ for (i = 0; i < len; i++)
+ hash = partial_name_hash(str[i], hash);
+ q->hash = end_name_hash(hash);
+
+ kfree(str);
+ return 0;
+}
+
+static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ return charset_strncmp(charset, str, name->name, len);
+}
+
+const struct dentry_operations ext4_dentry_ops = {
+ .d_hash = ext4_d_hash,
+ .d_compare = ext4_d_compare,
+};
+
+static int ext4_d_ci_hash(const struct dentry *dentry, struct qstr *q)
+{
+ unsigned long hash;
+ int i, len;
+ char *str;
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ len = charset_casefold(charset, q->name, q->len, &str);
+
+ hash = init_name_hash(dentry);
+ for (i = 0; i < len; i++)
+ hash = partial_name_hash(str[i], hash);
+ q->hash = end_name_hash(hash);
+
+ kfree(str);
+ return 0;
+}
+
+static int ext4_d_ci_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ return charset_strncasecmp(charset, str, name->name, len);
+}
+
+const struct dentry_operations ext4_ci_dentry_ops = {
+ .d_hash = ext4_d_ci_hash,
+ .d_compare = ext4_d_ci_compare,
+};
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6c8aaf2dd322..980968aeb1cf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2943,6 +2943,8 @@ static inline void ext4_unlock_group(struct super_block *sb,

/* dir.c */
extern const struct file_operations ext4_dir_operations;
+extern const struct dentry_operations ext4_dentry_ops;
+extern const struct dentry_operations ext4_ci_dentry_ops;

/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 022bf5f274b2..877acb938453 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4358,6 +4358,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (es->s_error_count)
mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5 minutes */

+ if (test_opt2(sb, CASE_INSENSITIVE))
+ sb->s_d_op = &ext4_ci_dentry_ops;
+ else
+ sb->s_d_op = &ext4_dentry_ops;
+
/* Enable message ratelimiting. Default is 10 messages per 5 secs. */
ratelimit_state_init(&sbi->s_err_ratelimit_state, 5 * HZ, 10);
ratelimit_state_init(&sbi->s_warning_ratelimit_state, 5 * HZ, 10);
--
2.15.1
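
A sketch of the invariant the case-insensitive hooks in this patch depend on: any two names that ext4_d_ci_compare() treats as equal must also get the same q->hash from ext4_d_ci_hash(), which is why the hash is computed over the casefolded output rather than the raw name bytes. The helper below is hypothetical and only assumes charset_casefold() as used by the hook itself.

#include <linux/charsets.h>
#include <linux/slab.h>
#include <linux/string.h>

/*
 * Sketch: returns 1 when the casefolded forms of a and b are
 * byte-identical.  In that case ext4_d_ci_hash() feeds identical bytes
 * to partial_name_hash() for both names, so they land in the same hash
 * bucket and ext4_d_ci_compare() makes the final decision.
 */
static int casefolds_match(const struct charset *cs, const char *a,
			   const char *b)
{
	char *fa = NULL, *fb = NULL;
	int la, lb, same = 0;

	la = charset_casefold(cs, a, strlen(a), &fa);
	lb = charset_casefold(cs, b, strlen(b), &fb);
	if (la >= 0 && lb >= 0)
		same = (la == lb) && !memcmp(fa, fb, la);

	kfree(fa);		/* kfree(NULL) is a no-op */
	kfree(fb);
	return same;
}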

Subject: RE: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Hi Gabriel,

A couple of comments inline below.

Olaf Weber

> -----Original Message-----
> From: Gabriel Krisman Bertazi [mailto:[email protected]]
> Sent: Friday, January 12, 2018 08:12
> To: [email protected]; [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Gabriel
> Krisman Bertazi <[email protected]>
> Subject: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets
> library
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> lib/charsets/Makefile | 2 +-
> lib/charsets/utf8_core.c | 178
> +++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 179 insertions(+), 1 deletion(-)
> create mode 100644 lib/charsets/utf8_core.c
>
> diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
> index 95389c4193b0..5e2fa7c20a47 100644
> --- a/lib/charsets/Makefile
> +++ b/lib/charsets/Makefile
> @@ -4,7 +4,7 @@ obj-$(CONFIG_CHARSETS) += charsets.o
>
> obj-$(CONFIG_CHARSETS) += ascii.o
>
> -utf8-y += utf8norm.o
> +utf8-y += utf8_core.o utf8norm.o
> obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o
>
> $(obj)/utf8norm.o: $(obj)/utf8data.h
> diff --git a/lib/charsets/utf8_core.c b/lib/charsets/utf8_core.c
> new file mode 100644
> index 000000000000..94427670e96e
> --- /dev/null
> +++ b/lib/charsets/utf8_core.c
> @@ -0,0 +1,178 @@
> +/*
> + * Copyright (c) 2017 Collabora Ltd.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it would be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + */
> +
> +#include <linux/charsets.h>
> +#include <linux/utf8norm.h>
> +#include <linux/slab.h>
> +#include <linux/parser.h>
> +#include <linux/string.h>
> +
> +static int utf8_strncmp(const struct charset *charset, const char *str1,
> + const char *str2, int len)
> +{
> + const struct utf8data *data = utf8nfkdi(charset->version);
> + struct utf8cursor cur1, cur2;
> + unsigned char c1, c2;
> + int r, i;
> +
> + r = utf8cursor(&cur1, data, str1);
> + if (r < 0)
> + return -EIO;
> + r = utf8cursor(&cur2, data, str2);
> + if (r < 0)
> + return -EIO;
> +
> + for (i = 0 ; i < len ; i++) {
> + c1 = utf8byte(&cur1);
> + c2 = utf8byte(&cur2);
> +
> + if (!c1 || !c2 || c1 != c2)
> + return 1;
> +
> + }
> +
> + return 0;
> +}

This function is broken, but the reasons why illustrate the traps and pitfalls of working
with utf8 code and limited length buffers.

As written, if str1 or str2 ends before the loop has compared len bytes, utf8_strncmp() returns 1
(not equal), even when the two strings are in fact equal. The check in the loop would have to be
something like this instead:
	if (c1 != c2)
		return 1;
	if (!c1)	/* implies !c2 as well */
		return 0;

But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
length parameter to set up the cursors.

With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
or an incomplete sequence of Unicode characters.

Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.
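(Storing the result in an unsigned char, as the patch does, would silently truncate a negative error
return into an ordinary byte value, which is why the error check in the version below uses int.)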

So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.

static int utf8_strncmp(const struct charset *charset,
			const char *str1,
			const char *str2,
			int len)
{
	const struct utf8data *data = utf8nfkdi(charset->version);
	struct utf8cursor cur1;
	struct utf8cursor cur2;
	int c1;
	int c2;

	if (utf8ncursor(&cur1, data, str1, len) < 0)
		return -EINVAL;
	if (utf8ncursor(&cur2, data, str2, len) < 0)
		return -EINVAL;

	do {
		c1 = utf8byte(&cur1);
		c2 = utf8byte(&cur2);

		if (c1 < 0 || c2 < 0)
			return -EINVAL;
		if (c1 != c2)
			return 1;
	} while (c1);

	return 0;
}


> +
> +static int utf8_strncasecmp(const struct charset *charset, const char *str1,
> + const char *str2, int len)
> +{
> + const struct utf8data *data = utf8nfkdicf(charset->version);
> + struct utf8cursor cur1, cur2;
> + unsigned char c1, c2;
> + int r, i;
> +
> + r = utf8cursor(&cur1, data, str1);
> + if (r < 0)
> + return -EIO;
> +
> + r = utf8cursor(&cur2, data, str2);
> + if (r < 0)
> + return -EIO;
> +
> + for (i = 0 ; i < len ; i++) {
> + c1 = utf8byte(&cur1);
> + c2 = utf8byte(&cur2);
> +
> + if (!c1 || !c2 || c1 != c2)
> + return 1;
> + }
> +
> + return 0;
> +}

Same comments as above apply here.

> +
> +int utf8_casefold(const struct charset *charset, const char *str, int len,
> + char **folded_str)
> +{
> + const struct utf8data *data = utf8nfkdicf(charset->version);
> + struct utf8cursor cur;
> + int i;
> + char buffer[1024];
> +
> + if (utf8cursor(&cur, data, str))
> + return -EIO;
> +
> + for (i = 0; i < (1024-1); i++) {
> + buffer[i] = utf8byte(&cur);
> + if (!buffer[i])
> + break;
> + }
> + buffer[i] = '\0';
> + *folded_str = kstrdup(buffer, GFP_NOFS);
> + if (!*folded_str)
> + return -ENOMEM;
> +
> + return i;
> +}

I'm not sure a 1k buffer on the stack will be welcome. Maybe just use
utf8nlen() to get the target size and eat the cost of doing the normalization
twice. An advantage of using utf8nlen() is that it will validate the input string
as well.

Here too you should use utf8ncursor() to account for the len parameter.

int utf8_casefold(const struct charset *charset,
		  const char *str, int len,
		  char **folded_str)
{
	const struct utf8data *data = utf8nfkdicf(charset->version);
	struct utf8cursor cur;
	char *s;
	int c;
	ssize_t size;

	size = utf8nlen(data, str, len);
	if (size < 0)
		return -EINVAL;
	s = kmalloc(size + 1, GFP_NOFS);
	if (!s)
		return -ENOMEM;
	*folded_str = s;
	/*
	 * utf8nlen() verified that str is well-formed, so
	 * utf8ncursor() and utf8byte() will not fail.
	 */
	utf8ncursor(&cur, data, str, len);
	do {
		c = utf8byte(&cur);
		*s++ = c;
	} while (c);

	return size;
}

The do-while loop could be written as follows as well, but IIRC that style is discouraged these days.

	while ((*s++ = utf8byte(&cur)))
		;


> +
> +int utf8_normalize(const struct charset *charset, const char *str, int len,
> + char **normalization)
> +{
> + const struct utf8data *data = utf8nfkdi(charset->version);
> + struct utf8cursor cur;
> + int i;
> + char buffer[1024];
> +
> + if (utf8cursor(&cur, data, str))
> + return -EIO;
> +
> + for (i = 0; i < (1024-1); i++) {
> + buffer[i] = utf8byte(&cur);
> + if (!buffer[i])
> + break;
> + }
> + buffer[i] = '\0';
> + *normalization = kstrdup(buffer, GFP_NOFS);
> + if (!*normalization)
> + return -ENOMEM;
> +
> + return i;
> +}

Similar here.

> +
> +static const struct charset_ops utf8_ops = {
> + .strncmp = utf8_strncmp,
> + .strncasecmp = utf8_strncasecmp,
> + .casefold = utf8_casefold,
> + .normalize = utf8_normalize,
> +};
> +
> +static struct charset *utf8_load_charset(void *pargs)
> +{
> + int maj, min, rev;
> + unsigned int age;
> + struct charset *charset;
> + substring_t *args = pargs;
> +
> + if (match_int(&args[0], &maj) || match_int(&args[1], &min) ||
> + match_int(&args[2], &rev))
> + return NULL;
> +
> + age = UNICODE_AGE(maj, min, rev);
> +
> + if (!utf8version_is_supported(age))
> + return NULL;

Maybe utf8version_is_supported() should be changed to take 'maj', 'min', 'rev' as separate parameters.
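Something along these lines, just a sketch of the suggested signature:

	int utf8version_is_supported(unsigned int maj, unsigned int min, unsigned int rev);

That would let callers skip building the age value themselves.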

Olaf

> +
> + charset = kmalloc(sizeof(struct charset), GFP_KERNEL);
> + if (!charset)
> + return NULL;
> +
> + charset->info = NULL;
> + charset->version = age;
> + charset->ops = &utf8_ops;
> +
> + return charset;
> +}
> +
> +static struct charset_info utf8_info = {
> + .name = "utf8",
> + .match_token = "utf8-%d.%d.%d",
> + .load_charset = utf8_load_charset,
> +};
> +
> +static int __init init_utf8(void)
> +{
> + charset_register(&utf8_info);
> + return 0;
> +}
> +
> +static void __exit exit_utf8(void)
> +{
> +}
> +
> +module_init(init_utf8);
> +module_exit(exit_utf8);
> +MODULE_AUTHOR("Gabriel Krisman Bertazi");
> +MODULE_DESCRIPTION("UTF-8 charset operations for filesystems");
> +MODULE_LICENSE("GPL");
> +
> --
> 2.15.1

Subject: RE: [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom charsets

Hi Gabriel,

Short summary: you're calling functions that can return an error, but do not check for that.

Olaf

> -----Original Message-----
> From: Gabriel Krisman Bertazi [mailto:[email protected]]
> Sent: Friday, January 12, 2018 08:13
> To: [email protected]; [email protected]; [email protected]; [email protected]
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Gabriel
> Krisman Bertazi <[email protected]>
> Subject: [PATCH RFC 13/13] ext4: Implement ext4 dcache hooks for custom
> charsets
>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> ---
> fs/ext4/dir.c | 63
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/ext4.h | 2 ++
> fs/ext4/super.c | 5 +++++
> 3 files changed, 70 insertions(+)
>
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index d5babc9f222b..8b73a68930d2 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -25,6 +25,7 @@
> #include <linux/fs.h>
> #include <linux/buffer_head.h>
> #include <linux/slab.h>
> +#include <linux/charsets.h>
> #include "ext4.h"
> #include "xattr.h"
>
> @@ -661,3 +662,65 @@ const struct file_operations ext4_dir_operations = {
> .open = ext4_dir_open,
> .release = ext4_release_dir,
> };
> +
> +static int ext4_d_hash(const struct dentry *dentry, struct qstr *q)
> +{
> + unsigned long hash;
> + int i, len;
> + char *str;
> + const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
> +
> + len = charset_normalize(charset, q->name, q->len, &str);

What if charset_normalize() returns an error?
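One way to handle it, as a rough sketch only: fall back to hashing the raw bytes when normalization
fails (a negative return here), and only kfree() afterwards when a normalized copy was actually
allocated:

	len = charset_normalize(charset, q->name, q->len, &str);
	if (len < 0) {
		/* fall back to hashing the name as-is */
		str = (char *)q->name;
		len = q->len;
	}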


> +
> + hash = init_name_hash(dentry);
> + for (i = 0; i < len; i++)
> + hash = partial_name_hash(str[i], hash);
> + q->hash = end_name_hash(hash);
> +
> + kfree(str);
> + return 0;
> +}
> +
> +static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
> + const char *str, const struct qstr *name)
> +{
> + const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
> +
> + return charset_strncmp(charset, str, name->name, len);
> +}
> +
> +const struct dentry_operations ext4_dentry_ops = {
> + .d_hash = ext4_d_hash,
> + .d_compare = ext4_d_compare,
> +};
> +
> +static int ext4_d_ci_hash(const struct dentry *dentry, struct qstr *q)
> +{
> + unsigned long hash;
> + int i, len;
> + char *str;
> + const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
> +
> + len = charset_casefold(charset, q->name, q->len, &str);

What if charset_casefold() returns an error?
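The same kind of fallback as sketched for ext4_d_hash() above would work here as well.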


> +
> + hash = init_name_hash(dentry);
> + for (i = 0; i < len; i++)
> + hash = partial_name_hash(str[i], hash);
> + q->hash = end_name_hash(hash);
> +
> + kfree(str);
> + return 0;
> +}
> +
> +static int ext4_d_ci_compare(const struct dentry *dentry, unsigned int len,
> + const char *str, const struct qstr *name)
> +{
> + const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
> +
> + return charset_strncasecmp(charset, str, name->name, len);
> +}
> +
> +const struct dentry_operations ext4_ci_dentry_ops = {
> + .d_hash = ext4_d_ci_hash,
> + .d_compare = ext4_d_ci_compare,
> +};
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 6c8aaf2dd322..980968aeb1cf 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2943,6 +2943,8 @@ static inline void ext4_unlock_group(struct
> super_block *sb,
>
> /* dir.c */
> extern const struct file_operations ext4_dir_operations;
> +extern const struct dentry_operations ext4_dentry_ops;
> +extern const struct dentry_operations ext4_ci_dentry_ops;
>
> /* file.c */
> extern const struct inode_operations ext4_file_inode_operations;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 022bf5f274b2..877acb938453 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4358,6 +4358,11 @@ static int ext4_fill_super(struct super_block *sb,
> void *data, int silent)
> if (es->s_error_count)
> mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5
> minutes */
>
> + if (test_opt2(sb, CASE_INSENSITIVE))
> + sb->s_d_op = &ext4_ci_dentry_ops;
> + else
> + sb->s_d_op = &ext4_dentry_ops;
> +
> /* Enable message ratelimiting. Default is 10 messages per 5 secs. */
> ratelimit_state_init(&sbi->s_err_ratelimit_state, 5 * HZ, 10);
> ratelimit_state_init(&sbi->s_warning_ratelimit_state, 5 * HZ, 10);
> --
> 2.15.1

2018-01-12 16:56:57

by Jeremy Allison

[permalink] [raw]
Subject: Re: [PATCH RFC 00/13] UTF-8 case insensitive lookups for EXT4

On Fri, Jan 12, 2018 at 05:12:21AM -0200, Gabriel Krisman Bertazi wrote:
> Hi,
>
> In the past few months, I've been working to support case-insensitive
> lookups of utf8 encoded strings, primarily for EXT4, and then for other
> filesystems. This RFC uses the awesome UTF8 normalization
> implementation done by the SGI guys in 2014, namely Olaf Weber and Ben
> Myers, but it, unfortunately, never went upstream. That SGI effort was
> made of 3 versions of an RFC submitted to this list, and the last
> version was archived below:

Just want to say thanks a *LOT* for this work. It would significantly
help Samba and Windows interop. Now if we can only get the RichACL
patch into the kernel.... :-).

Cheers,

Jeremy.

2018-01-12 17:04:40

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH RFC 03/13] charsets: utf8: Add unicode character database files

On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> From: Olaf Weber <[email protected]>
>
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.
>
> Signed-off-by: Olaf Weber <[email protected]>
> Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> [Move ucd directory to lib/charsets]
> ---
> lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)
> create mode 100644 lib/charsets/ucd/README
>
> diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
> new file mode 100644
> index 000000000000..d713e663cdf9
> --- /dev/null
> +++ b/lib/charsets/ucd/README
> @@ -0,0 +1,33 @@
> +The files in this directory are part of the Unicode Character Database
> +for version 7.0.0 of the Unicode standard.
> +
> +The full set of files can be found here:
> +
> + http://www.unicode.org/Public/7.0.0/ucd/
> +
> +The latest released version of the UCD can be found here:
> +
> + http://www.unicode.org/Public/UCD/latest/
> +
> +The files in this directory are identical, except that they have been
> +renamed with a suffix indicating the unicode version.
> +
> +Individual source links:
> +
> + http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> + http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> + http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> + http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> + http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> + http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> + http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> +
> +md5sums
> +
> + 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt
> + 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt
> + 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt
> + c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt
> + 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt
> + 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt
> + c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt

Uh... are these files supposed to be attached to this patch?

--D

> --
> 2.15.1
>

Subject: RE: [PATCH RFC 03/13] charsets: utf8: Add unicode character database files

> -----Original Message-----
> From: Darrick J. Wong [mailto:[email protected]]
> Sent: Friday, January 12, 2018 17:59
> To: Gabriel Krisman Bertazi <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH RFC 03/13] charsets: utf8: Add unicode character
> database files
>
> On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> > From: Olaf Weber <[email protected]>
> >
> > Add files from the Unicode Character Database, version 7.0.0, to the source.
> > A helper program that generates a trie used for normalization from
> > these files is part of a separate commit.
> >
> > Signed-off-by: Olaf Weber <[email protected]>
> > Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
> > [Move ucd directory to lib/charsets]
> > ---
> > lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
> > create mode 100644 lib/charsets/ucd/README
> >
> > diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
> > new file mode 100644
> > index 000000000000..d713e663cdf9
> > --- /dev/null
> > +++ b/lib/charsets/ucd/README
> > @@ -0,0 +1,33 @@
> > +The files in this directory are part of the Unicode Character
> > +Database for version 7.0.0 of the Unicode standard.
> > +
> > +The full set of files can be found here:
> > +
> > + http://www.unicode.org/Public/7.0.0/ucd/
> > +
> > +The latest released version of the UCD can be found here:
> > +
> > + http://www.unicode.org/Public/UCD/latest/
> > +
> > +The files in this directory are identical, except that they have been
> > +renamed with a suffix indicating the unicode version.
> > +
> > +Individual source links:
> > +
> > + http://www.unicode.org/Public/7.0.0/ucd/CaseFolding.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/DerivedAge.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedCombiningClass.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/NormalizationCorrections.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/NormalizationTest.txt
> > + http://www.unicode.org/Public/7.0.0/ucd/UnicodeData.txt
> > +
> > +md5sums
> > +
> > + 9a92b2bfe56c6719def926bab524fefd CaseFolding-7.0.0.txt
> > + 07b8b1027eb824cf0835314e94f23d2e DerivedAge-7.0.0.txt
> > + 90c3340b16821e2f2153acdbe6fc6180 DerivedCombiningClass-7.0.0.txt
> > + c41c0601f808116f623de47110ed4f93 DerivedCoreProperties-7.0.0.txt
> > + 522720ddfc150d8e63a2518634829bce NormalizationCorrections-7.0.0.txt
> > + 1f35175eba4a2ad795db489f789ae352 NormalizationTest-7.0.0.txt
> > + c8355655731d75e6a3de8c20d7e601ba UnicodeData-7.0.0.txt
>
> Uh... are these files supposed to be attached to this patch?

Actually, no, as was explained in the 1st message:

" Like the original submission from Ben, I excluded the commit that includes the
" generated header file and unicode files because they are too big and would
" bounce the list. Instead, instructions on fetching and generating the files are
" documented in the commit message.

One issue we (SGI) anticipated is that we were proposing the inclusion of a large binary blob into
the kernel. And people here do dislike opaque binary blobs. So instead we proposed including the
program that generated the blob in question plus the source files it uses. On the one hand, that is a
sizable increase of the kernel source tree; on the other hand, there is no argument about the provenance
of the blob, as both source and generator are right there.

An alternative might be to include the generated blob itself but retain the instructions so people can
verify it, provided they care to do so. If someone was really ambitious, they could even automate
grabbing the source files from unicode.org as part of a verification build. If they were even more
ambitious, they could add such a verification build as an option to the linux kernel build system. (In
other words, I am not the one who's going to implement this if it turns out that people on this list
believe this to be a good idea.)

Olaf

2018-01-13 00:24:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC 03/13] charsets: utf8: Add unicode character database files

On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
> From: Olaf Weber <[email protected]>
>
> Add files from the Unicode Character Database, version 7.0.0, to the source.
> A helper program that generates a trie used for normalization from these
> files is part of a separate commit.

It looks like the latest version of Unicode is 10.0.0. Once we pick a
Unicode version, changing will be painful; but in the absence of
interop requirements, is there a reason to stick with Unicode 7? Why
not take the latest version of Unicode and then freeze on it?

- Ted

2018-01-13 04:28:12

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 03/13] charsets: utf8: Add unicode character database files

Theodore Ts'o <[email protected]> writes:

> On Fri, Jan 12, 2018 at 05:12:24AM -0200, Gabriel Krisman Bertazi wrote:
>> From: Olaf Weber <[email protected]>
>>
>> Add files from the Unicode Character Database, version 7.0.0, to the source.
>> A helper program that generates a trie used for normalization from these
>> files is part of a separate commit.
>
> It looks like the latest version of Unicode is 10.0.0. Once we pick a
> Unicode version, changing will be painful; but in the absence of
> interop requirements, is there a reason to stick with Unicode 7? Why
> not take the latest version of Unicode and then freeze on it?
>

Hi Ted,

No, there isn't a specific reason for unicode 7 and I forgot to mention
this in my cover letter. I have successfully generated the data file
for 10.0.0 with the mkutf8data script, but I couldn't validate it
entirely yet. I walked through changelogs to make sure any relevant
changes were there, but I'm not done yet. You can definitely expect
new versions of the patchset to support 10.0.0.

Thanks,

--
Gabriel Krisman Bertazi

2018-01-16 16:50:33

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

"Weber, Olaf (HPC Data Management & Storage)" <[email protected]>
writes:

> But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
> the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
> length parameter to set up the cursors.
>
> With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
> null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
> or an incomplete sequence of Unicode characters.
>
> Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.
>
> So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
> that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.
>
> static int utf8_strncmp(const struct charset *charset,
> const char *str1,
> const char *str2,
> int len)
> {
> const struct utf8data *data = utf8nfkdi(charset->version);
> struct utf8cursor cur1;
> struct utf8cursor cur2;
> int c1;
> int c2;
>
> if (utf8ncursor(&cur1, data, str1, len) < 0)
> return -EINVAL;
> if (utf8ncursor(&cur2, data, str2, len) < 0)
> return -EINVAL;
>
> do {
> c1 = utf8byte(&cur1);
> c2 = utf8byte(&cur2);
>
> if (c1 < 0 || c2 < 0)
> return -EINVAL;
> if (c1 != c2)
> return 1;
> } while (c1);
>
> return 0;
> }

Hi Olaf,

Thanks for your review and deep explanations.

I get your point and I've added a test case to trigger it in the
test_ucd module.

One question that I have, though: taking the version you
shared, I want to avoid the -EINVAL for the case when strings s1
and s2 should match as equal, but strlen(s1) < strlen(s2). In this
case:

strncmp (s1, s2, strlen (s2)) => Returns 0. Matches Ok
strncmp (s1, s2, strlen (s1)) => Returns -EINVAL

I know -EINVAL doesn't mean they don't match, but this case seems too
error prone.

I suppose we just could:

(1) let the caller deal with it, which is error prone. Or,

(2) Require two lens on strncmp, one for each string, Or,

(3) use utf8cursor for the second string, which plays badly with
non-null-terminated strings, and those are important for filesystems.

Do you see an alternative? I'm leaning towards option 2. Are you ok
with that?

--
Gabriel Krisman Bertazi

Subject: RE: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Hi Gabriel,

Comments inline

> -----Original Message-----
> From: Gabriel Krisman Bertazi [mailto:[email protected]]
> Sent: Tuesday, January 16, 2018 17:51
> To: Weber, Olaf (HPC Data Management & Storage) <[email protected]>
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to
> charsets library
>
> "Weber, Olaf (HPC Data Management & Storage)" <[email protected]> writes:
>
> > But this is not the only problem. The 'len' limit applies to the input strings. So you need to tell
> > the utf8byte() routine that it applies. In other words, use utf8ncursor() which takes an additional
> > length parameter to set up the cursors.
> >
> > With this change, utf8byte() will return 0 when it hits the end of the input string due to seeing a
> > null byte or having consumed all characters, provided that it is not in the middle of a utf8 sequence
> > or an incomplete sequence of Unicode characters.
> >
> > Finally, note that utf8byte() returns an int, not a char. It does this for the same reasons getc() does.
> >
> > So utf8_strncmp() becomes something like the following. I'm using EINVAL instead of EIO, and note
> > that -EINVAL does not imply that str1 and str2 are not equal when compared as a sequence of bytes.
> >
> > static int utf8_strncmp(const struct charset *charset,
> > const char *str1,
> > const char *str2,
> > int len)
> > {
> > const struct utf8data *data = utf8nfkdi(charset->version);
> > struct utf8cursor cur1;
> > struct utf8cursor cur2;
> > int c1;
> > int c2;
> >
> > if (utf8ncursor(&cur1, data, str1, len) < 0)
> > return -EINVAL;
> > if (utf8ncursor(&cur2, data, str2, len) < 0)
> > return -EINVAL;
> >
> > do {
> > c1 = utf8byte(&cur1);
> > c2 = utf8byte(&cur2);
> >
> > if (c1 < 0 || c2 < 0)
> > return -EINVAL;
> > if (c1 != c2)
> > return 1;
> > } while (c1);
> >
> > return 0;
> > }
>
> Hi Olaf,
>
> Thanks for your review and deep explanations.
>
> I get your point and I've added a test case to trigger it in the
> test_ucd module.
>
> One question that I have, on the other hand: Take the version you
> shared, I want to avoid the -EINVAL for the case when strings s1
> and s2 should match as equal, but strlen(s1) < strlen (s2). In this
> case:
>
> strncmp (s1, s2, strlen (s2)) => Returns 0. Matches Ok
> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL
>
> I know -EINVAL doesn't mean they don't match, but this case seems too
> error prone.

If I understand your question correctly, the case of interest is

strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2)

As far as I can tell the code I sketched above handles that case in the
way you expect/want, when taking the complications introduced by
Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL
if, and only if, we read beyond the end (len) of the source string as part of
the normalization process. But if we are at an acceptable boundary in the
source string when we see the end of the string, utf8byte() returns 0,
indicating a normal/non-error end of the scan.

I think it may be worth writing some tests to (hopefully) confirm that
the code really does what I intended it to do. The most likely case to
fail would be where you hit the len-imposed end after a codepoint
with CCC != 0.

> I suppose we just could:
>
> (1) let the caller deal with it, which is error prone. Or,

The caller does have to do something when it gets -EINVAL. You have to
define the desired semantics of that case.

In the original XFS-filesystem code my choice was to treat invalid UTF-8
sequences as binary blobs for the sake of comparisons.
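
In caller terms that amounts to something like this (a sketch of the idea only, not the actual XFS code):

	r = utf8_strncmp(charset, s1, s2, len);
	if (r < 0)
		r = memcmp(s1, s2, len) ? 1 : 0;	/* compare as opaque bytes */

That way a malformed name still compares consistently, just without normalization.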

> (2) Require two lens on strncmp, one for each string, Or,

As a general rule this is certainly correct: each string has its
own associated maximum length within which it should have
been null-terminated. So whether you need one len per
string depends on the sources of the strings. In the original
XFS-based code there are some tricks related to this.

> (3) use utf8cursor for the second string, which plays badly with
> non-null-terminated strings, and those are important for filesystems.

I agree, if the string is not null-terminated, then utf8ncursor() is the only
viable interface.

> Do you see an alternative? I'm leaning towards option 2. Are you ok
> with that?

I'd say the proposed XFS code did a variant of your option 2. It also used
the available interfaces in a way that attempted to avoid memory
allocation unless absolutely necessary.

At the time I worked under the assumptions that both allocating
memory and normalizing a string were expensive, but also that I
could not permanently or even semi-permanently store normalized
forms of directory entries.

So the XFS code as written made a copy of the normalized form of the entry
being worked on, but gets the normalized bytes of the on-disk directory
entries on the fly.
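
The on-the-fly side looks roughly like this (hypothetical caller code, with 'target' being the
pre-normalized copy of the name being looked up):

	struct utf8cursor cur;
	const char *t = target;
	int c;

	if (utf8ncursor(&cur, data, ondisk_name, ondisk_len) < 0)
		return 1;	/* or fall back to a byte-wise compare */
	do {
		c = utf8byte(&cur);
		if (c < 0 || c != (unsigned char)*t++)
			return 1;	/* error or mismatch */
	} while (c);
	return 0;	/* match, including the terminating NUL */

No allocation is needed for the on-disk side; only the cursor lives on the stack.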

This also drove the design of utf8(n)cursor()/utf8byte(): I wanted to
avoid having to do memory management as much as possible, and also
needed to work with a limited (and fixed-size) stack footprint. Basically
these interfaces were written the way they are to make that (micro-)
optimization possible. Within those constraints I tried to make them
easy to use. I may not have succeeded.

Olaf

2018-01-23 03:33:06

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

"Weber, Olaf (HPC Data Management & Storage)" <[email protected]>
writes:


>> One question that I have, on the other hand: Take the version you
>> shared, I want to avoid the -EINVAL for the case when strings s1
>> and s2 should match as equal, but strlen(s1) < strlen (s2). In this
>> case:
>>
>> strncmp (s1, s2, strlen (s2)) => Returns 0. Matches Ok
>> strncmp (s1, s2, strlen (s1)) => Returns -EINVAL
>>
>> I know -EINVAL doesn't mean they don't match, but this case seems too
>> error prone.
>
> If I understand your question correctly, the case of interest is
>
> strncmp(s1, s2, len), where len <= strlen(s1) and len <= strlen(s2)
>
> As far as I can tell the code I sketched above handles that case in the
> way you expect/want, when taking the complications introduced by
> Unicode into account. Using utf8ncursor() ensures we do get an -EINVAL
> if, and only if, we read beyond the end (len) of the source string as part of
> the normalization process. But if we are at an acceptable boundary in the
> source string when we see the end of the string, utf8byte() returns 0,
> indicating a normal/non-error end of the scan.

Hey Olaf,

Sorry for the delay.

It is not quite that scenario. The version that takes only one length
fails when utf8ncursor() receives a len that is smaller than the length
of one of the strings, which is a common case when one string decomposes
to a longer sequence:

Take this case, for instance:

s1 = {0xc2, 0xbc, 0x00}, /* 'VULGAR FRACTION ONE QUARTER' decomposes to */
s2 = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00}, /* 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */

If we do strncmp(s1, s2, strlen(s2)), it works fine. But if we use
strlen(s1) on the third parameter, it fails. As far as I understand,
the issue happens because utf8lookup() will read at most len bytes,
aborting the lookup in the middle of a sequence. Since we never hit a
leaf for that code point, it assumes an invalid sequence and utf8byte()
aborts.

The easiest way to solve it is by receiving the two lens in strncmp.
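
Something like this, signature-wise (the names are just for illustration):

	static int utf8_strncmp(const struct charset *charset,
				const char *str1, int len1,
				const char *str2, int len2);

with each string getting its own utf8ncursor() set up using its own length.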

> I think it may be worth writing some tests to (hopefully) confirm that
> the code really does what I intended it to do. The most likely case to
> fail would be where you hit the len-imposed end after a codepoint
> with CCC != 0.

The only test I had for this scenario happened to have strlen(s1) ==
strlen(s2). I added the following, which I think catches this scenario:

/* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK' decomposes to */
/* 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */
s1 = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
s2 = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},

I tested your version, and it works correctly for this scenario too, as
long as we set the len parameter to the length of the longer string, s2,
instead of s1.

>> I suppose we just could:
>>
>> (1) let the caller deal with it, which is error prone. Or,
>
> The caller does have to do something when it gets -EINVAL. You have to
> define the desired semantics of that case.
>
> In the original XFS-filesystem code my choice was to treat invalid UTF-8
> sequences as binary blobs for the sake of comparisons.
>
>> (2) Require two lens on strncmp, one for each string, Or,
>
> As a general rule this is certainly correct: each string has its
> own associated maximum length within which it should have
> been null-terminated. So whether you need one len per
> string depends on the sources of the strings. In the original
> XFS-based code there are some tricks related to this.

I've applied this solution and it solves every test case correctly,
including those I mentioned above. Since it looks like the best
approach, I also applied the other changes you commented on, and modified
the comparison functions to take two lengths. I should submit a v2
shortly, once I'm done dealing with some changes to the fs part.

Thanks!

--
Gabriel Krisman Bertazi