2018-01-25 02:54:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Hi,

Along with the patch series, I am very interested in getting feedback on
the two items below, regarding VFS and NLS changes.

This is a v2 of the unicode + ext4 case-insensitive support which
extends support to Unicode 10.0.0, and applies the fixes suggested by
Olaf in the previous iteration. For the same reason as mentioned
before, the ucd files are not included in the RFC, but the relevant
patch file explains how to fetch them.

If you'd rather pull everything in this RFC at once, including the UCD
files, you can clone from:

https://gitlab.collabora.com/krisman/linux.git -b charset-lib

The original cover letter, with explanations on some of the design
decisions made in this RFC, is documented in the archive below:

https://www.spinics.net/lists/linux-ext4/msg59457.html

In addition to this RFC, I am making two new proposals (no code in this
RFC) for VFS and NLS, which I would like to hear feedback from you
before turning this from an RFC into a final patch submission:

(1) integrate the charset lib into the NLS system.

Basically, this requires introducing new higher-level hooks for string
comparison, like the ones we have in the charset patch, into the NLS
subsystem.

NLS also has to support versions of the same encoding, my idea is to
separate the information to register the encoding with the NLS system
into a separate structure, which is restricted to the NLS system. The
nls_table or a similar structure, which is then passed to users of the
library, will then be specific to a given version of the charset and
carry pointers to the functions specific to that version.

One final important point for NLS is that we need to prevent users from
mounting CI filesystems with encodings that don't support
normalization/comparison functions and try not the break compatibility
of filesystems that already do toupper/tolower without normalization.
These points are important to keep in mind but are quite trivial to
implement.

The second proposal is related to the VFS layer:

(2) Enable Insensitive lookup support on a per-mountpoint basis,
via a MS_CASEFOLD flag, with the ultimate goal of supporting a
case-insensitive bind mount of a subtree, side-by-side with a
sensitive version of the filesystem.

I have a prototype code at

https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold

Which is *not fully functional*, since it confuses the dentry cache when
multiple mountpoints are installed, but it gives an idea of the design,
if anyone wants to review it. Basically, I want to:

- Add a new MS_CASEFOLD mount option, which flips a flag in struct
vfsmount

- When this flag is enabled, a LOOKUP_CASEFOLD flag is submitted to
the fs .lookup() hook, asking it to perform a case-folded lookup.

- LOOKUP_CASEFOLD also replaces .d_hash() and d_compare() with
insensitive versions, provided by filesystems.

- Allow "mount -o remount,bind" to flip the MNT_CASEFOLD flag, similar
to what is done with the read-only setting.

- filesystems that support the MS_CASEFOLD flag need to advertise
support in struct file_system_type. There will be no generic
implementation of casefolding in the VFS layer for now. Either the
FS acknowledges support for it, or MS_CASEFOLD fails the mount
operation.

This is implemented in the branch above (along with the required
modifications for EXT4) except for the issue in the dentry cache, that I
am still working on.

Do these changes to VFS seem acceptable?

Thanks,

Gabriel Krisman Bertazi (9):
charsets: Introduce middle-layer for character encoding
charsets: ascii: Wrap ascii functions to charsets library
charsets: utf8: Hook-up utf-8 code to charsets library
charsets: utf8: Introduce test module for kernel UTF-8 implementation
ext4: Add ignorecase mount option
ext4: Include encoding information on the superblock
fscrypt: Introduce charset-based matching functions
ext4: Support charset name matching
ext4: Implement ext4 dcache hooks for custom charsets

Olaf Weber (4):
charsets: utf8: Add unicode character database files
scripts: add trie generator for UTF-8
charsets: utf8: Introduce code for UTF-8 normalization
charsets: utf8: reduce the size of utf8data[]

fs/ext4/dir.c | 63 +
fs/ext4/ext4.h | 6 +
fs/ext4/namei.c | 27 +-
fs/ext4/super.c | 35 +
include/linux/charsets.h | 73 +
include/linux/fscrypt.h | 1 +
include/linux/fscrypt_notsupp.h | 16 +
include/linux/fscrypt_supp.h | 27 +
include/linux/utf8norm.h | 116 ++
lib/Kconfig | 16 +
lib/Makefile | 2 +
lib/charsets/Makefile | 24 +
lib/charsets/ascii.c | 98 ++
lib/charsets/core.c | 68 +
lib/charsets/test_ucd.c | 186 +++
lib/charsets/ucd/README | 33 +
lib/charsets/utf8_core.c | 178 ++
lib/charsets/utf8norm.c | 794 +++++++++
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3464 +++++++++++++++++++++++++++++++++++++++
20 files changed, 5219 insertions(+), 9 deletions(-)
create mode 100644 include/linux/charsets.h
create mode 100644 include/linux/utf8norm.h
create mode 100644 lib/charsets/Makefile
create mode 100644 lib/charsets/ascii.c
create mode 100644 lib/charsets/core.c
create mode 100644 lib/charsets/test_ucd.c
create mode 100644 lib/charsets/ucd/README
create mode 100644 lib/charsets/utf8_core.c
create mode 100644 lib/charsets/utf8norm.c
create mode 100644 scripts/mkutf8data.c

--
2.15.1


2018-01-25 02:53:37

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 01/13] charsets: Introduce middle-layer for character encoding

This implements an abstraction for high-level encoding-wise string
manipulation functions. It defines some hooks that encoding modules must
implement, which will be used by filesystem code to support lookups that
consider normalization and case-folding.

Changes since RFC v1:
- Export charset_load symbol.
- Include length parameter for second string on comparison functions.
- Changed length type to size_t.
- Fix bad memory access when trying to load invalid charset

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/charsets.h | 75 ++++++++++++++++++++++++++++++++++++++++++++++++
lib/Kconfig | 2 ++
lib/Makefile | 2 ++
lib/charsets/Makefile | 3 ++
lib/charsets/core.c | 69 ++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 151 insertions(+)
create mode 100644 include/linux/charsets.h
create mode 100644 lib/charsets/Makefile
create mode 100644 lib/charsets/core.c

diff --git a/include/linux/charsets.h b/include/linux/charsets.h
new file mode 100644
index 000000000000..3abe92cc0bc6
--- /dev/null
+++ b/include/linux/charsets.h
@@ -0,0 +1,75 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef _CHARSET_H
+#define _CHARSET_H
+
+#include <linux/types.h>
+
+struct charset_info;
+struct charset;
+
+struct charset_ops {
+ int (*strncmp)(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2);
+ int (*strncasecmp)(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2);
+ int (*casefold)(const struct charset *charset, const char *str,
+ int len, char **folded);
+ int (*normalize)(const struct charset *charset, const char *str,
+ int len, char **normalization);
+};
+
+struct charset {
+ const struct charset_info *info;
+ unsigned int version;
+ const struct charset_ops *ops;
+};
+
+struct charset_info {
+ char *name;
+ char *match_token;
+ struct charset* (*load_charset)(void *args);
+};
+
+static inline int charset_strncmp(const struct charset *charset,
+ const char *str1, size_t len1,
+ const char *str2, size_t len2)
+{
+ return charset->ops->strncmp(charset, str1, len1, str2, len2);
+}
+
+static inline int charset_strncasecmp(const struct charset *charset,
+ const char *str1, size_t len1,
+ const char *str2, size_t len2)
+{
+ return charset->ops->strncasecmp(charset, str1, len1, str2, len2);
+}
+
+static inline int charset_casefold(const struct charset *charset,
+ const char *str, int len, char **folded)
+{
+ return charset->ops->casefold(charset, str, len, folded);
+}
+
+static inline int charset_normalize(const struct charset *charset,
+ const char *str, int len,
+ char **normalization)
+{
+ return charset->ops->normalize(charset, str, len, normalization);
+}
+
+int charset_register(struct charset_info *charset);
+const struct charset *charset_load(char *charset);
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index c5e84fbcb30b..bf5c751cfb8a 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -582,6 +582,8 @@ config PRIME_NUMBERS
config STRING_SELFTEST
tristate "Test string functions"

+config CHARSETS
+ tristate "Character encoding sets"
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/Makefile b/lib/Makefile
index d11c48ec8ffd..f6b2360fedfa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -258,3 +258,5 @@ obj-$(CONFIG_GENERIC_LSHRDI3) += lshrdi3.o
obj-$(CONFIG_GENERIC_MULDI3) += muldi3.o
obj-$(CONFIG_GENERIC_CMPDI2) += cmpdi2.o
obj-$(CONFIG_GENERIC_UCMPDI2) += ucmpdi2.o
+
+obj-$(CONFIG_CHARSETS) += charsets/
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
new file mode 100644
index 000000000000..01ff9fd09f98
--- /dev/null
+++ b/lib/charsets/Makefile
@@ -0,0 +1,3 @@
+charsets-y += core.o
+
+obj-$(CONFIG_CHARSETS) += charsets.o
diff --git a/lib/charsets/core.c b/lib/charsets/core.c
new file mode 100644
index 000000000000..238088cbb641
--- /dev/null
+++ b/lib/charsets/core.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/charsets.h>
+#include <linux/parser.h>
+
+#define MAX_ENCODINGS 10
+
+static struct match_token encoding_tokens[MAX_ENCODINGS + 1];
+static struct charset_info *charsets[MAX_ENCODINGS];
+static int n_encodings;
+
+const struct charset *charset_load(char *charset)
+{
+ substring_t args[MAX_OPT_ARGS];
+ int token;
+
+ args[0].to = args[0].from = NULL;
+ token = match_token(charset, encoding_tokens, args);
+
+ if (!encoding_tokens[token].pattern)
+ return NULL;
+
+ return charsets[token]->load_charset(args);
+}
+EXPORT_SYMBOL(charset_load);
+
+int charset_register(struct charset_info *charset)
+{
+ encoding_tokens[n_encodings].token = n_encodings;
+ encoding_tokens[n_encodings].pattern = charset->match_token;
+
+ charsets[n_encodings] = charset;
+ n_encodings += 1;
+ return 0;
+}
+EXPORT_SYMBOL(charset_register);
+
+static int __init init_charset(void)
+{
+ memset(encoding_tokens, 0, sizeof(encoding_tokens));
+ n_encodings = 0;
+
+ return 0;
+}
+
+static void __exit exit_charset(void)
+{
+}
+
+module_init(init_charset);
+module_exit(exit_charset);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("charset abstraction for filesystems");
+MODULE_LICENSE("GPL");
--
2.15.1

2018-01-25 02:53:39

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 03/13] charsets: utf8: Add unicode character database files

From: Olaf Weber <[email protected]>

Add files from the Unicode Character Database, version 10.0.0, to the source.
A helper program that generates a trie used for normalization from these
files is part of a separate commit.

- Notes on the update from 8.0.0 and 10.0.0:

The structure of ucd files and special cases have not experienced any
changes between versions 8.0.0 and 10.0.0. 8.0.0 saw the addition of
Cherokee LC characters, which is an interesting case for case-folding.
The update is accompanied by new tests on the test_ucd module to catch
specific cases. No changes to mkutf8data script was required for the
update.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Move ucd directory to lib/charsets]
[Update to ucd-10.0.0]
---
lib/charsets/ucd/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 lib/charsets/ucd/README

diff --git a/lib/charsets/ucd/README b/lib/charsets/ucd/README
new file mode 100644
index 000000000000..67f2075d1fca
--- /dev/null
+++ b/lib/charsets/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 10.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/10.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/10.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/10.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/10.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/10.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/10.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/10.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 7893b6e005c5a521319a0d12062ae122 CaseFolding-10.0.0.txt
+ a602e4b44de3350087e40f2eb2184898 DerivedAge-10.0.0.txt
+ 5abdeb21af4edcc5d1e4c0b5802fc7a7 DerivedCombiningClass-10.0.0.txt
+ eda11c2c2e3c308d9d3b90e2b3282024 DerivedCoreProperties-10.0.0.txt
+ 425ece5ffbecd0140d98c13ce05724aa NormalizationCorrections-10.0.0.txt
+ 7296fe7aa07d7d288e65d559af2ad49b NormalizationTest-10.0.0.txt
+ 2a52f30695dcc821f0f224650552beaf UnicodeData-10.0.0.txt
--
2.15.1

2018-01-25 02:54:55

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 04/13] scripts: add trie generator for UTF-8

From: Olaf Weber <[email protected]>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 10.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebased to mainline]
[Fix out-of-tree-build]
[Fix checkpatch warnings]
[Merge back robustness fixes from original patch. Requested by Dave Chinner]
[Update makefile to build 10.0.0 ucd files]
---
lib/Kconfig | 9 +
lib/charsets/Makefile | 14 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3239 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 3263 insertions(+)
create mode 100644 scripts/mkutf8data.c

diff --git a/lib/Kconfig b/lib/Kconfig
index bf5c751cfb8a..9ca3bda7b2be 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -584,6 +584,15 @@ config STRING_SELFTEST

config CHARSETS
tristate "Character encoding sets"
+
+#
+# utf8 normalization module
+#
+config UTF8_NORMALIZATION
+ tristate "UTF-8 normalization support"
+ depends on CHARSETS
+ help
+ Say Y here to enable utf8 normalization support.
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 2184e7ff25de..8ab7931969ca 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -3,3 +3,17 @@ charsets-y += core.o
obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o
+
+$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
+ $(call cmd,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+ cmd_mkutf8data = $(objtree)/scripts/mkutf8data \
+ -a $(srctree)/$(src)/ucd/DerivedAge-10.0.0.txt \
+ -c $(srctree)/$(src)/ucd/DerivedCombiningClass-10.0.0.txt \
+ -p $(srctree)/$(src)/ucd/DerivedCoreProperties-10.0.0.txt \
+ -d $(srctree)/$(src)/ucd/UnicodeData-10.0.0.txt \
+ -f $(srctree)/$(src)/ucd/CaseFolding-10.0.0.txt \
+ -n $(srctree)/$(src)/ucd/NormalizationCorrections-10.0.0.txt \
+ -t $(srctree)/$(src)/ucd/NormalizationTest-10.0.0.txt \
+ -o $@
+
diff --git a/scripts/Makefile b/scripts/Makefile
index 25ab143cbe14..f681601aa15c 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -19,6 +19,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler
hostprogs-$(CONFIG_MODULE_SIG) += sign-file
hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_UTF8_NORMALIZATION) += mkutf8data

HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
new file mode 100644
index 000000000000..700b41c0cb66
--- /dev/null
+++ b/scripts/mkutf8data.c
@@ -0,0 +1,3239 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int
+age_valid(unsigned int major, unsigned int minor, unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the begin or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int
+utf8encode(char *str, unsigned int val)
+{
+ int len;
+
+ if (val < 0x80) {
+ str[0] = val;
+ len = 1;
+ } else if (val < 0x800) {
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_2_BITS;
+ len = 2;
+ } else if (val < 0x10000) {
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_3_BITS;
+ len = 3;
+ } else if (val < 0x110000) {
+ str[3] = val & UTF8_V_MASK;
+ str[3] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_4_BITS;
+ len = 4;
+ } else {
+ printf("%#x: illegal val\n", val);
+ len = 0;
+ }
+ return len;
+}
+
+static unsigned int
+utf8decode(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
+
+static int
+utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *
+lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void
+tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *
+alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int
+insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void
+prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void
+mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int
+index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align to a cache line (or half a cache line?). */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int
+size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void
+emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfkdi;
+ unsigned int *utf32nfkdicf;
+ char *utf8nfkdi;
+ char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *
+corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int
+nfkdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static int
+nfkdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdicf && right->utf8nfkdicf &&
+ strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+ return 1;
+ if (left->utf8nfkdicf && right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdicf || right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static void
+nfkdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static void
+nfkdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdicf)
+ printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static int
+nfkdi_mark(void *l)
+{
+ return 1;
+}
+
+static int
+nfkdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ if (leaf->utf8nfkdicf)
+ return 1;
+ return 0;
+}
+
+static int
+correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return leaf->correction;
+}
+
+static int
+nfkdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int
+nfkdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdicf)
+ size += strlen(leaf->utf8nfkdicf) + 1;
+ else if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int *
+nfkdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static int *
+nfkdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *
+nfkdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *
+nfkdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static void
+utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfkdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ data->utf8nfkdi = strdup(utf);
+ }
+ u = utf;
+ um = data->utf32nfkdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf))
+ data->utf8nfkdicf = strdup(utf);
+ }
+}
+
+static void
+utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void
+trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfkdi and nfkdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfkdi_mark;
+ trees[trees_count-2].leaf_mark = nfkdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfkdicf";
+ trees[i].leaf_equal = nfkdicf_equal;
+ trees[i].leaf_print = nfkdicf_print;
+ trees[i].leaf_size = nfkdicf_size;
+ trees[i].leaf_index = nfkdicf_index;
+ trees[i].leaf_emit = nfkdicf_emit;
+
+ trees[i+1].type = "nfkdi";
+ trees[i+1].leaf_equal = nfkdi_equal;
+ trees[i+1].leaf_print = nfkdi_print;
+ trees[i+1].leaf_size = nfkdi_size;
+ trees[i+1].leaf_index = nfkdi_index;
+ trees[i+1].leaf_emit = nfkdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
+
+static void
+trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8encode(keyval, unichar);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void
+trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfkdi = utf8data + trees[trees_count-1].index;
+ nfkdicf = utf8data + trees[trees_count-2].index;
+
+ nfkdi_tree = &trees[trees_count-1];
+ nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void
+verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfkdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8encode(key,unichar);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfkdi) {
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfkdicf &&
+ !data->utf8nfkdi) {
+ report++;
+ } else if (data->utf8nfkdicf) {
+ if (strcmp(data->utf8nfkdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfkdi);
+ if (leaf) {
+ printf(" gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void
+trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates an a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfkdi:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfkdicf:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFKD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 7.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void
+usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void
+open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void
+file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void
+line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+print_utf32(unsigned int *utf32str)
+{
+ int i;
+
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void
+print_utf32nfkdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdi);
+ printf("\n");
+}
+
+static void
+print_utf32nfkdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
+
+static void
+ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static void
+nfkdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void
+nfkdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void
+ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void
+corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfkdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ assert(!unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
+
+static void
+nfkdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+ }
+ /* Add this decomposition to nfkdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfkdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void
+nfkdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *
+utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie = utf8data + tree->index;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!tree)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ return NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ return NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *
+utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int
+utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int
+utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int
+utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int
+utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t
+utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t
+utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : string.
+ * len : length of s.
+ * u8c : pointer to cursor.
+ * trie : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8ncursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * trie : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int
+utf8cursor(
+ struct utf8cursor *u8c,
+ struct tree *tree,
+ const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int
+utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8decode(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int
+normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[strlen(s) + 1] = -1;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void
+normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfkdi && !*data->utf8nfkdi)
+ ignorables = 1;
+ else
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfkdi_tree) < 0) {
+ printf("Line %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void
+write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only xfs_utf8.c may include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int
+main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfkdi_init();
+ nfkdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfkdi_decompose();
+ nfkdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfkdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfkdi_tree);
+ if (verbose > 2)
+ tree_walk(nfkdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
--
2.15.1

2018-01-25 02:53:38

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 02/13] charsets: ascii: Wrap ascii functions to charsets library

This allows filesystems to always use the charsets interfaces, even when
not caring about encoding.

Changes since RFC v1:
- Include length parameter for second string on comparison functions.
- Cast length type to size_t.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/charsets/Makefile | 2 +
lib/charsets/ascii.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 104 insertions(+)
create mode 100644 lib/charsets/ascii.c

diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 01ff9fd09f98..2184e7ff25de 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -1,3 +1,5 @@
charsets-y += core.o

obj-$(CONFIG_CHARSETS) += charsets.o
+
+obj-$(CONFIG_CHARSETS) += ascii.o
diff --git a/lib/charsets/ascii.c b/lib/charsets/ascii.c
new file mode 100644
index 000000000000..42f3ec00e41c
--- /dev/null
+++ b/lib/charsets/ascii.c
@@ -0,0 +1,102 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/charsets.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/parser.h>
+
+static struct charset_info ascii_info;
+
+int ascii_strncmp(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2)
+{
+ size_t len = (len1 < len2)? len1:len2;
+
+ return strncmp(str1, str2, len);
+}
+
+int ascii_strncasecmp(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2)
+{
+ size_t len = (len1 < len2)? len1:len2;
+
+ return strncasecmp(str1, str2, len);
+}
+
+int ascii_normalize(const struct charset *charset, const char *str,
+ int len, char **normalization)
+{
+ *normalization = kstrdup(str, GFP_NOFS);
+ return (*normalization) ? len : -ENOMEM;
+}
+
+int ascii_casefold(const struct charset *charset, const char *str,
+ int len, char **folded_str)
+{
+ int i;
+ char *fold;
+
+ fold = kstrdup(str, GFP_NOFS);
+ if (!fold)
+ return -ENOMEM;
+
+ for (i = 0; i < len; i++)
+ fold[i] = tolower(fold[i]);
+
+ *folded_str = fold;
+ return len;
+}
+
+static const struct charset_ops ascii_ops = {
+ .strncmp = ascii_strncmp,
+ .strncasecmp = ascii_strncasecmp,
+ .casefold = ascii_casefold,
+ .normalize = ascii_normalize,
+};
+
+static struct charset ascii_charset = {
+ .version = 0,
+ .info = &ascii_info,
+ .ops = &ascii_ops
+};
+
+static struct charset *ascii_load_charset(void *pargs)
+{
+ return &ascii_charset;
+}
+
+static struct charset_info ascii_info = {
+ .name = "ascii",
+ .match_token = "ascii",
+ .load_charset = ascii_load_charset,
+};
+
+static int __init init_ascii(void)
+{
+ charset_register(&ascii_info);
+ return 0;
+}
+
+static void __exit exit_ascii(void)
+{
+}
+
+module_init(init_ascii);
+module_exit(exit_ascii);
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("ASCII charset for filesystems");
+MODULE_LICENSE("GPL");
+
--
2.15.1

2018-01-25 02:53:41

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 05/13] charsets: utf8: Introduce code for UTF-8 normalization

From: Olaf Weber <[email protected]>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfkdi and nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFFF are not encoded.
- The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix) and
on length-limited strings (utf8n prefix).

>From the original SGI patch and for conformity with coding standards,
the utf8data_t typedef was dropped, since it was just masking the struct
keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I
decided to keep it, since they are simple pointers to memory buffers,
and using uchars here wouldn't provide any more meaningful information.

Changes since RFC v1:
- utf8_version_is_supported receives maj, min and rev as separate
arguments. (Olaf Weber)

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to Mainline]
[Fix up checkpatch.pl warnings]
[Drop typedefs]
[Merge with charsets subsystem]
---
include/linux/utf8norm.h | 112 +++++++++
lib/charsets/Makefile | 4 +
lib/charsets/utf8norm.c | 644 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 760 insertions(+)
create mode 100644 include/linux/utf8norm.h
create mode 100644 lib/charsets/utf8norm.c

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
new file mode 100644
index 000000000000..0f5fc14d4fd2
--- /dev/null
+++ b/include/linux/utf8norm.h
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ, MIN, REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
+
+/*
+ * Look for the correct const struct utf8data for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern const struct utf8data *utf8nfkdi(unsigned int maxage);
+extern const struct utf8data *utf8nfkdicf(unsigned int maxage);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(const struct utf8data *data, const char *s);
+extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(const struct utf8data *data, const char *s);
+extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(const struct utf8data *data, const char *s);
+extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ const struct utf8data *data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s);
+extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *u8c);
+
+#endif /* UTF8NORM_H */
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 8ab7931969ca..d33ac8643ef9 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -4,6 +4,10 @@ obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o

+utf8-y += utf8norm.o
+obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o
+
+$(obj)/utf8norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
$(call cmd,mkutf8data)
quiet_cmd_mkutf8data = MKUTF8DATA $@
diff --git a/lib/charsets/utf8norm.c b/lib/charsets/utf8norm.c
new file mode 100644
index 000000000000..c225a13f6eea
--- /dev/null
+++ b/lib/charsets/utf8norm.c
@@ -0,0 +1,644 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/utf8norm.h>
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int utf8version_is_supported(u8 maj, u8 min, u8 rev)
+{
+ int i = ARRAY_SIZE(utf8agetab) - 1;
+ unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev);
+
+ while (i >= 0 && utf8agetab[i] != 0) {
+ if (sb_utf8version == utf8agetab[i])
+ return 1;
+ i--;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(utf8version_is_supported);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the begin or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
+ size_t len)
+{
+ utf8trie_t *trie = utf8data + data->offset;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!data)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ ccc = LEAF_CCC(leaf);
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if (ccc == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *utf8nfkdi(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdidata) - 1;
+
+ while (maxage < utf8nfkdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdidata[i].maxage)
+ return NULL;
+ return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *utf8nfkdicf(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdicfdata) - 1;
+
+ while (maxage < utf8nfkdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
+
+MODULE_AUTHOR("SGI");
+MODULE_DESCRIPTION("utf8 normalization");
+MODULE_LICENSE("GPL");
--
2.15.1

2018-01-25 02:53:42

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 06/13] charsets: utf8: reduce the size of utf8data[]

From: Olaf Weber <[email protected]>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix checkpatch errors]
[Extract robustness fixes and merge back to original mkutf8data.c
patch]
---
include/linux/utf8norm.h | 4 +
lib/charsets/utf8norm.c | 191 ++++++++++++++++++++++++++----
scripts/mkutf8data.c | 295 +++++++++++++++++++++++++++++++++++++++++------
3 files changed, 435 insertions(+), 55 deletions(-)

diff --git a/include/linux/utf8norm.h b/include/linux/utf8norm.h
index 0f5fc14d4fd2..f60827663503 100644
--- a/include/linux/utf8norm.h
+++ b/include/linux/utf8norm.h
@@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
extern ssize_t utf8len(const struct utf8data *data, const char *s);
extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);

+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF (12)
+
/*
* Cursor structure used by the normalizer.
*/
@@ -89,6 +92,7 @@ struct utf8cursor {
unsigned int slen;
short int ccc;
short int nccc;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
diff --git a/lib/charsets/utf8norm.c b/lib/charsets/utf8norm.c
index c225a13f6eea..8cd389c38f2b 100644
--- a/lib/charsets/utf8norm.c
+++ b/lib/charsets/utf8norm.c
@@ -98,6 +98,38 @@ static inline int utf8clen(const char *s)
return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

+/*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+ unsigned int uc;
+
+ uc = *str++ & 0x0F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+
+ return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+ str[2] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[1] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[0] = val | 0xE0;
+
+ return 3;
+}
+
/*
* utf8trie_t
*
@@ -159,7 +191,8 @@ typedef const unsigned char utf8trie_t;
* characters with the Default_Ignorable_Code_Point property.
* These do affect normalization, as they all have CCC 0.
*
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
*
* Casefolding, if applicable, is also done using decompositions.
*
@@ -179,6 +212,105 @@ typedef const unsigned char utf8leaf_t;
#define STOPPER (0)
#define DECOMPOSE (255)

+/* Marker for hangul syllable decomposition. */
+#define HANGUL ((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF (12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, TPart, VPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode3(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode3((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -187,8 +319,8 @@ typedef const unsigned char utf8leaf_t;
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
- size_t len)
+static utf8leaf_t *utf8nlookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + data->offset;
int offlen;
@@ -226,8 +358,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie++;
} else {
/* No right node. */
- node = 0;
- trie = NULL;
+ return NULL;
}
} else {
/* Left leg */
@@ -237,8 +368,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie += offlen + 1;
} else if (*trie & RIGHTPATH) {
/* No left node. */
- node = 0;
- trie = NULL;
+ return NULL;
} else {
/* Left node after this node */
node = (*trie & TRIENODE);
@@ -246,6 +376,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -255,9 +393,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
*
* Forwards to utf8nlookup().
*/
-static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+static utf8leaf_t *utf8lookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s)
{
- return utf8nlookup(data, s, (size_t)-1);
+ return utf8nlookup(data, hangul, s, (size_t)-1);
}

/*
@@ -270,11 +409,13 @@ int utf8agemax(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;

@@ -297,12 +438,13 @@ int utf8agemin(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -323,11 +465,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -349,12 +493,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -377,11 +522,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -404,11 +550,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -531,10 +678,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->data, u8c->s);
- else
- leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->data, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -555,7 +704,9 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->data, u8c->s);
+
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ ccc = LEAF_CCC(leaf);
}

/*
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
index 700b41c0cb66..69f3be92ba71 100644
--- a/scripts/mkutf8data.c
+++ b/scripts/mkutf8data.c
@@ -180,10 +180,14 @@ typedef unsigned char utf8leaf_t;
#define MAXCCC (254)
#define STOPPER (0)
#define DECOMPOSE (255)
+#define HANGUL ((char)(255))
+
+#define UTF8HANGULLEAF (12)

struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+ const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);

unsigned char *utf8data;
size_t utf8data_size;
@@ -334,6 +338,8 @@ utf32valid(unsigned int unichar)
return unichar < 0x110000;
}

+#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3)
+
#define NODE 1
#define LEAF 0

@@ -466,7 +472,7 @@ tree_walk(struct tree *tree)
indent+1);
leaves += 1;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -864,7 +870,7 @@ mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
continue;
}
@@ -916,7 +922,7 @@ mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
if (!node->mark && node->parent->mark &&
!node->parent->left) {
@@ -1000,7 +1006,7 @@ index_nodes(struct tree *tree, int index)
index += tree->leaf_size(node->right);
count++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1021,6 +1027,26 @@ index_nodes(struct tree *tree, int index)
return index;
}

+/*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int
+mark_subtree(struct node *node)
+{
+ int changed;
+
+ if (!node || node->mark)
+ return 0;
+ node->mark = 1;
+ node->index = node->parent->index;
+ changed = 1;
+ if (node->leftnode == NODE)
+ changed += mark_subtree(node->left);
+ if (node->rightnode == NODE)
+ changed += mark_subtree(node->right);
+ return changed;
+}
+
/*
* Compute the size of nodes and leaves. We start by assuming that
* each node needs to store a three-byte offset. The indexes of the
@@ -1040,6 +1066,7 @@ size_nodes(struct tree *tree)
unsigned int bitmask;
unsigned int pathbits;
unsigned int pathmask;
+ unsigned int nbit;
int changed;
int offset;
int size;
@@ -1067,22 +1094,40 @@ size_nodes(struct tree *tree)
size = 1;
} else {
if (node->rightnode == NODE) {
+ /*
+ * If the right node is not marked,
+ * look for a corresponding node in
+ * the next tree. Such a node need
+ * not exist.
+ */
right = node->right;
next = tree->next;
while (!right->mark) {
assert(next);
n = next->root;
while (n->bitnum != node->bitnum) {
- if (pathbits & (1<<n->bitnum))
+ nbit = 1 << n->bitnum;
+ if (!(pathmask & nbit))
+ break;
+ if (pathbits & nbit) {
+ if (n->rightnode == LEAF)
+ break;
n = n->right;
- else
+ } else {
+ if (n->leftnode == LEAF)
+ break;
n = n->left;
+ }
}
+ if (n->bitnum != node->bitnum)
+ break;
n = n->right;
- assert(right->bitnum == n->bitnum);
right = n;
next = next->next;
}
+ /* Make sure the right node is marked. */
+ if (!right->mark)
+ changed += mark_subtree(right);
offset = right->index - node->index;
} else {
offset = *tree->leaf_index(tree, node->right);
@@ -1124,7 +1169,7 @@ size_nodes(struct tree *tree)
if (node->rightnode == LEAF) {
assert(node->right);
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1158,8 +1203,15 @@ emit(struct tree *tree, unsigned char *data)
int offset;
int index;
int indent;
+ int size;
+ int bytes;
+ int leaves;
+ int nodes[4];
unsigned char byte;

+ nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+ leaves = 0;
+ bytes = 0;
index = tree->index;
data += index;
indent = 1;
@@ -1168,7 +1220,10 @@ emit(struct tree *tree, unsigned char *data)
if (tree->childnode == LEAF) {
assert(tree->root);
tree->leaf_emit(tree->root, data);
- return;
+ size = tree->leaf_size(tree->root);
+ index += size;
+ leaves++;
+ goto done;
}

assert(tree->childnode == NODE);
@@ -1195,6 +1250,7 @@ emit(struct tree *tree, unsigned char *data)
offlen = 2;
else
offlen = 3;
+ nodes[offlen]++;
offset = node->offset;
byte |= offlen << OFFLEN_SHIFT;
*data++ = byte;
@@ -1207,12 +1263,14 @@ emit(struct tree *tree, unsigned char *data)
} else if (node->left) {
if (node->leftnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else if (node->right) {
byte |= RIGHTNODE;
if (node->rightnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else {
@@ -1227,7 +1285,10 @@ emit(struct tree *tree, unsigned char *data)
assert(node->left);
data = tree->leaf_emit(node->left,
data);
- index += tree->leaf_size(node->left);
+ size = tree->leaf_size(node->left);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->left) {
assert(node->leftnode == NODE);
indent += 1;
@@ -1241,9 +1302,12 @@ emit(struct tree *tree, unsigned char *data)
assert(node->right);
data = tree->leaf_emit(node->right,
data);
- index += tree->leaf_size(node->right);
+ size = tree->leaf_size(node->right);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1255,6 +1319,15 @@ emit(struct tree *tree, unsigned char *data)
indent -= 1;
}
}
+done:
+ if (verbose > 0) {
+ printf("Emitted %d (%d) leaves",
+ leaves, bytes);
+ printf(" %d (%d+%d+%d+%d) nodes",
+ nodes[0] + nodes[1] + nodes[2] + nodes[3],
+ nodes[0], nodes[1], nodes[2], nodes[3]);
+ printf(" %d total\n", index - tree->index);
+ }
}

/* ------------------------------------------------------------------ */
@@ -1360,7 +1433,9 @@ nfkdi_print(void *l, int indent)

printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
leaf->code, leaf->ccc, leaf->gen);
- if (leaf->utf8nfkdi)
+ if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
+ else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
}
@@ -1374,6 +1449,8 @@ nfkdicf_print(void *l, int indent)
leaf->code, leaf->ccc, leaf->gen);
if (leaf->utf8nfkdicf)
printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
@@ -1409,7 +1486,9 @@ nfkdi_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdi)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
return size;
}
@@ -1420,7 +1499,9 @@ nfkdicf_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdicf)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdicf)
size += strlen(leaf->utf8nfkdicf) + 1;
else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
@@ -1450,7 +1531,10 @@ nfkdi_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdi) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdi) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdi;
while ((*data++ = *s++) != 0)
@@ -1468,7 +1552,10 @@ nfkdicf_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdicf) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdicf) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdicf;
while ((*data++ = *s++) != 0)
@@ -1492,6 +1579,11 @@ utf8_create(struct unicode_data *data)
unsigned int *um;
int i;

+ if (data->utf8nfkdi) {
+ assert(data->utf8nfkdi[0] == HANGUL);
+ return;
+ }
+
u = utf;
um = data->utf32nfkdi;
if (um) {
@@ -1682,6 +1774,7 @@ verify(struct tree *tree)
utf8leaf_t *leaf;
unsigned int unichar;
char key[4];
+ unsigned char hangul[UTF8HANGULLEAF];
int report;
int nocf;

@@ -1695,7 +1788,8 @@ verify(struct tree *tree)
if (data->correction <= tree->maxage)
data = &unicode_data[unichar];
utf8encode(key,unichar);
- leaf = utf8lookup(tree, key);
+ leaf = utf8lookup(tree, hangul, key);
+
if (!leaf) {
if (data->gen != -1)
report++;
@@ -1709,7 +1803,10 @@ verify(struct tree *tree)
if (data->gen != LEAF_GEN(leaf))
report++;
if (LEAF_CCC(leaf) == DECOMPOSE) {
- if (nocf) {
+ if (HANGUL_SYLLABLE(data->code)) {
+ if (data->utf8nfkdi[0] != HANGUL)
+ report++;
+ } else if (nocf) {
if (!data->utf8nfkdi) {
report++;
} else if (strcmp(data->utf8nfkdi,
@@ -2394,6 +2491,15 @@ hangul_decompose(void)
memcpy(um, mapping, i * sizeof(unsigned int));
unicode_data[unichar].utf32nfkdicf = um;

+ /*
+ * Add a cookie as a reminder that the hangul syllable
+ * decompositions must not be stored in the generated
+ * trie.
+ */
+ unicode_data[unichar].utf8nfkdi = malloc(2);
+ unicode_data[unichar].utf8nfkdi[0] = HANGUL;
+ unicode_data[unichar].utf8nfkdi[1] = '\0';
+
if (verbose > 1)
print_utf32nfkdi(unichar);

@@ -2521,6 +2627,100 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *);
int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
int utf8byte(struct utf8cursor *);

+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -2530,7 +2730,7 @@ int utf8byte(struct utf8cursor *);
* shorthand for this will be "is valid UTF-8 unicode".
*/
static utf8leaf_t *
-utf8nlookup(struct tree *tree, const char *s, size_t len)
+utf8nlookup(struct tree *tree, unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + tree->index;
int offlen;
@@ -2586,6 +2786,14 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -2596,9 +2804,9 @@ utf8nlookup(struct tree *tree, const char *s, size_t len)
* Forwards to trie_nlookup().
*/
static utf8leaf_t *
-utf8lookup(struct tree *tree, const char *s)
+utf8lookup(struct tree *tree, unsigned char *hangul, const char *s)
{
- return utf8nlookup(tree, s, (size_t)-1);
+ return utf8nlookup(tree, hangul, s, (size_t)-1);
}

/*
@@ -2624,11 +2832,14 @@ utf8agemax(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2649,12 +2860,14 @@ utf8agemin(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2674,11 +2887,14 @@ utf8nagemax(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2699,12 +2915,14 @@ utf8nagemin(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2726,11 +2944,13 @@ utf8len(struct tree *tree, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2752,11 +2972,13 @@ utf8nlen(struct tree *tree, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2784,6 +3006,7 @@ struct utf8cursor {
short int ccc;
short int nccc;
unsigned int unichar;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
@@ -2900,10 +3123,12 @@ utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->tree, u8c->s);
- else
- leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->tree, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -2923,7 +3148,7 @@ utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->tree, u8c->s);
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
ccc = LEAF_CCC(leaf);
}
u8c->unichar = utf8decode(u8c->s);
--
2.15.1

2018-01-25 02:55:10

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 07/13] charsets: utf8: Hook-up utf-8 code to charsets library

Changes since RFC v1:
- Change error return code from EIO to EINVAL. (Olaf Weber)
- Fix issues with strncmp/strcmp. (Olaf Weber)
- Remove stack buffer in normalization/casefold. (Olaf Weber)
- Include length parameter for second string on comparison functions.
- Change length type to size_t.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/charsets/Makefile | 2 +-
lib/charsets/utf8_core.c | 180 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 181 insertions(+), 1 deletion(-)
create mode 100644 lib/charsets/utf8_core.c

diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index d33ac8643ef9..004d0e5d6ac4 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -4,7 +4,7 @@ obj-$(CONFIG_CHARSETS) += charsets.o

obj-$(CONFIG_CHARSETS) += ascii.o

-utf8-y += utf8norm.o
+utf8-y += utf8_core.o utf8norm.o
obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o

$(obj)/utf8norm.o: $(obj)/utf8data.h
diff --git a/lib/charsets/utf8_core.c b/lib/charsets/utf8_core.c
new file mode 100644
index 000000000000..f1c4fcc249fb
--- /dev/null
+++ b/lib/charsets/utf8_core.c
@@ -0,0 +1,180 @@
+/*
+ * Copyright (c) 2017 Collabora Ltd.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include <linux/charsets.h>
+#include <linux/utf8norm.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/string.h>
+
+static int utf8_strncmp(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur1, cur2;
+ int c1, c2;
+ int r;
+
+ r = utf8ncursor(&cur1, data, str1, len1);
+ if (r < 0)
+ return -EINVAL;
+ r = utf8ncursor(&cur2, data, str2, len2);
+ if (r < 0)
+ return -EINVAL;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ return -EINVAL;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+}
+
+static int utf8_strncasecmp(const struct charset *charset, const char *str1,
+ size_t len1, const char *str2, size_t len2)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur1, cur2;
+ unsigned char c1, c2;
+ int r;
+
+ r = utf8ncursor(&cur1, data, str1, len1);
+ if (r < 0)
+ return -EINVAL;
+
+ r = utf8ncursor(&cur2, data, str2, len2);
+ if (r < 0)
+ return -EINVAL;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ return -EINVAL;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+}
+
+int utf8_casefold(const struct charset *charset, const char *str, int len,
+ char **folded)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur;
+ char *s;
+ ssize_t nlen;
+
+ nlen = utf8nlen(data, str, len);
+ if (nlen < 0)
+ return -EINVAL;
+
+ s = kmalloc(nlen + 1, GFP_NOFS);
+ if (!s)
+ return -ENOMEM;
+ *folded = s;
+
+ utf8ncursor(&cur, data, str, len);
+ do {
+ *s = utf8byte(&cur);
+ } while (*s++);
+
+ return nlen;
+}
+
+int utf8_normalize(const struct charset *charset, const char *str, int len,
+ char **normalization)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur;
+ char *s;
+ ssize_t nlen;
+
+ nlen = utf8nlen(data, str, len);
+ if (nlen < 0)
+ return -EINVAL;
+
+ s = kmalloc(nlen + 1, GFP_NOFS);
+ if (!s)
+ return -ENOMEM;
+ *normalization = s;
+
+ utf8ncursor(&cur, data, str, len);
+ do {
+ *s = utf8byte(&cur);
+ } while (*s++);
+
+ return nlen;
+}
+
+static const struct charset_ops utf8_ops = {
+ .strncmp = utf8_strncmp,
+ .strncasecmp = utf8_strncasecmp,
+ .casefold = utf8_casefold,
+ .normalize = utf8_normalize,
+};
+
+static struct charset *utf8_load_charset(void *pargs)
+{
+ int maj, min, rev;
+ struct charset *charset;
+ substring_t *args = pargs;
+
+ if (match_int(&args[0], &maj) || match_int(&args[1], &min) ||
+ match_int(&args[2], &rev))
+ return NULL;
+
+ if (!utf8version_is_supported(maj, min, rev))
+ return NULL;
+
+ charset = kmalloc(sizeof(struct charset), GFP_KERNEL);
+ if (!charset)
+ return NULL;
+
+ charset->info = NULL;
+ charset->version = UNICODE_AGE(maj, min, rev);
+ charset->ops = &utf8_ops;
+
+ return charset;
+}
+
+static struct charset_info utf8_info = {
+ .name = "utf8",
+ .match_token = "utf8-%d.%d.%d",
+ .load_charset = utf8_load_charset,
+};
+
+static int __init init_utf8(void)
+{
+ charset_register(&utf8_info);
+ return 0;
+}
+
+static void __exit exit_utf8(void)
+{
+}
+
+module_init(init_utf8);
+module_exit(exit_utf8);
+MODULE_AUTHOR("Gabriel Krisman Bertazi");
+MODULE_DESCRIPTION("UTF-8 charset operations for filesystems");
+MODULE_LICENSE("GPL");
+
--
2.15.1

2018-01-25 02:55:14

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 08/13] charsets: utf8: Introduce test module for kernel UTF-8 implementation

Changes since RFC v1:
- Include comparison tests for matching strings with different lengths.
- Include tests for characters included in unicode 8.0.0, 9.0.0 and 10.0.0.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
lib/Kconfig | 5 +
lib/charsets/Makefile | 1 +
lib/charsets/test_ucd.c | 305 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 311 insertions(+)
create mode 100644 lib/charsets/test_ucd.c

diff --git a/lib/Kconfig b/lib/Kconfig
index 9ca3bda7b2be..dd4552dae48b 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -593,6 +593,11 @@ config UTF8_NORMALIZATION
depends on CHARSETS
help
Say Y here to enable utf8 normalization support.
+
+config UTF8_SELFTEST
+ tristate "Test UTF-8 normalization support"
+ default n
+ depends on UTF8_NORMALIZATION
endmenu

config GENERIC_ASHLDI3
diff --git a/lib/charsets/Makefile b/lib/charsets/Makefile
index 004d0e5d6ac4..cd57912ce7cb 100644
--- a/lib/charsets/Makefile
+++ b/lib/charsets/Makefile
@@ -6,6 +6,7 @@ obj-$(CONFIG_CHARSETS) += ascii.o

utf8-y += utf8_core.o utf8norm.o
obj-$(CONFIG_UTF8_NORMALIZATION) += utf8.o
+obj-$(CONFIG_UTF8_SELFTEST) += test_ucd.o

$(obj)/utf8norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
diff --git a/lib/charsets/test_ucd.c b/lib/charsets/test_ucd.c
new file mode 100644
index 000000000000..a579b798e82f
--- /dev/null
+++ b/lib/charsets/test_ucd.c
@@ -0,0 +1,305 @@
+/*
+ * Kernel module for testing utf-8 support.
+ *
+ * Copyright 2017 Collabora Ltd. All Rights Reserved
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/utf8norm.h>
+#include <linux/charsets.h>
+
+unsigned int failed_tests;
+unsigned int total_tests;
+
+/* Tests will be based on this version. */
+#define latest_maj 10
+#define latest_min 0
+#define latest_rev 0
+
+#define _test(cond, func, line, fmt, ...) do { \
+ total_tests++; \
+ if (!cond) { \
+ failed_tests++; \
+ pr_err("test %s:%d Failed: %s%s", \
+ func, line, #cond, (fmt?":":".")); \
+ if (fmt) \
+ pr_err(fmt, ##__VA_ARGS__); \
+ } \
+ } while (0)
+#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define test(cond) _test(cond, __func__, __LINE__, "")
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[10];
+ unsigned char dec[10];
+} nfkdi_test_data[] = {
+ /* Trivial sequence */
+ {
+ /* "ABba" decomposes to itself */
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00},
+ .dec = {0x41, 0x42, 0x62, 0x61, 0x00}
+ },
+ /* Simple equivalent sequences */
+ {
+ /* 'VULGAR FRACTION ONE QUARTER' decomposes to
+ 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */
+ .str = {0xc2, 0xbc, 0x00},
+ .dec = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to
+ 'LETTER A' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0x00},
+ .dec = {0x61, 0xcc, 0x88, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER LJ' decomposes to
+ 'LETTER L' + 'LETTER J' */
+ .str = {0xC7, 0x89, 0x00},
+ .dec = {0x6c, 0x6a, 0x00},
+ },
+ {
+ /* GREEK ANO TELEIA decomposes to MIDDLE DOT */
+ .str = {0xCE, 0x87, 0x00},
+ .dec = {0xC2, 0xB7, 0x00}
+ },
+ /* Canonical ordering */
+ {
+ /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes
+ to A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */
+ .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0},
+ .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'
+ decomposes to
+ 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
+
+ .dec = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},
+ },
+
+};
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[30];
+ unsigned char ncf[30];
+} nfkdicf_test_data[] = {
+ /* Trivial sequences */
+ {
+ /* "ABba" folds to lowercase */
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00},
+ .ncf = {0x61, 0x62, 0x62, 0x61, 0x00},
+ },
+ {
+ /* All ASCII folds to lower-case */
+ .str = "ABCDEFGHIJKLMNOPRSTUVWXYZ0.1",
+ .ncf = "abcdefghijklmnoprstuvwxyz0.1",
+ },
+ {
+ /* LATIN SMALL LETTER SHARP S folds to
+ LATIN SMALL LETTER S + LATIN SMALL LETTER S */
+ .str = {0xc3, 0x9f, 0x00},
+ .ncf = {0x73, 0x73, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to
+ LATIN SMALL LETTER A + COMBINING RING ABOVE */
+ .str = {0xC3, 0x85, 0x00},
+ .ncf = {0x61, 0xcc, 0x8a, 0x00},
+ },
+ /* Introduced by UTF-8.0.0. */
+ /* Cherokee letters are interesting test-cases because they fold
+ to upper-case. Before 8.0.0, Cherokee lowercase were
+ undefined, thus, the folding from LC is not stable between
+ 7.0.0 -> 8.0.0, but it is from UC. */
+ {
+ /* CHEROKEE SMALL LETTER A folds to CHEROKEE LETTER A */
+ .str = {0xea, 0xad, 0xb0, 0x00},
+ .ncf = {0xe1, 0x8e, 0xa0, 0x00},
+ },
+ {
+ /* CHEROKEE SMALL LETTER YE folds to CHEROKEE LETTER YE */
+ .str = {0xe1, 0x8f, 0xb8, 0x00},
+ .ncf = {0xe1, 0x8f, 0xb0, 0x00},
+ },
+ {
+ /* OLD HUNGARIAN CAPITAL LETTER AMB folds to
+ OLD HUNGARIAN SMALL LETTER AMB */
+ .str = {0xf0, 0x90, 0xb2, 0x83, 0x00},
+ .ncf = {0xf0, 0x90, 0xb3, 0x83, 0x00},
+ },
+ /* Introduced by UTF-9.0.0. */
+ {
+ /* OSAGE CAPITAL LETTER CHA folds to
+ OSAGE SMALL LETTER CHA */
+ .str = {0xf0, 0x90, 0x92, 0xb5, 0x00},
+ .ncf = {0xf0, 0x90, 0x93, 0x9d, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER SMALL CAPITAL I folds to
+ LATIN LETTER SMALL CAPITAL I */
+ .str = {0xea, 0x9e, 0xae, 0x00},
+ .ncf = {0xc9, 0xaa, 0x00},
+ },
+};
+
+static void check_utf8_nfkdi(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfkdi(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load Unicode %d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) {
+ int len = strlen(nfkdi_test_data[i].str);
+ int nlen = strlen(nfkdi_test_data[i].dec);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfkdi_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdi_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdi_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdi_test_data[i].dec[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfkdi_test_data[i].dec[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_nfkdicf(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfkdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load Unicode %d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) {
+ int len = strlen(nfkdicf_test_data[i].str);
+ int nlen = strlen(nfkdicf_test_data[i].ncf);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfkdicf_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdicf_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdicf_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdicf_test_data[i].ncf[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfkdicf_test_data[i].ncf[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_comparisons(void)
+{
+ int i;
+ const struct charset *charset = charset_load("utf8-10.0.0");
+
+ if(!charset) {
+ pr_err("%s: Unable to load Unicode %d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) {
+ const char *s1 = nfkdi_test_data[i].str;
+ const char *s2 = nfkdi_test_data[i].dec;
+
+ test_f(!charset_strncmp(charset, s1, strlen(s1), s2, strlen(s2)),
+ "%s %s comparison mismatch\n", s1, s2);
+ }
+ for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) {
+ const char *s1 = nfkdicf_test_data[i].str;
+ const char *s2 = nfkdicf_test_data[i].ncf;
+
+ test_f(!charset_strncasecmp(charset, s1, strlen(s1),
+ s2, strlen(s2)),
+ "%s %s comparison mismatch\n", s1, s2);
+ }
+
+}
+
+static void check_supported_versions(void)
+{
+ /* Unicode 7.0.0 should be supported. */
+ test(utf8version_is_supported(7, 0, 0));
+
+ /* Unicode 9.0.0 should be supported. */
+ test(utf8version_is_supported(9, 0, 0));
+
+ /* Unicode 10.0.0 (the latest version) should be supported. */
+ test(utf8version_is_supported(latest_maj, latest_min, latest_rev));
+
+ /* Next versions don't exist. */
+ test(!utf8version_is_supported(11, 0, 0));
+ test(!utf8version_is_supported(0, 0, 0));
+ test(!utf8version_is_supported(-1, -1, -1));
+}
+
+static int __init init_test_ucd(void)
+{
+ failed_tests = 0;
+ total_tests = 0;
+
+ check_supported_versions();
+ check_utf8_nfkdi();
+ check_utf8_nfkdicf();
+ check_utf8_comparisons();
+
+ if (!failed_tests)
+ pr_info("All %u tests passed\n", total_tests);
+ else
+ pr_err("%u out of %u tests failed\n", failed_tests,
+ total_tests);
+ return 0;
+}
+
+static void __exit exit_test_ucd(void)
+{
+}
+
+module_init(init_test_ucd);
+module_exit(exit_test_ucd);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi <[email protected]>");
+MODULE_LICENSE("GPL");
--
2.15.1

2018-01-25 02:53:45

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 09/13] ext4: Add ignorecase mount option

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/super.c | 5 +++++
2 files changed, 8 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4e091eae38b1..09fb2426c352 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1130,6 +1130,9 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */

+#define EXT4_MOUNT2_CASE_INSENSITIVE 0x00000010 /* Perform case-insensitive
+ lookups */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7c46693a14d7..f737e90a3b4e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1350,6 +1350,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
+ Opt_ignore_case,
};

static const match_table_t tokens = {
@@ -1432,6 +1433,7 @@ static const match_table_t tokens = {
{Opt_noinit_itable, "noinit_itable"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
+ {Opt_ignore_case, "ignorecase"},
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
@@ -1687,6 +1689,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
case Opt_nolazytime:
sb->s_flags &= ~SB_LAZYTIME;
return 1;
+ case Opt_ignore_case:
+ set_opt2(sb, CASE_INSENSITIVE);
+ return 1;
}

for (m = ext4_mount_opts; m->token != Opt_err; m++)
--
2.15.1

2018-01-25 02:55:24

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 10/13] ext4: Include encoding information on the superblock

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/super.c | 27 ++++++++++++++++++++++++++-
2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 09fb2426c352..6c8aaf2dd322 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1369,6 +1369,7 @@ struct ext4_sb_info {
struct kobject s_kobj;
struct completion s_kobj_unregister;
struct super_block *s_sb;
+ const struct charset *encoding;

/* Journaling */
struct journal_s *s_journal;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f737e90a3b4e..022bf5f274b2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -40,6 +40,7 @@
#include <linux/dax.h>
#include <linux/cleancache.h>
#include <linux/uaccess.h>
+#include <linux/charsets.h>

#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1350,7 +1351,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_ignore_case,
+ Opt_ignore_case, Opt_encoding,
};

static const match_table_t tokens = {
@@ -1434,6 +1435,7 @@ static const match_table_t tokens = {
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_ignore_case, "ignorecase"},
+ {Opt_encoding, "encoding=%s"},
{Opt_nombcache, "nombcache"},
{Opt_nombcache, "no_mbcache"}, /* for backward compatibility */
{Opt_removed, "check=none"}, /* mount option from ext2/3 */
@@ -1644,6 +1646,7 @@ static const struct mount_opts {
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+ {Opt_encoding, 0, MOPT_EXT4_ONLY | MOPT_STRING},
{Opt_err, 0, 0}
};

@@ -1882,6 +1885,16 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_mount_opt |= m->mount_opt;
} else if (token == Opt_data_err_ignore) {
sbi->s_mount_opt &= ~m->mount_opt;
+ } else if (token == Opt_encoding) {
+ char *charset = match_strdup(&args[0]);
+
+ sbi->encoding = charset_load(charset);
+ if (!sbi->encoding) {
+ ext4_msg(sb, KERN_ERR, "Encoding %s not supported\n",
+ charset);
+ return -1;
+ }
+ kfree(charset);
} else {
if (!args->from)
arg = 1;
@@ -1968,6 +1981,12 @@ static int parse_options(char *options, struct super_block *sb,
return 0;
}
}
+
+ if (test_opt2(sb, CASE_INSENSITIVE) && !sbi->encoding) {
+ ext4_msg(sb, KERN_WARNING, "Encoding not explicitly specified "
+ "or identified on the superblock. ASCII is assumed.");
+ }
+
return 1;
}

@@ -3598,6 +3617,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
&journal_ioprio, 0))
goto failed_mount;

+ if (!sbi->encoding) {
+ sbi->encoding = charset_load("ascii");
+ if (!sbi->encoding)
+ goto failed_mount;
+ }
+
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
"with data=journal disables delayed "
--
2.15.1

2018-01-25 02:53:49

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 13/13] ext4: Implement ext4 dcache hooks for custom charsets

Changes from RFC v1:
- Handle errors from charset_normalize/casefold (Olaf Weber)
- Send length of both strings to comparison functions.
- Cast length type to size_t

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 2 ++
fs/ext4/super.c | 5 ++++
3 files changed, 80 insertions(+)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index d5babc9f222b..1c75aff87098 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -25,6 +25,7 @@
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/slab.h>
+#include <linux/charsets.h>
#include "ext4.h"
#include "xattr.h"

@@ -661,3 +662,75 @@ const struct file_operations ext4_dir_operations = {
.open = ext4_dir_open,
.release = ext4_release_dir,
};
+
+static int ext4_d_hash(const struct dentry *dentry, struct qstr *q)
+{
+ unsigned long hash;
+ int i, len;
+ char *str;
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ len = charset_normalize(charset, q->name, q->len, &str);
+
+ if (len < 0) {
+ kfree(str);
+ return -EINVAL;
+ }
+
+ hash = init_name_hash(dentry);
+ for (i = 0; i < len; i++)
+ hash = partial_name_hash(str[i], hash);
+ q->hash = end_name_hash(hash);
+
+ kfree(str);
+ return 0;
+}
+
+static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ return charset_strncmp(charset, str, len, name->name, strlen(name->name));
+}
+
+const struct dentry_operations ext4_dentry_ops = {
+ .d_hash = ext4_d_hash,
+ .d_compare = ext4_d_compare,
+};
+
+static int ext4_d_ci_hash(const struct dentry *dentry, struct qstr *q)
+{
+ unsigned long hash;
+ int i, len;
+ char *str;
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ len = charset_casefold(charset, q->name, q->len, &str);
+
+ if (len < 0) {
+ kfree(str);
+ return -EINVAL;
+ }
+
+ hash = init_name_hash(dentry);
+ for (i = 0; i < len; i++)
+ hash = partial_name_hash(str[i], hash);
+ q->hash = end_name_hash(hash);
+
+ kfree(str);
+ return 0;
+}
+
+static int ext4_d_ci_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ const struct charset *charset = EXT4_SB(dentry->d_sb)->encoding;
+
+ return charset_strncasecmp(charset, str, len, name->name, strlen(name->name));
+}
+
+const struct dentry_operations ext4_ci_dentry_ops = {
+ .d_hash = ext4_d_ci_hash,
+ .d_compare = ext4_d_ci_compare,
+};
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6c8aaf2dd322..980968aeb1cf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2943,6 +2943,8 @@ static inline void ext4_unlock_group(struct super_block *sb,

/* dir.c */
extern const struct file_operations ext4_dir_operations;
+extern const struct dentry_operations ext4_dentry_ops;
+extern const struct dentry_operations ext4_ci_dentry_ops;

/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 022bf5f274b2..877acb938453 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4358,6 +4358,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
if (es->s_error_count)
mod_timer(&sbi->s_err_report, jiffies + 300*HZ); /* 5 minutes */

+ if (test_opt2(sb, CASE_INSENSITIVE))
+ sb->s_d_op = &ext4_ci_dentry_ops;
+ else
+ sb->s_d_op = &ext4_dentry_ops;
+
/* Enable message ratelimiting. Default is 10 messages per 5 secs. */
ratelimit_state_init(&sbi->s_err_ratelimit_state, 5 * HZ, 10);
ratelimit_state_init(&sbi->s_warning_ratelimit_state, 5 * HZ, 10);
--
2.15.1

2018-01-25 02:55:29

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 11/13] fscrypt: Introduce charset-based matching functions

Changes since RFC v1:
- Send length of both strings to comparison functions.
- Change length type to size_t.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/fscrypt.h | 1 +
include/linux/fscrypt_notsupp.h | 16 ++++++++++++++++
include/linux/fscrypt_supp.h | 27 +++++++++++++++++++++++++++
3 files changed, 44 insertions(+)

diff --git a/include/linux/fscrypt.h b/include/linux/fscrypt.h
index 08b4b40c5aa8..8ca010ef2871 100644
--- a/include/linux/fscrypt.h
+++ b/include/linux/fscrypt.h
@@ -21,6 +21,7 @@
#include <linux/dcache.h>
#include <crypto/skcipher.h>
#include <uapi/linux/fs.h>
+#include <linux/charsets.h>

#define FS_CRYPTO_BLOCK_SIZE 16

diff --git a/include/linux/fscrypt_notsupp.h b/include/linux/fscrypt_notsupp.h
index 63e58808519a..d790052b8cbe 100644
--- a/include/linux/fscrypt_notsupp.h
+++ b/include/linux/fscrypt_notsupp.h
@@ -160,6 +160,22 @@ static inline bool fscrypt_match_name(const struct fscrypt_name *fname,
return !memcmp(de_name, fname->disk_name.name, fname->disk_name.len);
}

+static inline bool fscrypt_charset_match_name(const struct fscrypt_name *fname,
+ const struct charset *charset,
+ const u8 *de_name,
+ u32 de_name_len, bool ignorecase)
+{
+ if (!ignorecase) {
+ return !charset_strncmp(charset, (char *) de_name, de_name_len,
+ fname->disk_name.name,
+ fname->disk_name.len);
+ }
+
+ return !charset_strncasecmp(charset, (char *) de_name, (size_t) de_name_len,
+ fname->disk_name.name,
+ (size_t) fname->disk_name.len);
+}
+
/* bio.c */
static inline void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *ctx,
struct bio *bio)
diff --git a/include/linux/fscrypt_supp.h b/include/linux/fscrypt_supp.h
index cf9e9fc02f0a..8ef12553a174 100644
--- a/include/linux/fscrypt_supp.h
+++ b/include/linux/fscrypt_supp.h
@@ -138,6 +138,33 @@ static inline bool fscrypt_match_name(const struct fscrypt_name *fname,
return !memcmp(de_name, fname->disk_name.name, fname->disk_name.len);
}

+static inline bool fscrypt_charset_match_name(const struct fscrypt_name *fname,
+ const struct charset *charset,
+ const u8 *de_name,
+ u32 de_name_len, bool ignorecase)
+{
+ if (unlikely(!fname->disk_name.name)) {
+ const struct fscrypt_digested_name *n =
+ (const void *)fname->crypto_buf.name;
+ if (WARN_ON_ONCE(fname->usr_fname->name[0] != '_'))
+ return false;
+ if (de_name_len <= FSCRYPT_FNAME_MAX_UNDIGESTED_SIZE)
+ return false;
+ return !memcmp(FSCRYPT_FNAME_DIGEST(de_name, de_name_len),
+ n->digest, FSCRYPT_FNAME_DIGEST_SIZE);
+ }
+
+ if (!ignorecase) {
+ return !charset_strncmp(charset, (char *) de_name, de_name_len,
+ fname->disk_name.name,
+ fname->disk_name.len);
+ }
+
+ return !charset_strncasecmp(charset, (char *) de_name, de_name_len,
+ fname->disk_name.name,
+ fname->disk_name.len);
+}
+
/* bio.c */
extern void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *, struct bio *);
extern void fscrypt_pullback_bio_page(struct page **, bool);
--
2.15.1

2018-01-25 02:55:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: [PATCH RFC v2 12/13] ext4: Support charset name matching

Implement generic caseless lookup based on the mount flags using the
charset library.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/namei.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index e750d68fbcb5..24056172e225 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1255,8 +1255,10 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block)
*
* Return: %true if the directory entry matches, otherwise %false.
*/
-static inline bool ext4_match(const struct ext4_filename *fname,
- const struct ext4_dir_entry_2 *de)
+static inline bool ext4_match(const struct charset *charset,
+ const struct ext4_filename *fname,
+ const struct ext4_dir_entry_2 *de,
+ bool ignorecase)
{
struct fscrypt_name f;

@@ -1268,7 +1270,8 @@ static inline bool ext4_match(const struct ext4_filename *fname,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
f.crypto_buf = fname->crypto_buf;
#endif
- return fscrypt_match_name(&f, de->name, de->name_len);
+ return fscrypt_charset_match_name(&f, charset, de->name, de->name_len,
+ ignorecase);
}

/*
@@ -1281,6 +1284,8 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
struct ext4_dir_entry_2 * de;
char * dlimit;
int de_len;
+ struct super_block *sb = dir->i_sb;
+ bool ignorecase = test_opt2(sb, CASE_INSENSITIVE);

de = (struct ext4_dir_entry_2 *)search_buf;
dlimit = search_buf + buf_size;
@@ -1288,7 +1293,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
if ((char *) de + de->name_len <= dlimit &&
- ext4_match(fname, de)) {
+ ext4_match(EXT4_SB(sb)->encoding, fname, de, ignorecase)) {
/* found a match - just to be sure, do
* a full check */
if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
@@ -1389,11 +1394,13 @@ static struct buffer_head * ext4_find_entry (struct inode *dir,
if (is_dx(dir)) {
ret = ext4_dx_find_entry(dir, &fname, res_dir);
/*
- * On success, or if the error was file not found,
- * return. Otherwise, fall back to doing a search the
- * old fashioned way.
+ * On success, or if the file cannot be found, return.
+ * If an error occurred, or if the file was not found
+ * but we can do case-insensitive lookups, fall back to
+ * the linear search.
*/
- if (!IS_ERR(ret) || PTR_ERR(ret) != ERR_BAD_DX_DIR)
+ if ((!IS_ERR(ret) && (ret || !test_opt2(sb, CASE_INSENSITIVE))) ||
+ (IS_ERR(ret) && PTR_ERR(ret) != ERR_BAD_DX_DIR))
goto cleanup_and_exit;
dxtrace(printk(KERN_DEBUG "ext4_find_entry: dx failed, "
"falling back\n"));
@@ -1788,6 +1795,8 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
int nlen, rlen;
unsigned int offset = 0;
char *top;
+ struct super_block *sb = dir->i_sb;
+ bool ignorecase = test_opt2(sb, CASE_INSENSITIVE);

de = (struct ext4_dir_entry_2 *)buf;
top = buf + buf_size - reclen;
@@ -1795,7 +1804,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
if (ext4_check_dir_entry(dir, NULL, de, bh,
buf, buf_size, offset))
return -EFSCORRUPTED;
- if (ext4_match(fname, de))
+ if (ext4_match(EXT4_SB(sb)->encoding, fname, de, ignorecase))
return -EEXIST;
nlen = EXT4_DIR_REC_LEN(de->name_len);
rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
--
2.15.1

2018-01-25 03:16:56

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:

> The second proposal is related to the VFS layer:
>
> (2) Enable Insensitive lookup support on a per-mountpoint basis,
> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
> case-insensitive bind mount of a subtree, side-by-side with a
> sensitive version of the filesystem.

First reaction: No. With the side of HELL NO.

Your ultimate goal sounds utterly insane - dcache tree must be shared
for all mounts. Moreover, "would these two names refer to the same
object" can not be mount-dependent. Not going to happen.

Please, post the description of what you are planning to do.
Detailed. I'm not saying that it's 100% impossible to do correctly,
but I'm _very_ sceptical about the feasibility.

I'm certainly not going to ACK any VFS changes until you convince
me that this thing can be done with sane semantics.

2018-01-25 19:32:39

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

On Thu, Jan 25, 2018 at 03:16:50AM +0000, Al Viro wrote:
> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>
> > The second proposal is related to the VFS layer:
> >
> > (2) Enable Insensitive lookup support on a per-mountpoint basis,
> > via a MS_CASEFOLD flag, with the ultimate goal of supporting a
> > case-insensitive bind mount of a subtree, side-by-side with a
> > sensitive version of the filesystem.
>
> First reaction: No. With the side of HELL NO.
>
> Your ultimate goal sounds utterly insane - dcache tree must be shared
> for all mounts. Moreover, "would these two names refer to the same
> object" can not be mount-dependent. Not going to happen.
>
> Please, post the description of what you are planning to do.
> Detailed. I'm not saying that it's 100% impossible to do correctly,
> but I'm _very_ sceptical about the feasibility.

So I can't speak authoratatively about the Collabora's customer
requirements that drove this particular feature, but I can talk about
why I am interested in this feature --- although I understand why it's
hard to do correctly.

Android is currently using a horrible wrapfs scheme to provide case
insesitive lookups. We talked about this last year at LSF/MM[1]. It
is using separate dentry entries for the upper and lower layers, and
it is *definitely* prone to races. Running fsstress on the top and
lower layers simultaneously is a pretty simple way to induce hilarity,
although my understanding is that the developers working on it having
been hammering on it for a while so that the races that cause kernel
panicks or file system damage are rare. I don't believe they have
been completely elimniated, because it's not clear to me that any
wrapfs solution can ever be made race-free.

[1] https://lwn.net/Articles/718640/

The problem is that both the case insensitive and case sensitive
directory trees are made available to userspace, and for backwards
compatibility reasons, they can't just make it disappear. Although
they don't have a Linus Torvalds going all Mr. Shouty about breaking
userspace, the combined whinging from the App developer ecosystem
probably can be a fairly good substitute for Linus when they complain
about compatibility breakages visible to the Android application
programmer. :-)

So my hope has been to find a way to (a) make wrapfs go away, and (b)
allow the functionality of case sensitive and case insensitive bind
mounts of the same underlying file system directories be *somehow*
supported.

If the answer is that if two files that differ only in case are
created via the case sensitive mount results in it being undefined
which file you get when you open, rename, or delete that file via the
case insensitive mount, that's fine. I think that's what happpens in
the wrapfs implementation today anyway. It would probably be fine if
the answer was you get the file with an exact match if it is present,
but if the two files created on the case sensitive side are "makefile"
and "Makefile", and you try open/delete "MaKeFiLe" via the case
insensitive mount, it's undefined which file you get --- that's fine.

Anyone who does that will get everything they deserve, and they get
that today anyway with the wrapfs or the previous fuse hack anyway.

This may indeed be an O_PONIES style request. And in the worst case,
Android will continue using an out-of-tree wrapfs solution, and
hopefully I can fend off people asking me for help when the wrapfs
code blows up in various not-so-entertainng ways. Which is,
admittedly, a purely selfish reason for why I would like to see if we
can give Android folks equivalent functionality to the wrapfs hack in
a sane, safe, and consensual way. :-)

Cheers,

- Ted

2018-01-26 02:52:32

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Hi Ted,

On 2018/1/26 3:32, Theodore Ts'o wrote:
> On Thu, Jan 25, 2018 at 03:16:50AM +0000, Al Viro wrote:
>> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>>
>>> The second proposal is related to the VFS layer:
>>>
>>> (2) Enable Insensitive lookup support on a per-mountpoint basis,
>>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
>>> case-insensitive bind mount of a subtree, side-by-side with a
>>> sensitive version of the filesystem.
>>
>> First reaction: No. With the side of HELL NO.
>>
>> Your ultimate goal sounds utterly insane - dcache tree must be shared
>> for all mounts. Moreover, "would these two names refer to the same
>> object" can not be mount-dependent. Not going to happen.
>>
>> Please, post the description of what you are planning to do.
>> Detailed. I'm not saying that it's 100% impossible to do correctly,
>> but I'm _very_ sceptical about the feasibility.
>
> So I can't speak authoratatively about the Collabora's customer
> requirements that drove this particular feature, but I can talk about
> why I am interested in this feature --- although I understand why it's
> hard to do correctly.
>
> Android is currently using a horrible wrapfs scheme to provide case
> insesitive lookups. We talked about this last year at LSF/MM[1]. It
> is using separate dentry entries for the upper and lower layers, and
> it is *definitely* prone to races. Running fsstress on the top and
> lower layers simultaneously is a pretty simple way to induce hilarity,
> although my understanding is that the developers working on it having
> been hammering on it for a while so that the races that cause kernel
> panicks or file system damage are rare. I don't believe they have
> been completely elimniated, because it's not clear to me that any
> wrapfs solution can ever be made race-free.

Sorry for bother, I mean you refer to sdcardfs,
The current Samsung sdcardfs and esdfs are all fsstress broken for many issues,
However we developed an another approach last year, in

https://android-review.googlesource.com/#/c/kernel/common/+/574402/3/Documentation/filesystems/hwsdcardfs.txt

We applied it on HUAWEI Mate 10 (pro), and the complete solution is

https://android-review.googlesource.com/#/q/status:open+project:kernel/common+branch:android-4.4+topic:huawei_sdcardfs

It supports stackable case-insensitive without modifing VFS smoothly,
and we tested fsstress on the top and lower layers at the same time
without any problem.

>
> [1] https://lwn.net/Articles/718640/
>
> The problem is that both the case insensitive and case sensitive
> directory trees are made available to userspace, and for backwards
> compatibility reasons, they can't just make it disappear. Although
> they don't have a Linus Torvalds going all Mr. Shouty about breaking
> userspace, the combined whinging from the App developer ecosystem
> probably can be a fairly good substitute for Linus when they complain
> about compatibility breakages visible to the Android application
> programmer. :-)
>
> So my hope has been to find a way to (a) make wrapfs go away, and (b)
> allow the functionality of case sensitive and case insensitive bind
> mounts of the same underlying file system directories be *somehow*
> supported.
>
> If the answer is that if two files that differ only in case are
> created via the case sensitive mount results in it being undefined
> which file you get when you open, rename, or delete that file via the
> case insensitive mount, that's fine. I think that's what happpens in
> the wrapfs implementation today anyway. It would probably be fine if
> the answer was you get the file with an exact match if it is present,
> but if the two files created on the case sensitive side are "makefile"
> and "Makefile", and you try open/delete "MaKeFiLe" via the case
> insensitive mount, it's undefined which file you get --- that's fine.

We have a solution to solve case-insensitive stackable but without modifing VFS.

Thanks,

>
> Anyone who does that will get everything they deserve, and they get
> that today anyway with the wrapfs or the previous fuse hack anyway.
>
> This may indeed be an O_PONIES style request. And in the worst case,
> Android will continue using an out-of-tree wrapfs solution, and
> hopefully I can fend off people asking me for help when the wrapfs
> code blows up in various not-so-entertainng ways. Which is,
> admittedly, a purely selfish reason for why I would like to see if we
> can give Android folks equivalent functionality to the wrapfs hack in
> a sane, safe, and consensual way. :-)
>
> Cheers,
>
> - Ted
>

2018-02-06 02:24:28

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Al Viro <[email protected]> writes:

> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>
>> The second proposal is related to the VFS layer:
>>
>> (2) Enable Insensitive lookup support on a per-mountpoint basis,
>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
>> case-insensitive bind mount of a subtree, side-by-side with a
>> sensitive version of the filesystem.
>
> First reaction: No. With the side of HELL NO.
>
> Your ultimate goal sounds utterly insane - dcache tree must be shared
> for all mounts. Moreover, "would these two names refer to the same
> object" can not be mount-dependent. Not going to happen.
>
> Please, post the description of what you are planning to do.
> Detailed. I'm not saying that it's 100% impossible to do correctly,
> but I'm _very_ sceptical about the feasibility.
>
> I'm certainly not going to ACK any VFS changes until you convince
> me that this thing can be done with sane semantics.

Hi Viro,

Sorry for the delay. I wanted to get a better understanding of the
code before getting back to you.

Our customer is interested in exposing a subtree of an existing
filesystem (native Linux filesystems, xfs, ext4 and others) in an
case-insensitive lookup manner, without paying the cost of a userspace
getdents implementation, and, preferably, without requiring the user to
modify data on the disk.

After learning, last year, about the similar Android issue from Ted at a
conference, I started to believe a bind mount is the appropriate
solution for both our use cases.

Like Ted mentioned, we let the user shot herself in the foot by having
two files with the exact CI name, and which one she gets will be
undefined. My current solution gets the exact match if it is present,
and my second approach could be adapted to do that with a performance
cost only if both files are not cached. That seem to be acceptable by
Ted and shouldn't be a problem for me too.

Let me start by discussing what I have now, which is a functional
solution, to the best of my knowledge, but which still requires rework
to better benefit from the dentry cache. Then I will discuss the
solutions I am working on.

(1) What I have now:

As I mentioned in the first email, MS_CASEFOLD triggers a MNT_CASEFOLD
flag in the vfsmount structure. This flag, which gets recalculated
every time a mountpoint is traversed, is used to decide if we do a
casefolded ->lookup() for each component of the path. This obviously
allows paths with multiple nested mountpoints to work.

The ->lookup() function calls d_add_ci(), making sure that only the
exact-match dentry is inserted in the cache. This means that we don't
have multiple dentries, differing only by case, associated to the same
file. This is an important semantic, which I am not modifying. This
also means that a following CI lookup with an inexact-case match will
trigger another ->lookup(), which is bad, but is manageable (for now).
I address this problem later in this email.

Regarding negative dentries. If we have a negative dentry in the cache,
the case-sensitive algorithm will abort the lookup and return the
dentry, nothing new here. If, on the other hand, That component lookup
is under a MNT_CASEFOLD search, the code does a bigger effort, ignores
the negative dentry and still do ->lookup().

This means that:

- Case-sensitive (CS) search performance is not affected, except for
the following sequence: Do a CS(foo) which returns a negative_dentry,
then CI(FOO), and finally CS(foo). This will require the second
CS(foo) to re-do the search and recreate the negative dentry. If
only CS searches are done, the performance should be unaffected.

- Case-insensitive (CI) benefits from the dentry cache as long as it is
doing an exact-case name match. Otherwise, it always falls to
->lookup(). The exact-name dentry, if already in the cache, will be
returned in both cases.

- Negative dentries are ignored if doing CI and it always re-do the
search.

The prototype code is in the following branch. I can send
patches if you want.

https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold

(2) What I want to do:

I want to allow lookups of the inexact name to find the cached dentry,
preventing it from going to the disk only to find out later it already
has that dentry cached in some other cache bucket. This means that,
whenever there is a MNT_CASEFOLD enabled, every mountpoint of that
superblock needs to use d_hash(casefold(name)), such that every file is
in a known bucket. Obviously this is a worse hash function, but the
cost is only payed by filesystems using the feature. If a bind mount is
done, we must rehash all existing entries of that filesystem at
"bind,remount,ignorecase" time.

Like the previous version, we only store exact name dentries, even
though the bucket is different now. When doing a LOOKUP_CASEFOLD, we
use a different ->d_compare_CI() to do the matching, such that we can
find the right dentry or not, in both scenarios (CI and CS).

We still can't trust negative dentries when doing a CI lookup. Consider
the case where we do CS(FOO) and foo doesn't exist. Now we have a
negative_dentry(FOO), but CI(FOO) would still need to find an existing
'foo' file. The first solution is to make fast_CI_lookup() walk the
entire bucket and ignore previous matching negative dentries if a
positive one is found. If no positive dentry is found, then fallback
and do ->lookup(). A superior solution is to mark negative dentries as
soft negatives and hard negatives, depending on what kind of lookup
created the dentry, whether CS or CI, respectively. Then, we can allow
a CI lookup to return the negative dentry only if it is a HARD negative,
and try the ->lookup() otherwise. For this to work, any dentry creation
on that parent requires that we invalidate hard negative children
dentries, and make them soft, at insertion/rename time.

With these changes I believe the dentry cache could be shared by code
doing CI and CS lookups of the same dentry. My main concern is whether
we could invalidate a negative dentry without harming parallel lookups
using that dentry, but I believe I can work around that.

I believe this covers everything I could identify, but please let me
know if I missed something, obvious or not.

--
Gabriel Krisman Bertazi

2018-02-06 03:21:44

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Hi Gabriel,

On 2018/2/6 10:24, Gabriel Krisman Bertazi wrote:
> Al Viro <[email protected]> writes:
>
>> On Thu, Jan 25, 2018 at 12:53:36AM -0200, Gabriel Krisman Bertazi wrote:
>>
>>> The second proposal is related to the VFS layer:
>>>
>>> (2) Enable Insensitive lookup support on a per-mountpoint basis,
>>> via a MS_CASEFOLD flag, with the ultimate goal of supporting a
>>> case-insensitive bind mount of a subtree, side-by-side with a
>>> sensitive version of the filesystem.
>>
>> First reaction: No. With the side of HELL NO.
>>
>> Your ultimate goal sounds utterly insane - dcache tree must be shared
>> for all mounts. Moreover, "would these two names refer to the same
>> object" can not be mount-dependent. Not going to happen.
>>
>> Please, post the description of what you are planning to do.
>> Detailed. I'm not saying that it's 100% impossible to do correctly,
>> but I'm _very_ sceptical about the feasibility.
>>
>> I'm certainly not going to ACK any VFS changes until you convince
>> me that this thing can be done with sane semantics.
>
> Hi Viro,
>
> Sorry for the delay. I wanted to get a better understanding of the
> code before getting back to you.
>
> Our customer is interested in exposing a subtree of an existing
> filesystem (native Linux filesystems, xfs, ext4 and others) in an
> case-insensitive lookup manner, without paying the cost of a userspace
> getdents implementation, and, preferably, without requiring the user to
> modify data on the disk.
Could I express my opinion? I have working on case-insensitive sdcardfs
for months.

I think your problem is how we optimise a case-insensitive lookup on the
file system with a case-sensitive dcache (I mean d_add and no d_compare
and d_hash).
In that case, we could not trust the negative dentry when _creating_ a
case-insensitive file, for example:
there exists "anDroid" on-disk, but ext4's in-memory dcache only has
the negative "Android", if we lookup "Android" we will get the
_negative_ dentry, but we _cannot_ create it since "anDroid" exists on
disk. In the create case, an on-disk _iterate_ (or readdir) is necessary.

I could give another example, if we uses case-insentive ext4 and create
"Android" and "anDroid", how to deal with the case in the
case-insensitive way?
I mean in that case we should make both "Android" and "anDroid" can
access, right?
I think we need to build a special case-sensitive dcache rather than
a case-insensitive dcache following the native case-insentive fs(use
d_add_ci, d_compare and d_hash, eg. fat, ntfs...)

Finally, I agree "let the user shot herself in the foot by having two
files with the exact CI name", but I think it could not the VFS
_busniess_ itself since each customer solution "case-sensitive ext4 ->
case-insensitive lookup" has their _perfered_ way (for example,
"android" and "Android" exist, A perfers android and B perfers Android.

Finally, I think for optmization, ext4 or other fs could add some dir
inode _tag_ and supports native case-insensitive for these dirs could be
better....

Sorry to bother,

Thanks,

>
> After learning, last year, about the similar Android issue from Ted at a
> conference, I started to believe a bind mount is the appropriate
> solution for both our use cases.
>
> Like Ted mentioned, we let the user shot herself in the foot by having
> two files with the exact CI name, and which one she gets will be
> undefined. My current solution gets the exact match if it is present,
> and my second approach could be adapted to do that with a performance
> cost only if both files are not cached. That seem to be acceptable by
> Ted and shouldn't be a problem for me too.
>
> Let me start by discussing what I have now, which is a functional
> solution, to the best of my knowledge, but which still requires rework
> to better benefit from the dentry cache. Then I will discuss the
> solutions I am working on.
>
> (1) What I have now:
>
> As I mentioned in the first email, MS_CASEFOLD triggers a MNT_CASEFOLD
> flag in the vfsmount structure. This flag, which gets recalculated
> every time a mountpoint is traversed, is used to decide if we do a
> casefolded ->lookup() for each component of the path. This obviously
> allows paths with multiple nested mountpoints to work.
>
> The ->lookup() function calls d_add_ci(), making sure that only the
> exact-match dentry is inserted in the cache. This means that we don't
> have multiple dentries, differing only by case, associated to the same
> file. This is an important semantic, which I am not modifying. This
> also means that a following CI lookup with an inexact-case match will
> trigger another ->lookup(), which is bad, but is manageable (for now).
> I address this problem later in this email.
>
> Regarding negative dentries. If we have a negative dentry in the cache,
> the case-sensitive algorithm will abort the lookup and return the
> dentry, nothing new here. If, on the other hand, That component lookup
> is under a MNT_CASEFOLD search, the code does a bigger effort, ignores
> the negative dentry and still do ->lookup().
>
> This means that:
>
> - Case-sensitive (CS) search performance is not affected, except for
> the following sequence: Do a CS(foo) which returns a negative_dentry,
> then CI(FOO), and finally CS(foo). This will require the second
> CS(foo) to re-do the search and recreate the negative dentry. If
> only CS searches are done, the performance should be unaffected.
>
> - Case-insensitive (CI) benefits from the dentry cache as long as it is
> doing an exact-case name match. Otherwise, it always falls to
> ->lookup(). The exact-name dentry, if already in the cache, will be
> returned in both cases.
>
> - Negative dentries are ignored if doing CI and it always re-do the
> search.
>
> The prototype code is in the following branch. I can send
> patches if you want.
>
> https://gitlab.collabora.com/krisman/linux.git -b vfs-ms_casefold
>
> (2) What I want to do:
>
> I want to allow lookups of the inexact name to find the cached dentry,
> preventing it from going to the disk only to find out later it already
> has that dentry cached in some other cache bucket. This means that,
> whenever there is a MNT_CASEFOLD enabled, every mountpoint of that
> superblock needs to use d_hash(casefold(name)), such that every file is
> in a known bucket. Obviously this is a worse hash function, but the
> cost is only payed by filesystems using the feature. If a bind mount is
> done, we must rehash all existing entries of that filesystem at
> "bind,remount,ignorecase" time.
>
> Like the previous version, we only store exact name dentries, even
> though the bucket is different now. When doing a LOOKUP_CASEFOLD, we
> use a different ->d_compare_CI() to do the matching, such that we can
> find the right dentry or not, in both scenarios (CI and CS).
>
> We still can't trust negative dentries when doing a CI lookup. Consider
> the case where we do CS(FOO) and foo doesn't exist. Now we have a
> negative_dentry(FOO), but CI(FOO) would still need to find an existing
> 'foo' file. The first solution is to make fast_CI_lookup() walk the
> entire bucket and ignore previous matching negative dentries if a
> positive one is found. If no positive dentry is found, then fallback
> and do ->lookup(). A superior solution is to mark negative dentries as
> soft negatives and hard negatives, depending on what kind of lookup
> created the dentry, whether CS or CI, respectively. Then, we can allow
> a CI lookup to return the negative dentry only if it is a HARD negative,
> and try the ->lookup() otherwise. For this to work, any dentry creation
> on that parent requires that we invalidate hard negative children
> dentries, and make them soft, at insertion/rename time.
>
> With these changes I believe the dentry cache could be shared by code
> doing CI and CS lookups of the same dentry. My main concern is whether
> we could invalidate a negative dentry without harming parallel lookups
> using that dentry, but I believe I can work around that.
>
> I believe this covers everything I could identify, but please let me
> know if I missed something, obvious or not.
>

2018-02-12 19:56:30

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Gao Xiang <[email protected]> writes:

> Could I express my opinion? I have working on case-insensitive sdcardfs
> for months.

Hi Gao,

Thanks for helping out with this topic.

> I think your problem is how we optimise a case-insensitive lookup on the
> file system with a case-sensitive dcache (I mean d_add and no d_compare
> and d_hash).

Are d_compare and d_hash to be considered really disruptive
performance-wise? Even if they are only used when casefold/encoding
support is enabled? I don't see how we could better use the dcache
without at least requiring these functions to handle CI cases.

> In that case, we could not trust the negative dentry when _creating_ a
> case-insensitive file, for example:
> there exists "anDroid" on-disk, but ext4's in-memory dcache only has
> the negative "Android", if we lookup "Android" we will get the
> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
> disk. In the create case, an on-disk _iterate_ (or readdir) is
> necessary.

In my previous email, I mentioned my current implementation ignores
negative dentries and forces a ->lookup(), which walks over the disk
entries. (I had to add a fix to the creation path in the vfs-ms_casefold
branch to exactly match that description, so you might have missed the
updated version in that branch).

Either way, this case is supported like this:

If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
case-sensitive and case-insensitive, respectively, We can do:

open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
open("/mnt/Android", 0) = -2 No such file or directory
open("/mnt-ci/Android", 0) = 4
open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists

The second open() is expected to create an negative_dentry of "Android",
which, if it wasn't ignored by the 3th open(), the CI operation would
have failed. Notice that the 3th open() operation actually opens the
file that was created by the first open(). It doesn't create a new
file.

Following on, the 4th operation (file creation) *must fail* because
there is a CI name collision with /mnt-ci/anDroid. The same is true for
the final case.

> I could give another example, if we uses case-insentive ext4 and create
> "Android" and "anDroid", how to deal with the case in the
> case-insensitive way?
> I mean in that case we should make both "Android" and "anDroid" can
> access, right?

Not sure if I follow you here, but I'm assuming we create Android and
anDroid in the sensitive mountpoint, because, otherwise the
second file creation in the insensitive mountpoint would fail.

This is the case where I'm hiding one of the previously (CS) created
files, when in the insensitive mountpoint, and the user is shooting
himself. For the sensitive case, Both stays visible to the user.

> I think we need to build a special case-sensitive dcache rather than
> a case-insensitive dcache following the native case-insentive fs(use
> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)

What do you think about the second part of my proposal, where I mention
dealing differently with negative dentries created by a CI lookup?
We don't need to ignore them if we can invalidate them after a creation
in the directory.

> Finally, I agree "let the user shot herself in the foot by having two
> files with the exact CI name", but I think it could not the VFS
> _busniess_ itself since each customer solution "case-sensitive ext4 ->
> case-insensitive lookup" has their _perfered_ way (for example,
> "android" and "Android" exist, A perfers android and B perfers Android.

I don't see how we could defer the decision to the filesystem, that's a
pretty good problem, which I don't have a solution right now.

> Finally, I think for optmization, ext4 or other fs could add some dir
> inode _tag_ and supports native case-insensitive for these dirs could be
> better....

Agreed. But I'm seeing this as outside the scope of my proposal, since it
is specific to each filesystem. My ext4 adaptation, for instance, falls
back to linear search when it can't find the exact match.

Thanks,

--
Gabriel Krisman Bertazi

2018-02-12 22:43:10

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Hi Gabriel,

Thanks for your reply.

On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
> Gao Xiang <[email protected]> writes:
>
>> Could I express my opinion? I have working on case-insensitive sdcardfs
>> for months.
> Hi Gao,
>
> Thanks for helping out with this topic.
>
>> I think your problem is how we optimise a case-insensitive lookup on the
>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>> and d_hash).
> Are d_compare and d_hash to be considered really disruptive
> performance-wise? Even if they are only used when casefold/encoding
> support is enabled? I don't see how we could better use the dcache
> without at least requiring these functions to handle CI cases.
I mean if both "Android" and "anDroid" exist in the same directory of
on-disk ext4,
I could tend to make both "Android" and "anDroid" accessed by
the case-exact name lookup, otherwise do a case-insensitive lookup. eg.
(my current implementation),
1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", O_EXCL|O_CREAT)
3. open("/mnt-ci/Android") --- exactly Android
4. open("/mnt-ci/anDroid") --- exactly anDroid
5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive
lookup.
6. unlink("/mnt-ci/Android") --- exactly Android
7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.

>> In that case, we could not trust the negative dentry when _creating_ a
>> case-insensitive file, for example:
>> there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>> the negative "Android", if we lookup "Android" we will get the
>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>> necessary.
> In my previous email, I mentioned my current implementation ignores
> negative dentries and forces a ->lookup(), which walks over the disk
> entries. (I had to add a fix to the creation path in the vfs-ms_casefold
> branch to exactly match that description, so you might have missed the
> updated version in that branch).
I am sorry, I haven't looked into your CI patch yet.

I decided to join in this topic because in the past months, I did some work
on the case-insensitive sdcardfs, and I just want to express my thoughts
for this topic.

I don't know which level negative dentries you ignore, your
case-insensitive negative dentries
or the original underlayfs(eg. ext4) case-sensitive negative dentries?

Actually I observed some false "negative dentries" race or deadlock
if multiple case-insensitive mounts exist and do fs-ops on these mounts
at the same time,
but I am not sure whether your implementation has these potential issues
in principle or not.

In addition, quote `
Our customer is interested in exposing a subtree of an existing
filesystem (native Linux filesystems, xfs, ext4 and others) in an
case-insensitive lookup manner, without paying the cost of a userspace
getdents implementation, and, preferably, without requiring the user to
modify data on the disk.`

I think the *most expensive* operation for case-insensitive lookup is to
create a not-exist file rather than do a existed-file lookup.
It takes much overhead and I think no straightforward way to directly
reduce the cost
and improve its performance.

>
> Either way, this case is supported like this:
>
> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
> case-sensitive and case-insensitive, respectively, We can do:
>
> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
> open("/mnt/Android", 0) = -2 No such file or directory
> open("/mnt-ci/Android", 0) = 4
> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>
> The second open() is expected to create an negative_dentry of "Android",
> which, if it wasn't ignored by the 3th open(), the CI operation would
> have failed. Notice that the 3th open() operation actually opens the
Yes, the 3th open() should invalidate the 2nd negative dentry.
However, I don't know your detailed implementation behavior, for example
as follows:

1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", 0)                 ---- dentry A
3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same
dentry A?

if 2 and 3 are the different denties but the same inode, furthermore,
if the inode is a directory inode, it seems like a hard link directory,
any *potential deadlock* with that?

and how about?
1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
2. close("/mnt/anDroid") --- still positive, referenced
3. open("/mnt/Android", 0) --- No such file or directory
4. open("/mnt-ci/Android", 0) --- positive, inode A
5. close("/mnt-ci/Android") --- still positive, referenced
6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and
5) referenced, can the inode finally be evicted? ---
8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---

> file that was created by the first open(). It doesn't create a new
> file.
>
> Following on, the 4th operation (file creation) *must fail* because
> there is a CI name collision with /mnt-ci/anDroid. The same is true for
> the final case.
>
>> I could give another example, if we uses case-insentive ext4 and create
>> "Android" and "anDroid", how to deal with the case in the
>> case-insensitive way?
>> I mean in that case we should make both "Android" and "anDroid" can
>> access, right?
> Not sure if I follow you here, but I'm assuming we create Android and
> anDroid in the sensitive mountpoint, because, otherwise the
> second file creation in the insensitive mountpoint would fail.
>
> This is the case where I'm hiding one of the previously (CS) created
> files, when in the insensitive mountpoint, and the user is shooting
> himself. For the sensitive case, Both stays visible to the user.
As I mentioned above....
1. open("/mnt/anDroid", O_EXCL|O_CREAT)
2. open("/mnt/Android", O_EXCL|O_CREAT)
3. open("/mnt-ci/Android") --- exactly "Android"
4. open("/mnt-ci/anDroid") --- exactly "anDroid"
5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a
case-insensitive lookup.
6. unlink("/mnt-ci/Android") --- exactly "Android"
7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive lookup.

>> I think we need to build a special case-sensitive dcache rather than
>> a case-insensitive dcache following the native case-insentive fs(use
>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
> What do you think about the second part of my proposal, where I mention
> dealing differently with negative dentries created by a CI lookup?
> We don't need to ignore them if we can invalidate them after a creation
> in the directory.
I feel some complex of your second part... I still need some time to
look into that...
and I think it is useless of negative dentries for that case
intuitively.....
>
>> Finally, I agree "let the user shot herself in the foot by having two
>> files with the exact CI name", but I think it could not the VFS
>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>> case-insensitive lookup" has their _perfered_ way (for example,
>> "android" and "Android" exist, A perfers android and B perfers Android.
> I don't see how we could defer the decision to the filesystem, that's a
> pretty good problem, which I don't have a solution right now.
>
>> Finally, I think for optmization, ext4 or other fs could add some dir
>> inode _tag_ and supports native case-insensitive for these dirs could be
>> better....
> Agreed. But I'm seeing this as outside the scope of my proposal, since it
> is specific to each filesystem. My ext4 adaptation, for instance, falls
> back to linear search when it can't find the exact match.
>
> Thanks,
>
I could miss something important, if I recall, I will reply in the
future....

Thanks,

2018-02-13 22:20:47

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Gao Xiang <[email protected]> writes:

> Hi Gabriel,
>
> Thanks for your reply.
>
> On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
>> Gao Xiang <[email protected]> writes:
>>
>>> Could I express my opinion? I have working on case-insensitive sdcardfs
>>> for months.
>> Hi Gao,
>>
>> Thanks for helping out with this topic.
>>
>>> I think your problem is how we optimise a case-insensitive lookup on the
>>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>>> and d_hash).
>> Are d_compare and d_hash to be considered really disruptive
>> performance-wise? Even if they are only used when casefold/encoding
>> support is enabled? I don't see how we could better use the dcache
>> without at least requiring these functions to handle CI cases.
> I mean if both "Android" and "anDroid" exist in the same directory of
> on-disk ext4,
> I could tend to make both "Android" and "anDroid" accessed by
> the case-exact name lookup, otherwise do a case-insensitive
> lookup. eg. (my current implementation),
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
> 2. open("/mnt/Android", O_EXCL|O_CREAT)
> 3. open("/mnt-ci/Android") --- exactly Android
> 4. open("/mnt-ci/anDroid") --- exactly anDroid
> 5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive
> lookup.
> 6. unlink("/mnt-ci/Android") --- exactly Android
> 7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.

Hmm, let me correct what I said before. My code is not fetching the
exact-match in this case, it always returns the result of a CI lookup.

Still, I adapted it last night to return the exact-case file, by
touching the lookup hook itself, which I believe allows it to be made a
filesystem decision, despite being much more expensive.

Apart from that, can we revisit this design decision ? It feels a bit
awkward that we can change the file returned by the lookup by doing an
unlink() or creat() on the other mountpoint.

$ echo foo > /mnt/bla
$ cat /mnt-ci/BLA
foo
$ echo bar > /mnt/BLA
$ cat /mnt-ci/BLA <--- Same lookup has different result now
bar

This seems awfully confusing. Maybe that's the nature of CI, but,
wouldn't entirely hiding the duplicated file be less confusing? Is
there any benefit or use case that requires this behavior?


>>> In that case, we could not trust the negative dentry when _creating_ a
>>> case-insensitive file, for example:
>>> there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>>> the negative "Android", if we lookup "Android" we will get the
>>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>>> necessary.
>> In my previous email, I mentioned my current implementation ignores
>> negative dentries and forces a ->lookup(), which walks over the disk
>> entries. (I had to add a fix to the creation path in the vfs-ms_casefold
>> branch to exactly match that description, so you might have missed the
>> updated version in that branch).
> I am sorry, I haven't looked into your CI patch yet.
>
> I decided to join in this topic because in the past months, I did some work
> on the case-insensitive sdcardfs, and I just want to express my thoughts
> for this topic.
>
> I don't know which level negative dentries you ignore, your
> case-insensitive negative dentries
> or the original underlayfs(eg. ext4) case-sensitive negative dentries?

I'm talking about the existing case-sensitive negative dentries. I'm
not adding any dentries other than the ones created by the underlayfs

> Actually I observed some false "negative dentries" race or deadlock
> if multiple case-insensitive mounts exist and do fs-ops on these mounts
> at the same time,
> but I am not sure whether your implementation has these potential issues
> in principle or not.

Do you have a test case? I didn't observe that, but I might not be
stressing my code enough.

>
> In addition, quote `
> Our customer is interested in exposing a subtree of an existing
> filesystem (native Linux filesystems, xfs, ext4 and others) in an
> case-insensitive lookup manner, without paying the cost of a userspace
> getdents implementation, and, preferably, without requiring the user to
> modify data on the disk.`
>
> I think the *most expensive* operation for case-insensitive lookup is to
> create a not-exist file rather than do a existed-file lookup.
> It takes much overhead and I think no straightforward way to directly
> reduce the cost
> and improve its performance.
>
>>
>> Either way, this case is supported like this:
>>
>> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
>> case-sensitive and case-insensitive, respectively, We can do:
>>
>> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
>> open("/mnt/Android", 0) = -2 No such file or directory
>> open("/mnt-ci/Android", 0) = 4
>> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
>> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>>
>> The second open() is expected to create an negative_dentry of "Android",
>> which, if it wasn't ignored by the 3th open(), the CI operation would
>> have failed. Notice that the 3th open() operation actually opens the
> Yes, the 3th open() should invalidate the 2nd negative dentry.
> However, I don't know your detailed implementation behavior, for example
> as follows:
>
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
> 2. open("/mnt/Android", 0)                 ---- dentry A
> 3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same
> dentry A?
>
> if 2 and 3 are the different denties but the same inode, furthermore,
> if the inode is a directory inode, it seems like a hard link directory,
> any *potential deadlock* with that?

2 and 3 are the same dentry, not a duplicate or anything else. The
point is to avoid dealing with multiple dentries kind of complexity.


> and how about?
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
> 2. close("/mnt/anDroid") --- still positive, referenced
> 3. open("/mnt/Android", 0) --- No such file or directory
> 4. open("/mnt-ci/Android", 0) --- positive, inode A
> 5. close("/mnt-ci/Android") --- still positive, referenced
> 6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
> 7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and
> 5) referenced, can the inode finally be evicted? ---
> 8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---

If I understand correctly, since the dentry recovered by 1 and 4 are the
same, 7 should unlink /mnt/anDroid and make it negative, since there are
no more references.

Since /mnt/anDroid no longer exists we are sure that step 8. when
accessing /mnt-ci/Android will return inode B, since that is the only
one remaining. I think this is it.

>> file that was created by the first open(). It doesn't create a new
>> file.
>>
>> Following on, the 4th operation (file creation) *must fail* because
>> there is a CI name collision with /mnt-ci/anDroid. The same is true for
>> the final case.
>>
>>> I could give another example, if we uses case-insentive ext4 and create
>>> "Android" and "anDroid", how to deal with the case in the
>>> case-insensitive way?
>>> I mean in that case we should make both "Android" and "anDroid" can
>>> access, right?
>> Not sure if I follow you here, but I'm assuming we create Android and
>> anDroid in the sensitive mountpoint, because, otherwise the
>> second file creation in the insensitive mountpoint would fail.
>>
>> This is the case where I'm hiding one of the previously (CS) created
>> files, when in the insensitive mountpoint, and the user is shooting
>> himself. For the sensitive case, Both stays visible to the user.
> As I mentioned above....
> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
> 2. open("/mnt/Android", O_EXCL|O_CREAT)
> 3. open("/mnt-ci/Android") --- exactly "Android"
> 4. open("/mnt-ci/anDroid") --- exactly "anDroid"
> 5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a
> case-insensitive lookup.
> 6. unlink("/mnt-ci/Android") --- exactly "Android"
> 7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive
> lookup.

This depends on my question about the behavior of doing an exact name
lookup.

>
>>> I think we need to build a special case-sensitive dcache rather than
>>> a case-insensitive dcache following the native case-insentive fs(use
>>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
>> What do you think about the second part of my proposal, where I mention
>> dealing differently with negative dentries created by a CI lookup?
>> We don't need to ignore them if we can invalidate them after a creation
>> in the directory.
> I feel some complex of your second part... I still need some time to
> look into that...
> and I think it is useless of negative dentries for that case
> intuitively.....
>>
>>> Finally, I agree "let the user shot herself in the foot by having two
>>> files with the exact CI name", but I think it could not the VFS
>>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>>> case-insensitive lookup" has their _perfered_ way (for example,
>>> "android" and "Android" exist, A perfers android and B perfers Android.
>> I don't see how we could defer the decision to the filesystem, that's a
>> pretty good problem, which I don't have a solution right now.
>>
>>> Finally, I think for optmization, ext4 or other fs could add some dir
>>> inode _tag_ and supports native case-insensitive for these dirs could be
>>> better....
>> Agreed. But I'm seeing this as outside the scope of my proposal, since it
>> is specific to each filesystem. My ext4 adaptation, for instance, falls
>> back to linear search when it can't find the exact match.
>>
>> Thanks,
>>
> I could miss something important, if I recall, I will reply in the
> future....
>
> Thanks,

--
Gabriel Krisman Bertazi

2018-02-14 12:27:48

by Gao Xiang

[permalink] [raw]
Subject: Re: [PATCH RFC v2 00/13] NLS/UTF-8 Case-Insensitive lookups for ext4 and VFS proposal

Hi Gabriel,

On 2018/2/14 6:20, Gabriel Krisman Bertazi wrote:
> Gao Xiang <[email protected]> writes:
>
>> Hi Gabriel,
>>
>> Thanks for your reply.
>>
>> On 2018/2/13 3:56, Gabriel Krisman Bertazi wrote:
>>> Gao Xiang <[email protected]> writes:
>>>
>>>> Could I express my opinion? I have working on case-insensitive sdcardfs
>>>> for months.
>>> Hi Gao,
>>>
>>> Thanks for helping out with this topic.
>>>
>>>> I think your problem is how we optimise a case-insensitive lookup on the
>>>> file system with a case-sensitive dcache (I mean d_add and no d_compare
>>>> and d_hash).
>>> Are d_compare and d_hash to be considered really disruptive
>>> performance-wise? Even if they are only used when casefold/encoding
>>> support is enabled? I don't see how we could better use the dcache
>>> without at least requiring these functions to handle CI cases.
>> I mean if both "Android" and "anDroid" exist in the same directory of
>> on-disk ext4,
>> I could tend to make both "Android" and "anDroid" accessed by
>> the case-exact name lookup, otherwise do a case-insensitive
>> lookup. eg. (my current implementation),
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", O_EXCL|O_CREAT)
>> 3. open("/mnt-ci/Android") --- exactly Android
>> 4. open("/mnt-ci/anDroid") --- exactly anDroid
>> 5. open("/mnt-ci/ANDROID") --- Android or anDroid, do a case-insensitive
>> lookup.
>> 6. unlink("/mnt-ci/Android") --- exactly Android
>> 7. unlink("/mnt-ci/ANDROID") --- anDroid, do a case-insensitive lookup.
> Hmm, let me correct what I said before. My code is not fetching the
> exact-match in this case, it always returns the result of a CI lookup.
>
> Still, I adapted it last night to return the exact-case file, by
> touching the lookup hook itself, which I believe allows it to be made a
> filesystem decision, despite being much more expensive.
I don't think it is a purely filesystem decision.
CI selection depends on the users...(eg. some users perfer the smallest
alphabet order,
some users select CI file by time, such as the oldest one)...
IMO, it behaves as a file system filter users could select by themselves
rather than the part of VFS or filesystems.
> Apart from that, can we revisit this design decision ? It feels a bit
> awkward that we can change the file returned by the lookup by doing an
> unlink() or creat() on the other mountpoint.
>
> $ echo foo > /mnt/bla
> $ cat /mnt-ci/BLA
> foo
> $ echo bar > /mnt/BLA
> $ cat /mnt-ci/BLA <--- Same lookup has different result now
> bar
>
> This seems awfully confusing. Maybe that's the nature of CI, but,
> wouldn't entirely hiding the duplicated file be less confusing? Is
> there any benefit or use case that requires this behavior?
It seems no confusing... bacause the user creates a new case-exact
"/mnt/BLA", and
the first "cat /mnt-ci/BLA" and the second "cat /mnt-ci/BLA" are not the
same fd.
If the exact-case file exist, I think we should *give a way* to get
these files by the exact-case names on the CI mountpoint, and that is
why I said
"a case-insensitive lookup on the file system with a case-sensitive dcache",
since it can create/rename a case-exact file on the original
case-sensitive mnt.

BTW, If you feel confused, how about the following?

1) echo foo > /mnt/bla
2) echo bar > /mnt/BLA
3) cat /mnt-ci/bla    --- assume it is "/mnt/bla"
4) rm /mnt/bla
3) cat /mnt-ci/bla   ---> it will be "/mnt/BLA" because the user
operates /mnt

I don't think these two have too much difference...
>
>>>> In that case, we could not trust the negative dentry when _creating_ a
>>>> case-insensitive file, for example:
>>>> there exists "anDroid" on-disk, but ext4's in-memory dcache only has
>>>> the negative "Android", if we lookup "Android" we will get the
>>>> _negative_ dentry, but we _cannot_ create it since "anDroid" exists on
>>>> disk. In the create case, an on-disk _iterate_ (or readdir) is
>>>> necessary.
>>> In my previous email, I mentioned my current implementation ignores
>>> negative dentries and forces a ->lookup(), which walks over the disk
>>> entries. (I had to add a fix to the creation path in the vfs-ms_casefold
>>> branch to exactly match that description, so you might have missed the
>>> updated version in that branch).
>> I am sorry, I haven't looked into your CI patch yet.
>>
>> I decided to join in this topic because in the past months, I did some work
>> on the case-insensitive sdcardfs, and I just want to express my thoughts
>> for this topic.
>>
>> I don't know which level negative dentries you ignore, your
>> case-insensitive negative dentries
>> or the original underlayfs(eg. ext4) case-sensitive negative dentries?
> I'm talking about the existing case-sensitive negative dentries. I'm
> not adding any dentries other than the ones created by the underlayfs
OK, if you don't create duplicated dentries, I don't think it has the
following two potential issues.
I have no more question at the moment, and I need to look into your
detailed implementation first.
Thanks for your reply.

Thanks,
>> Actually I observed some false "negative dentries" race or deadlock
>> if multiple case-insensitive mounts exist and do fs-ops on these mounts
>> at the same time,
>> but I am not sure whether your implementation has these potential issues
>> in principle or not.
> Do you have a test case? I didn't observe that, but I might not be
> stressing my code enough.
>
>> In addition, quote `
>> Our customer is interested in exposing a subtree of an existing
>> filesystem (native Linux filesystems, xfs, ext4 and others) in an
>> case-insensitive lookup manner, without paying the cost of a userspace
>> getdents implementation, and, preferably, without requiring the user to
>> modify data on the disk.`
>>
>> I think the *most expensive* operation for case-insensitive lookup is to
>> create a not-exist file rather than do a existed-file lookup.
>> It takes much overhead and I think no straightforward way to directly
>> reduce the cost
>> and improve its performance.
>>
>>> Either way, this case is supported like this:
>>>
>>> If we have two bind-mounts of the same directory, /mnt and /mnt-ci,
>>> case-sensitive and case-insensitive, respectively, We can do:
>>>
>>> open("/mnt/anDroid", O_EXCL|O_CREAT) = 3
>>> open("/mnt/Android", 0) = -2 No such file or directory
>>> open("/mnt-ci/Android", 0) = 4
>>> open("/mnt-ci/Android", O_EXCL|O_CREAT) = -17 File exists
>>> open("/mnt-ci/AndROID", O_EXCL|O_CREAT) = -17 File exists
>>>
>>> The second open() is expected to create an negative_dentry of "Android",
>>> which, if it wasn't ignored by the 3th open(), the CI operation would
>>> have failed. Notice that the 3th open() operation actually opens the
>> Yes, the 3th open() should invalidate the 2nd negative dentry.
>> However, I don't know your detailed implementation behavior, for example
>> as follows:
>>
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", 0)                 ---- dentry A
>> 3. open("/mnt-ci/Android", 0)             ---- CI-dentry B or the same
>> dentry A?
>>
>> if 2 and 3 are the different denties but the same inode, furthermore,
>> if the inode is a directory inode, it seems like a hard link directory,
>> any *potential deadlock* with that?
> 2 and 3 are the same dentry, not a duplicate or anything else. The
> point is to avoid dealing with multiple dentries kind of complexity.
>
>
>> and how about?
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)             --- inode A
>> 2. close("/mnt/anDroid") --- still positive, referenced
>> 3. open("/mnt/Android", 0) --- No such file or directory
>> 4. open("/mnt-ci/Android", 0) --- positive, inode A
>> 5. close("/mnt-ci/Android") --- still positive, referenced
>> 6. open("/mnt/android", O_EXCL|O_CREAT)             --- inode B
>> 7. unlink("/mnt/anDroid")   --- expected behavior? since we have 2) and
>> 5) referenced, can the inode finally be evicted? ---
>> 8. open("/mnt-ci/Android", 0) --- ??? positive and inode B? ---
> If I understand correctly, since the dentry recovered by 1 and 4 are the
> same, 7 should unlink /mnt/anDroid and make it negative, since there are
> no more references.
>
> Since /mnt/anDroid no longer exists we are sure that step 8. when
> accessing /mnt-ci/Android will return inode B, since that is the only
> one remaining. I think this is it.
>
>>> file that was created by the first open(). It doesn't create a new
>>> file.
>>>
>>> Following on, the 4th operation (file creation) *must fail* because
>>> there is a CI name collision with /mnt-ci/anDroid. The same is true for
>>> the final case.
>>>
>>>> I could give another example, if we uses case-insentive ext4 and create
>>>> "Android" and "anDroid", how to deal with the case in the
>>>> case-insensitive way?
>>>> I mean in that case we should make both "Android" and "anDroid" can
>>>> access, right?
>>> Not sure if I follow you here, but I'm assuming we create Android and
>>> anDroid in the sensitive mountpoint, because, otherwise the
>>> second file creation in the insensitive mountpoint would fail.
>>>
>>> This is the case where I'm hiding one of the previously (CS) created
>>> files, when in the insensitive mountpoint, and the user is shooting
>>> himself. For the sensitive case, Both stays visible to the user.
>> As I mentioned above....
>> 1. open("/mnt/anDroid", O_EXCL|O_CREAT)
>> 2. open("/mnt/Android", O_EXCL|O_CREAT)
>> 3. open("/mnt-ci/Android") --- exactly "Android"
>> 4. open("/mnt-ci/anDroid") --- exactly "anDroid"
>> 5. open("/mnt-ci/ANDROID") --- "Android" or "anDroid", do a
>> case-insensitive lookup.
>> 6. unlink("/mnt-ci/Android") --- exactly "Android"
>> 7. unlink("/mnt-ci/ANDROID") --- "anDroid", do a case-insensitive
>> lookup.
> This depends on my question about the behavior of doing an exact name
> lookup.
>
>>>> I think we need to build a special case-sensitive dcache rather than
>>>> a case-insensitive dcache following the native case-insentive fs(use
>>>> d_add_ci, d_compare and d_hash, eg. fat, ntfs...)
>>> What do you think about the second part of my proposal, where I mention
>>> dealing differently with negative dentries created by a CI lookup?
>>> We don't need to ignore them if we can invalidate them after a creation
>>> in the directory.
>> I feel some complex of your second part... I still need some time to
>> look into that...
>> and I think it is useless of negative dentries for that case
>> intuitively.....
>>>> Finally, I agree "let the user shot herself in the foot by having two
>>>> files with the exact CI name", but I think it could not the VFS
>>>> _busniess_ itself since each customer solution "case-sensitive ext4 ->
>>>> case-insensitive lookup" has their _perfered_ way (for example,
>>>> "android" and "Android" exist, A perfers android and B perfers Android.
>>> I don't see how we could defer the decision to the filesystem, that's a
>>> pretty good problem, which I don't have a solution right now.
>>>
>>>> Finally, I think for optmization, ext4 or other fs could add some dir
>>>> inode _tag_ and supports native case-insensitive for these dirs could be
>>>> better....
>>> Agreed. But I'm seeing this as outside the scope of my proposal, since it
>>> is specific to each filesystem. My ext4 adaptation, for instance, falls
>>> back to linear search when it can't find the exact match.
>>>
>>> Thanks,
>>>
>> I could miss something important, if I recall, I will reply in the
>> future....
>>
>> Thanks,