LinuxLists.cc - [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support

2019-03-18 20:28:03

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support

Hi,

This version pretty much the same as v5. I am resending cause as the
previous version didn't grab much discussion on the main topic of moving
from KD to D.

Same as version 5, at a first glance, you will notice the series got a
lot smaller, with the separation of unicode code from the NLS subsystem,
as Linus requested. The ext4 parts are pretty much the same, with only
the addition of a verification in ext4_feature_set_ok() to fail encoding
mounts when without CONFIG_UNICODE on newer kernels.

The main change presented here is a proposal to migrate the
normalization method from NFKD to NFD. After our discussions, and
reviewing other operating systems and languages aspects, I am more
convinced that canonical decomposition is more viable solution than
compatibility decomposition, because it doesn't ignore eliminate any
semantic meaning, like the definitive case of superscript numbers. NFD
is also the documented method used by HFS+ and APFS, so there is
precedent. Notice however, that as far as my research goes, APFS doesn't
completely follows NFD, and in some cases, like <compat> flags, it
actually does NFKD, but not in others (<fraction>), where it applies the
canonical form. We take a more consistent approach and always do plain NFD.

This RFC, therefore, aims to resume/start conversation with some
stalkeholders that may have something to say regarding the normalization
method used. I added people from SMB, NFS and FS development who
might be interested on this.

Regarding Casefold, I am unsure whether Casefold Common + Full still
makes sense after migrating from the compatibility to the canonical
form. While Casefold Full, by definition, addresses cases where the
casefolding grows in size, like the casefold of the german eszett to SS,
it also is responsible for folding smallcase ligatures without a
corresponding uppercase to their compatible counterpart. Which means
that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
+F directories they will match. This seems unaceptable to me,
suggesting that we should start to use Common + Simple instead of Common
+ Full, but I would like more input on what seems more reasonable to
you.

After we decide on this, I will be sending new patches to update
e2fsprogs to the agreed method and remove the normalization/casefold
type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
patch series for inclusion in the kernel.

For the record, I am aware that unicode 12 was released 2 weeks ago. The
world can't live without a new set of emojis every 6 months. I will
withold updating the unicode version until we get something
upstreamable, then I will update to the latest version and send a new
version. This way I avoid having to update versions that will never
actually be used.

Practical things, w.r.t. this patch series:

- As usual, the UCD files are not part of the series, because they
would cause the email to bounce. To test this one would need to fetch
the files as explained in the commit message.

- If you prefer, you can checkout from
https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls

- More details on the design decisions restricted to ext4 are
available in the corresponding commit messages.

Thanks!

Gabriel Krisman Bertazi (7):
unicode: Implement higher level API for string handling
unicode: Introduce test module for normalized utf8 implementation
MAINTAINERS: Add Unicode subsystem entry
ext4: Include encoding information in the superblock
ext4: Support encoding-aware file name lookups
ext4: Implement EXT4_CASEFOLD_FL flag
docs: ext4.rst: Document encoding and case-insensitive

Olaf Weber (4):
unicode: Add unicode character database files
scripts: add trie generator for UTF-8
unicode: Introduce code for UTF-8 normalization
unicode: reduce the size of utf8data[]

Documentation/admin-guide/ext4.rst | 41 +
MAINTAINERS | 6 +
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/ext4/dir.c | 43 +
fs/ext4/ext4.h | 42 +-
fs/ext4/hash.c | 38 +-
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 2 +-
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 18 +
fs/ext4/namei.c | 104 +-
fs/ext4/super.c | 91 +
fs/unicode/Kconfig | 13 +
fs/unicode/Makefile | 22 +
fs/unicode/ucd/README | 33 +
fs/unicode/utf8-core.c | 183 ++
fs/unicode/utf8-norm.c | 797 +++++++
fs/unicode/utf8-selftest.c | 320 +++
fs/unicode/utf8n.h | 117 +
include/linux/fs.h | 2 +
include/linux/unicode.h | 30 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3418 ++++++++++++++++++++++++++++
24 files changed, 5307 insertions(+), 22 deletions(-)
create mode 100644 fs/unicode/Kconfig
create mode 100644 fs/unicode/Makefile
create mode 100644 fs/unicode/ucd/README
create mode 100644 fs/unicode/utf8-core.c
create mode 100644 fs/unicode/utf8-norm.c
create mode 100644 fs/unicode/utf8-selftest.c
create mode 100644 fs/unicode/utf8n.h
create mode 100644 include/linux/unicode.h
create mode 100644 scripts/mkutf8data.c

--
2.20.1

2019-03-18 20:28:06

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 01/11] unicode: Add unicode character database files

From: Olaf Weber <[email protected]>

Add files from the Unicode Character Database, version 11.0, to the
source. A helper program that generates a trie used for normalization
from these files is part of a separate commit.

- Notes on the update from 8.0.0 and 11.0:

The structure of ucd files and special cases have not experienced any
changes between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of
Cherokee LC characters, which is an interesting case for case-folding.
The update is accompanied by new tests on the test_ucd module to catch
specific cases. No changes to mkutf8data script was required for the
update.

The actual files are not part of the commit submitted to the list
because they are to big and would bounce. Still, they can be obtained
by the following script:

FILES="CaseFolding.txt DerivedAge.txt extracted/DerivedCombiningClass.txt
DerivedCoreProperties.txt NormalizationCorrections.txt
NormalizationTest.txt UnicodeData.txt"
VERSION=11.0.0
BASE=http://www.unicode.org/Public/${VERSION}/ucd

for i in ${FILES} ; do
wget "${BASE}/$i" -O fs/unicode/ucd/$(basename ${i} .txt)-${VERSION}.txt
done

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Move ucd directory to fs/unicode/]
[Update to Unicode 11.0.0]
---
fs/unicode/ucd/README | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
create mode 100644 fs/unicode/ucd/README

diff --git a/fs/unicode/ucd/README b/fs/unicode/ucd/README
new file mode 100644
index 000000000000..5f89017b35ee
--- /dev/null
+++ b/fs/unicode/ucd/README
@@ -0,0 +1,33 @@
+The files in this directory are part of the Unicode Character Database
+for version 11.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/11.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 414436796cf097df55f798e1585448ee CaseFolding-11.0.0.txt
+ 6032a595fbb782694456491d86eecfac DerivedAge-11.0.0.txt
+ 3240997d671297ac754ab0d27577acf7 DerivedCombiningClass-11.0.0.txt
+ 2a4fe257d9d8184518e036194d2248ec DerivedCoreProperties-11.0.0.txt
+ 4e7d383fa0dd3cd9d49d64e5b7b7c9e0 NormalizationCorrections-11.0.0.txt
+ c9500c5b8b88e584469f056023ecc3f2 NormalizationTest-11.0.0.txt
+ acc291106c3758d2025f8d7bd5518bee UnicodeData-11.0.0.txt
--
2.20.1

2019-03-18 20:28:14

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 03/11] unicode: Introduce code for UTF-8 normalization

From: Olaf Weber <[email protected]>

Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfdi and
nfdicf.

nfdi:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.

nfdicf:
- Apply unicode normalization form NFD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFFF are not encoded.
- The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).

From the original SGI patch and for conformity with coding standards,
the utf8data_t typedef was dropped, since it was just masking the struct
keyword. On other occasions, namely utf8leaf_t and utf8trie_t, I
decided to keep it, since they are simple pointers to memory buffers,
and using uchars here wouldn't provide any more meaningful information.

From the original submission, we also converted from the compatibility
form to canonical.

Changes since v4:
- Convert to canonical decomposition form.

Changes since RFC v1:
- utf8_version_is_supported receives maj, min and rev as separate
arguments. (Olaf Weber)

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to Mainline]
[Fix up checkpatch.pl warnings]
[Drop typedefs]
[move out of libxfs]
[Convert from NFKD to NFD]
---
fs/unicode/Makefile | 3 +
fs/unicode/utf8-norm.c | 640 +++++++++++++++++++++++++++++++++++++++++
fs/unicode/utf8n.h | 112 ++++++++
3 files changed, 755 insertions(+)
create mode 100644 fs/unicode/utf8-norm.c
create mode 100644 fs/unicode/utf8n.h

diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
index efc944a75b4b..1ed10e40c30d 100644
--- a/fs/unicode/Makefile
+++ b/fs/unicode/Makefile
@@ -2,6 +2,9 @@

UNICODE_VERSION=11.0.0

+obj-$(CONFIG_UNICODE) += utf8-norm.o
+
+$(obj)/utf8-norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
$(call cmd,mkutf8data)
quiet_cmd_mkutf8data = MKUTF8DATA $@
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
new file mode 100644
index 000000000000..4ed50f3fda6e
--- /dev/null
+++ b/fs/unicode/utf8-norm.c
@@ -0,0 +1,640 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include "utf8n.h"
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int utf8version_is_supported(u8 maj, u8 min, u8 rev)
+{
+ int i = ARRAY_SIZE(utf8agetab) - 1;
+ unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev);
+
+ while (i >= 0 && utf8agetab[i] != 0) {
+ if (sb_utf8version == utf8agetab[i])
+ return 1;
+ i--;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(utf8version_is_supported);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the begin or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
+ size_t len)
+{
+ utf8trie_t *trie = utf8data + data->offset;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!data)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ ccc = LEAF_CCC(leaf);
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if (ccc == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *utf8nfdi(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfdidata) - 1;
+
+ while (maxage < utf8nfdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfdidata[i].maxage)
+ return NULL;
+ return &utf8nfdidata[i];
+}
+EXPORT_SYMBOL(utf8nfdi);
+
+const struct utf8data *utf8nfdicf(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfdicfdata) - 1;
+
+ while (maxage < utf8nfdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfdicf);
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
new file mode 100644
index 000000000000..696e52124296
--- /dev/null
+++ b/fs/unicode/utf8n.h
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ, MIN, REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+/* Highest unicode version supported by the data tables. */
+extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
+
+/*
+ * Look for the correct const struct utf8data for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfdi and nfdicf.
+ *
+ * nfdi:
+ * - Apply unicode normalization form NFD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfdicf:
+ * - Apply unicode normalization form NFD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern const struct utf8data *utf8nfdi(unsigned int maxage);
+extern const struct utf8data *utf8nfdicf(unsigned int maxage);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(const struct utf8data *data, const char *s);
+extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(const struct utf8data *data, const char *s);
+extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the length of the normalized from of the string,
+ * excluding any terminating NULL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(const struct utf8data *data, const char *s);
+extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ const struct utf8data *data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s);
+extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *u8c);
+
+#endif /* UTF8NORM_H */
--
2.20.1

2019-03-18 20:28:16

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 02/11] scripts: add trie generator for UTF-8

From: Olaf Weber <[email protected]>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 11.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix out-of-tree-build]
[Fix checkpatch warnings]
[Merge back robustness fixes from original patch. Requested by Dave Chinner]
[Update makefile to build 11.0.0 ucd files]
[drop references to xfs]
[Convert NFKD to NFD]
---
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/unicode/Kconfig | 8 +
fs/unicode/Makefile | 16 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3189 ++++++++++++++++++++++++++++++++++++++++++
6 files changed, 3216 insertions(+)
create mode 100644 fs/unicode/Kconfig
create mode 100644 fs/unicode/Makefile
create mode 100644 scripts/mkutf8data.c

diff --git a/fs/Kconfig b/fs/Kconfig
index 3e6d3101f3ff..cbbffc8b9ef5 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -317,5 +317,6 @@ endif # NETWORK_FILESYSTEMS

source "fs/nls/Kconfig"
source "fs/dlm/Kconfig"
+source "fs/unicode/Kconfig"

endmenu
diff --git a/fs/Makefile b/fs/Makefile
index 427fec226fae..ab21f060f661 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_EXPORTFS) += exportfs/
obj-$(CONFIG_NFSD) += nfsd/
obj-$(CONFIG_LOCKD) += lockd/
obj-$(CONFIG_NLS) += nls/
+obj-$(CONFIG_UNICODE) += unicode/
obj-$(CONFIG_SYSV_FS) += sysv/
obj-$(CONFIG_CIFS) += cifs/
obj-$(CONFIG_HPFS_FS) += hpfs/
diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig
new file mode 100644
index 000000000000..f41520f57dff
--- /dev/null
+++ b/fs/unicode/Kconfig
@@ -0,0 +1,8 @@
+#
+# UTF-8 normalization
+#
+config UNICODE
+ bool "UTF-8 normalization and casefolding support"
+ help
+ Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding
+ support.
diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
new file mode 100644
index 000000000000..efc944a75b4b
--- /dev/null
+++ b/fs/unicode/Makefile
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0
+
+UNICODE_VERSION=11.0.0
+
+$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
+ $(call cmd,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+ cmd_mkutf8data = $(objtree)/scripts/mkutf8data \
+ -a $(srctree)/$(src)/ucd/DerivedAge-$(UNICODE_VERSION).txt \
+ -c $(srctree)/$(src)/ucd/DerivedCombiningClass-$(UNICODE_VERSION).txt \
+ -p $(srctree)/$(src)/ucd/DerivedCoreProperties-$(UNICODE_VERSION).txt \
+ -d $(srctree)/$(src)/ucd/UnicodeData-$(UNICODE_VERSION).txt \
+ -f $(srctree)/$(src)/ucd/CaseFolding-$(UNICODE_VERSION).txt \
+ -n $(srctree)/$(src)/ucd/NormalizationCorrections-$(UNICODE_VERSION).txt \
+ -t $(srctree)/$(src)/ucd/NormalizationTest-$(UNICODE_VERSION).txt \
+ -o $@
diff --git a/scripts/Makefile b/scripts/Makefile
index 9d442ee050bd..b87e3e0ade4d 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -20,6 +20,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler
hostprogs-$(CONFIG_MODULE_SIG) += sign-file
hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_UNICODE) += mkutf8data

HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
new file mode 100644
index 000000000000..4df1a799f73c
--- /dev/null
+++ b/scripts/mkutf8data.c
@@ -0,0 +1,3189 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int age_valid(unsigned int major, unsigned int minor,
+ unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the begin or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfdi;
+utf8trie_t *nfdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode strictly smaller than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int utf8encode(char *str, unsigned int val)
+{
+ int len;
+
+ if (val < 0x80) {
+ str[0] = val;
+ len = 1;
+ } else if (val < 0x800) {
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_2_BITS;
+ len = 2;
+ } else if (val < 0x10000) {
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_3_BITS;
+ len = 3;
+ } else if (val < 0x110000) {
+ str[3] = val & UTF8_V_MASK;
+ str[3] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_4_BITS;
+ len = 4;
+ } else {
+ printf("%#x: illegal val\n", val);
+ len = 0;
+ }
+ return len;
+}
+
+static unsigned int utf8decode(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
+
+static int utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate an initialize a new internal node.
+ */
+static struct node *alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves comare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align to a cache line (or half a cache line?). */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfdi;
+ unsigned int *utf32nfdicf;
+ char *utf8nfdi;
+ char *utf8nfdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfdi_tree;
+struct tree *nfdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int nfdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfdi && right->utf8nfdi &&
+ strcmp(left->utf8nfdi, right->utf8nfdi) == 0)
+ return 1;
+ if (left->utf8nfdi || right->utf8nfdi)
+ return 0;
+ return 1;
+}
+
+static int nfdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfdicf && right->utf8nfdicf &&
+ strcmp(left->utf8nfdicf, right->utf8nfdicf) == 0)
+ return 1;
+ if (left->utf8nfdicf && right->utf8nfdicf)
+ return 0;
+ if (left->utf8nfdicf || right->utf8nfdicf)
+ return 0;
+ if (left->utf8nfdi && right->utf8nfdi &&
+ strcmp(left->utf8nfdi, right->utf8nfdi) == 0)
+ return 1;
+ if (left->utf8nfdi || right->utf8nfdi)
+ return 0;
+ return 1;
+}
+
+static void nfdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfdi)
+ printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi);
+ printf("\n");
+}
+
+static void nfdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfdicf)
+ printf(" nfdicf \"%s\"", (const char*)leaf->utf8nfdicf);
+ else if (leaf->utf8nfdi)
+ printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi);
+ printf("\n");
+}
+
+static int nfdi_mark(void *l)
+{
+ return 1;
+}
+
+static int nfdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ if (leaf->utf8nfdicf)
+ return 1;
+ return 0;
+}
+
+static int correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return leaf->correction;
+}
+
+static int nfdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfdi)
+ size += strlen(leaf->utf8nfdi) + 1;
+ return size;
+}
+
+static int nfdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfdicf)
+ size += strlen(leaf->utf8nfdicf) + 1;
+ else if (leaf->utf8nfdi)
+ size += strlen(leaf->utf8nfdi) + 1;
+ return size;
+}
+
+static int *nfdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static int *nfdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *nfdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *nfdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static void utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ data->utf8nfdi = strdup(utf);
+ }
+ u = utf;
+ um = data->utf32nfdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ if (!data->utf8nfdi || strcmp(data->utf8nfdi, utf))
+ data->utf8nfdicf = strdup(utf);
+ }
+}
+
+static void utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfdi and nfdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i <= corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfdi_mark;
+ trees[trees_count-2].leaf_mark = nfdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfdicf";
+ trees[i].leaf_equal = nfdicf_equal;
+ trees[i].leaf_print = nfdicf_print;
+ trees[i].leaf_size = nfdicf_size;
+ trees[i].leaf_index = nfdicf_index;
+ trees[i].leaf_emit = nfdicf_emit;
+
+ trees[i+1].type = "nfdi";
+ trees[i+1].leaf_equal = nfdi_equal;
+ trees[i+1].leaf_print = nfdi_print;
+ trees[i+1].leaf_size = nfdi_size;
+ trees[i+1].leaf_index = nfdi_index;
+ trees[i+1].leaf_emit = nfdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
+
+static void trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8encode(keyval, unichar);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfdi = utf8data + trees[trees_count-1].index;
+ nfdicf = utf8data + trees[trees_count-2].index;
+
+ nfdi_tree = &trees[trees_count-1];
+ nfdicf_tree = &trees[trees_count-2];
+}
+
+static void verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8encode(key,unichar);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfdi) {
+ report++;
+ } else if (strcmp(data->utf8nfdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfdicf &&
+ !data->utf8nfdi) {
+ report++;
+ } else if (data->utf8nfdicf) {
+ if (strcmp(data->utf8nfdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfdi);
+ if (leaf) {
+ printf(" gen %d ccc %d"
+ " nfdi -> \"%s\"",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates an a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfdi:\n");
+ printf("\t- Apply unicode normalization form NFD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfdicf:\n");
+ printf("\t- Apply unicode normalization form NFD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 11.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void print_utf32(unsigned int *utf32str)
+{
+ int i;
+
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void print_utf32nfdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfdi);
+ printf("\n");
+}
+
+static void print_utf32nfdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
+
+static void ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static int ignore_compatibility_form(char *type)
+{
+ int i;
+ char *ignored_types[] = {"font", "noBreak", "initial", "medial",
+ "final", "isolated", "circle", "super",
+ "sub", "vertical", "wide", "narrow",
+ "small", "square", "fraction", "compat"};
+
+ for (i = 0 ; i < ARRAY_SIZE(ignored_types); i++)
+ if (strcmp(type, ignored_types[i]) == 0)
+ return 1;
+ return 0;
+}
+
+static void nfdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ char *type;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<') {
+ type = ++s;
+ while (*++s != '>');
+ *s++ = '\0';
+ if(ignore_compatibility_form(type))
+ continue;
+ }
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdi = um;
+
+ if (verbose > 1)
+ print_utf32nfdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void nfdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfdi = um;
+ free(unicode_data[unichar].utf32nfdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfdi = um;
+ free(unicode_data[unichar].utf32nfdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdi = um;
+
+ assert(!unicode_data[unichar].utf32nfdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
+
+static void nfdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdi = um;
+ }
+ /* Add this decomposition to nfdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void nfdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie = utf8data + tree->index;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!tree)
+ return NULL;
+ if (len == 0)
+ return NULL;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ return NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ return NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to trie_nlookup().
+ */
+static utf8leaf_t *utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : string.
+ * len : length of s.
+ * u8c : pointer to cursor.
+ * trie : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, struct tree *tree, const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * trie : utf8trie_t to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, struct tree *tree, const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on succes, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8decode(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[strlen(s) + 1] = -1;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%[^;];%*[^;];%*[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfdi && !*data->utf8nfdi)
+ ignorables = 1;
+ else
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfdi_tree) < 0) {
+ printf("Line %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfdi_init();
+ nfdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfdi_decompose();
+ nfdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfdi_tree);
+ if (verbose > 2)
+ tree_walk(nfdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
--
2.20.1

2019-03-18 20:28:21

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 05/11] unicode: Implement higher level API for string handling

From: Gabriel Krisman Bertazi <[email protected]>

This patch integrates the utf8n patches with some higher level API to
perform UTF-8 string comparison, normalization and casefolding
operations. Implemented is a variation of NFD, and casefold is
performed by doing full casefold on top of NFD. These algorithms are
based on the core implemented by Olaf Weber from SGI.

Changes since v4:
- integrate in fs/unicode

Changes since RFC v1:
- Change error return code from EIO to EINVAL. (Olaf Weber)
- Fix issues with strncmp/strcmp. (Olaf Weber)
- Remove stack buffer in normalization/casefold. (Olaf Weber)
- Include length parameter for second string on comparison functions.
- Change length type to size_t.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/unicode/Makefile | 4 +-
fs/unicode/utf8-core.c | 183 ++++++++++++++++++++++++++++++++++++++++
fs/unicode/utf8-norm.c | 6 ++
fs/unicode/utf8n.h | 1 +
include/linux/unicode.h | 30 +++++++
5 files changed, 223 insertions(+), 1 deletion(-)
create mode 100644 fs/unicode/utf8-core.c
create mode 100644 include/linux/unicode.h

diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
index 1ed10e40c30d..9a9836fcf38b 100644
--- a/fs/unicode/Makefile
+++ b/fs/unicode/Makefile
@@ -2,7 +2,9 @@

UNICODE_VERSION=11.0.0

-obj-$(CONFIG_UNICODE) += utf8-norm.o
+obj-$(CONFIG_UNICODE) += unicode.o
+
+unicode-y := utf8-norm.o utf8-core.o

$(obj)/utf8-norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
diff --git a/fs/unicode/utf8-core.c b/fs/unicode/utf8-core.c
new file mode 100644
index 000000000000..39f4b06dded6
--- /dev/null
+++ b/fs/unicode/utf8-core.c
@@ -0,0 +1,183 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
+#include <linux/errno.h>
+#include <linux/unicode.h>
+
+#include "utf8n.h"
+
+int utf8_validate(const struct unicode_map *um, const struct qstr *str)
+{
+ const struct utf8data *data = utf8nfdi(um->version);
+
+ if (utf8nlen(data, str->name, str->len) < 0)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8_validate);
+
+int utf8_strncmp(const struct unicode_map *um,
+ const struct qstr *s1, const struct qstr *s2)
+{
+ const struct utf8data *data = utf8nfdi(um->version);
+ struct utf8cursor cur1, cur2;
+ int c1, c2;
+
+ if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
+ return -EINVAL;
+
+ if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
+ return -EINVAL;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ return -EINVAL;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+}
+EXPORT_SYMBOL(utf8_strncmp);
+
+int utf8_strncasecmp(const struct unicode_map *um,
+ const struct qstr *s1, const struct qstr *s2)
+{
+ const struct utf8data *data = utf8nfdicf(um->version);
+ struct utf8cursor cur1, cur2;
+ int c1, c2;
+
+ if (utf8ncursor(&cur1, data, s1->name, s1->len) < 0)
+ return -EINVAL;
+
+ if (utf8ncursor(&cur2, data, s2->name, s2->len) < 0)
+ return -EINVAL;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ return -EINVAL;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+}
+EXPORT_SYMBOL(utf8_strncasecmp);
+
+int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
+ unsigned char *dest, size_t dlen)
+{
+ const struct utf8data *data = utf8nfdicf(um->version);
+ struct utf8cursor cur;
+ size_t nlen = 0;
+
+ if (utf8ncursor(&cur, data, str->name, str->len) < 0)
+ return -EINVAL;
+
+ for (nlen = 0; nlen < dlen; nlen++) {
+ dest[nlen] = utf8byte(&cur);
+ if (!dest[nlen])
+ return nlen;
+ if (dest[nlen] == -1)
+ break;
+ }
+ return -EINVAL;
+}
+
+EXPORT_SYMBOL(utf8_casefold);
+
+int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
+ unsigned char *dest, size_t dlen)
+{
+ const struct utf8data *data = utf8nfdi(um->version);
+ struct utf8cursor cur;
+ ssize_t nlen = 0;
+
+ if (utf8ncursor(&cur, data, str->name, str->len) < 0)
+ return -EINVAL;
+
+ for (nlen = 0; nlen < dlen; nlen++) {
+ dest[nlen] = utf8byte(&cur);
+ if (!dest[nlen])
+ return nlen;
+ if (dest[nlen] == -1)
+ break;
+ }
+ return -EINVAL;
+}
+
+EXPORT_SYMBOL(utf8_normalize);
+
+static int utf8_parse_version(const char *version, unsigned int *maj,
+ unsigned int *min, unsigned int *rev)
+{
+ substring_t args[3];
+ char version_string[12];
+ const struct match_token token[] = {
+ {1, "%d.%d.%d"},
+ {0, NULL}
+ };
+
+ strncpy(version_string, version, sizeof(version_string));
+
+ if (match_token(version_string, token, args) != 1)
+ return -EINVAL;
+
+ if (match_int(&args[0], maj) || match_int(&args[1], min) ||
+ match_int(&args[2], rev))
+ return -EINVAL;
+
+ return 0;
+}
+
+struct unicode_map *utf8_load(const char *version)
+{
+ struct unicode_map *um = NULL;
+ int unicode_version;
+
+ if (version) {
+ unsigned int maj, min, rev;
+
+ if (utf8_parse_version(version, &maj, &min, &rev) < 0)
+ return ERR_PTR(-EINVAL);
+
+ if (!utf8version_is_supported(maj, min, rev))
+ return ERR_PTR(-EINVAL);
+
+ unicode_version = UNICODE_AGE(maj, min, rev);
+ } else {
+ unicode_version = utf8version_latest();
+ printk(KERN_WARNING"UTF-8 version not specified. "
+ "Assuming latest supported version (%d.%d.%d).",
+ (unicode_version >> 16) & 0xff,
+ (unicode_version >> 8) & 0xff,
+ (unicode_version & 0xff));
+ }
+
+ um = kzalloc(sizeof(struct unicode_map), GFP_KERNEL);
+ if (!um)
+ return ERR_PTR(-ENOMEM);
+
+ um->charset = "UTF-8";
+ um->version = unicode_version;
+
+ return um;
+}
+EXPORT_SYMBOL(utf8_load);
+
+void utf8_unload(struct unicode_map *um)
+{
+ kfree(um);
+}
+EXPORT_SYMBOL(utf8_unload);
+
+MODULE_LICENSE("GPL v2");
diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 845c0f300370..94e066be3ea6 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -38,6 +38,12 @@ int utf8version_is_supported(u8 maj, u8 min, u8 rev)
}
EXPORT_SYMBOL(utf8version_is_supported);

+int utf8version_latest()
+{
+ return utf8vers;
+}
+EXPORT_SYMBOL(utf8version_latest);
+
/*
* UTF-8 valid ranges.
*
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index b63a9091dc39..a120638014c1 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -32,6 +32,7 @@

/* Highest unicode version supported by the data tables. */
extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
+extern int utf8version_latest(void);

/*
* Look for the correct const struct utf8data for a unicode version.
diff --git a/include/linux/unicode.h b/include/linux/unicode.h
new file mode 100644
index 000000000000..aec2c6d800aa
--- /dev/null
+++ b/include/linux/unicode.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNICODE_H
+#define _LINUX_UNICODE_H
+
+#include <linux/init.h>
+#include <linux/dcache.h>
+
+struct unicode_map {
+ const char *charset;
+ int version;
+};
+
+int utf8_validate(const struct unicode_map *um, const struct qstr *str);
+
+int utf8_strncmp(const struct unicode_map *um,
+ const struct qstr *s1, const struct qstr *s2);
+
+int utf8_strncasecmp(const struct unicode_map *um,
+ const struct qstr *s1, const struct qstr *s2);
+
+int utf8_normalize(const struct unicode_map *um, const struct qstr *str,
+ unsigned char *dest, size_t dlen);
+
+int utf8_casefold(const struct unicode_map *um, const struct qstr *str,
+ unsigned char *dest, size_t dlen);
+
+struct unicode_map *utf8_load(const char *version);
+void utf8_unload(struct unicode_map *um);
+
+#endif /* _LINUX_UNICODE_H */
--
2.20.1

2019-03-18 20:28:19

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 04/11] unicode: reduce the size of utf8data[]

From: Olaf Weber <[email protected]>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix checkpatch errors]
[Extract robustness fixes and merge back to original mkutf8data.c
patch]
---
fs/unicode/utf8-norm.c | 191 ++++++++++++++++++++++---
fs/unicode/utf8n.h | 4 +
scripts/mkutf8data.c | 307 +++++++++++++++++++++++++++++++++++------
3 files changed, 443 insertions(+), 59 deletions(-)

diff --git a/fs/unicode/utf8-norm.c b/fs/unicode/utf8-norm.c
index 4ed50f3fda6e..845c0f300370 100644
--- a/fs/unicode/utf8-norm.c
+++ b/fs/unicode/utf8-norm.c
@@ -98,6 +98,38 @@ static inline int utf8clen(const char *s)
return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

+/*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+ unsigned int uc;
+
+ uc = *str++ & 0x0F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+
+ return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+ str[2] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[1] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[0] = val | 0xE0;
+
+ return 3;
+}
+
/*
* utf8trie_t
*
@@ -159,7 +191,8 @@ typedef const unsigned char utf8trie_t;
* characters with the Default_Ignorable_Code_Point property.
* These do affect normalization, as they all have CCC 0.
*
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
*
* Casefolding, if applicable, is also done using decompositions.
*
@@ -179,6 +212,105 @@ typedef const unsigned char utf8leaf_t;
#define STOPPER (0)
#define DECOMPOSE (255)

+/* Marker for hangul syllable decomposition. */
+#define HANGUL ((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF (12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, TPart, VPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode3(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode3((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -187,8 +319,8 @@ typedef const unsigned char utf8leaf_t;
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
- size_t len)
+static utf8leaf_t *utf8nlookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + data->offset;
int offlen;
@@ -226,8 +358,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie++;
} else {
/* No right node. */
- node = 0;
- trie = NULL;
+ return NULL;
}
} else {
/* Left leg */
@@ -237,8 +368,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie += offlen + 1;
} else if (*trie & RIGHTPATH) {
/* No left node. */
- node = 0;
- trie = NULL;
+ return NULL;
} else {
/* Left node after this node */
node = (*trie & TRIENODE);
@@ -246,6 +376,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -255,9 +393,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
*
* Forwards to utf8nlookup().
*/
-static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+static utf8leaf_t *utf8lookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s)
{
- return utf8nlookup(data, s, (size_t)-1);
+ return utf8nlookup(data, hangul, s, (size_t)-1);
}

/*
@@ -270,11 +409,13 @@ int utf8agemax(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;

@@ -297,12 +438,13 @@ int utf8agemin(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -323,11 +465,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -349,12 +493,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -377,11 +522,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -404,11 +550,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -531,10 +678,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->data, u8c->s);
- else
- leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->data, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -555,7 +704,9 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->data, u8c->s);
+
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ ccc = LEAF_CCC(leaf);
}

/*
diff --git a/fs/unicode/utf8n.h b/fs/unicode/utf8n.h
index 696e52124296..b63a9091dc39 100644
--- a/fs/unicode/utf8n.h
+++ b/fs/unicode/utf8n.h
@@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
extern ssize_t utf8len(const struct utf8data *data, const char *s);
extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);

+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF (12)
+
/*
* Cursor structure used by the normalizer.
*/
@@ -89,6 +92,7 @@ struct utf8cursor {
unsigned int slen;
short int ccc;
short int nccc;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
index 4df1a799f73c..12ce94b43be6 100644
--- a/scripts/mkutf8data.c
+++ b/scripts/mkutf8data.c
@@ -182,10 +182,14 @@ typedef unsigned char utf8leaf_t;
#define MAXCCC (254)
#define STOPPER (0)
#define DECOMPOSE (255)
+#define HANGUL ((char)(255))
+
+#define UTF8HANGULLEAF (12)

struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+ const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);

unsigned char *utf8data;
size_t utf8data_size;
@@ -333,6 +337,8 @@ static int utf32valid(unsigned int unichar)
return unichar < 0x110000;
}

+#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3)
+
#define NODE 1
#define LEAF 0

@@ -463,7 +469,7 @@ static void tree_walk(struct tree *tree)
indent+1);
leaves += 1;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -857,7 +863,7 @@ static void mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
continue;
}
@@ -909,7 +915,7 @@ static void mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
if (!node->mark && node->parent->mark &&
!node->parent->left) {
@@ -992,7 +998,7 @@ static int index_nodes(struct tree *tree, int index)
index += tree->leaf_size(node->right);
count++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1013,6 +1019,25 @@ static int index_nodes(struct tree *tree, int index)
return index;
}

+/*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int mark_subtree(struct node *node)
+{
+ int changed;
+
+ if (!node || node->mark)
+ return 0;
+ node->mark = 1;
+ node->index = node->parent->index;
+ changed = 1;
+ if (node->leftnode == NODE)
+ changed += mark_subtree(node->left);
+ if (node->rightnode == NODE)
+ changed += mark_subtree(node->right);
+ return changed;
+}
+
/*
* Compute the size of nodes and leaves. We start by assuming that
* each node needs to store a three-byte offset. The indexes of the
@@ -1031,6 +1056,7 @@ static int size_nodes(struct tree *tree)
unsigned int bitmask;
unsigned int pathbits;
unsigned int pathmask;
+ unsigned int nbit;
int changed;
int offset;
int size;
@@ -1058,22 +1084,40 @@ static int size_nodes(struct tree *tree)
size = 1;
} else {
if (node->rightnode == NODE) {
+ /*
+ * If the right node is not marked,
+ * look for a corresponding node in
+ * the next tree. Such a node need
+ * not exist.
+ */
right = node->right;
next = tree->next;
while (!right->mark) {
assert(next);
n = next->root;
while (n->bitnum != node->bitnum) {
- if (pathbits & (1<<n->bitnum))
+ nbit = 1 << n->bitnum;
+ if (!(pathmask & nbit))
+ break;
+ if (pathbits & nbit) {
+ if (n->rightnode == LEAF)
+ break;
n = n->right;
- else
+ } else {
+ if (n->leftnode == LEAF)
+ break;
n = n->left;
+ }
}
+ if (n->bitnum != node->bitnum)
+ break;
n = n->right;
- assert(right->bitnum == n->bitnum);
right = n;
next = next->next;
}
+ /* Make sure the right node is marked. */
+ if (!right->mark)
+ changed += mark_subtree(right);
offset = right->index - node->index;
} else {
offset = *tree->leaf_index(tree, node->right);
@@ -1115,7 +1159,7 @@ static int size_nodes(struct tree *tree)
if (node->rightnode == LEAF) {
assert(node->right);
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1148,8 +1192,15 @@ static void emit(struct tree *tree, unsigned char *data)
int offset;
int index;
int indent;
+ int size;
+ int bytes;
+ int leaves;
+ int nodes[4];
unsigned char byte;

+ nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+ leaves = 0;
+ bytes = 0;
index = tree->index;
data += index;
indent = 1;
@@ -1158,7 +1209,10 @@ static void emit(struct tree *tree, unsigned char *data)
if (tree->childnode == LEAF) {
assert(tree->root);
tree->leaf_emit(tree->root, data);
- return;
+ size = tree->leaf_size(tree->root);
+ index += size;
+ leaves++;
+ goto done;
}

assert(tree->childnode == NODE);
@@ -1185,6 +1239,7 @@ static void emit(struct tree *tree, unsigned char *data)
offlen = 2;
else
offlen = 3;
+ nodes[offlen]++;
offset = node->offset;
byte |= offlen << OFFLEN_SHIFT;
*data++ = byte;
@@ -1197,12 +1252,14 @@ static void emit(struct tree *tree, unsigned char *data)
} else if (node->left) {
if (node->leftnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else if (node->right) {
byte |= RIGHTNODE;
if (node->rightnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else {
@@ -1217,7 +1274,10 @@ static void emit(struct tree *tree, unsigned char *data)
assert(node->left);
data = tree->leaf_emit(node->left,
data);
- index += tree->leaf_size(node->left);
+ size = tree->leaf_size(node->left);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->left) {
assert(node->leftnode == NODE);
indent += 1;
@@ -1231,9 +1291,12 @@ static void emit(struct tree *tree, unsigned char *data)
assert(node->right);
data = tree->leaf_emit(node->right,
data);
- index += tree->leaf_size(node->right);
+ size = tree->leaf_size(node->right);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1245,6 +1308,15 @@ static void emit(struct tree *tree, unsigned char *data)
indent -= 1;
}
}
+done:
+ if (verbose > 0) {
+ printf("Emitted %d (%d) leaves",
+ leaves, bytes);
+ printf(" %d (%d+%d+%d+%d) nodes",
+ nodes[0] + nodes[1] + nodes[2] + nodes[3],
+ nodes[0], nodes[1], nodes[2], nodes[3]);
+ printf(" %d total\n", index - tree->index);
+ }
}

/* ------------------------------------------------------------------ */
@@ -1346,8 +1418,12 @@ static void nfdi_print(void *l, int indent)

printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
leaf->code, leaf->ccc, leaf->gen);
- if (leaf->utf8nfdi)
+
+ if (leaf->utf8nfdi && leaf->utf8nfdi[0] == HANGUL)
+ printf(" nfdi \"%s\"", "HANGUL SYLLABLE");
+ else if (leaf->utf8nfdi)
printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi);
+
printf("\n");
}

@@ -1357,8 +1433,11 @@ static void nfdicf_print(void *l, int indent)

printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
leaf->code, leaf->ccc, leaf->gen);
+
if (leaf->utf8nfdicf)
printf(" nfdicf \"%s\"", (const char*)leaf->utf8nfdicf);
+ else if (leaf->utf8nfdi && leaf->utf8nfdi[0] == HANGUL)
+ printf(" nfdi \"%s\"", "HANGUL SYLLABLE");
else if (leaf->utf8nfdi)
printf(" nfdi \"%s\"", (const char*)leaf->utf8nfdi);
printf("\n");
@@ -1388,9 +1467,11 @@ static int correction_mark(void *l)
static int nfdi_size(void *l)
{
struct unicode_data *leaf = l;
-
int size = 2;
- if (leaf->utf8nfdi)
+
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfdi)
size += strlen(leaf->utf8nfdi) + 1;
return size;
}
@@ -1398,9 +1479,11 @@ static int nfdi_size(void *l)
static int nfdicf_size(void *l)
{
struct unicode_data *leaf = l;
-
int size = 2;
- if (leaf->utf8nfdicf)
+
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfdicf)
size += strlen(leaf->utf8nfdicf) + 1;
else if (leaf->utf8nfdi)
size += strlen(leaf->utf8nfdi) + 1;
@@ -1427,7 +1510,11 @@ static unsigned char *nfdi_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfdi) {
+
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfdi) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfdi;
while ((*data++ = *s++) != 0)
@@ -1444,7 +1531,11 @@ static unsigned char *nfdicf_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfdicf) {
+
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfdicf) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfdicf;
while ((*data++ = *s++) != 0)
@@ -1467,6 +1558,11 @@ static void utf8_create(struct unicode_data *data)
unsigned int *um;
int i;

+ if (data->utf8nfdi) {
+ assert(data->utf8nfdi[0] == HANGUL);
+ return;
+ }
+
u = utf;
um = data->utf32nfdi;
if (um) {
@@ -1652,6 +1748,7 @@ static void verify(struct tree *tree)
utf8leaf_t *leaf;
unsigned int unichar;
char key[4];
+ unsigned char hangul[UTF8HANGULLEAF];
int report;
int nocf;

@@ -1665,7 +1762,8 @@ static void verify(struct tree *tree)
if (data->correction <= tree->maxage)
data = &unicode_data[unichar];
utf8encode(key,unichar);
- leaf = utf8lookup(tree, key);
+ leaf = utf8lookup(tree, hangul, key);
+
if (!leaf) {
if (data->gen != -1)
report++;
@@ -1679,7 +1777,10 @@ static void verify(struct tree *tree)
if (data->gen != LEAF_GEN(leaf))
report++;
if (LEAF_CCC(leaf) == DECOMPOSE) {
- if (nocf) {
+ if (HANGUL_SYLLABLE(data->code)) {
+ if (data->utf8nfdi[0] != HANGUL)
+ report++;
+ } else if (nocf) {
if (!data->utf8nfdi) {
report++;
} else if (strcmp(data->utf8nfdi,
@@ -2323,8 +2424,7 @@ static void corrections_init(void)
*
*/

-static void
-hangul_decompose(void)
+static void hangul_decompose(void)
{
unsigned int sb = 0xAC00;
unsigned int lb = 0x1100;
@@ -2368,6 +2468,15 @@ hangul_decompose(void)
memcpy(um, mapping, i * sizeof(unsigned int));
unicode_data[unichar].utf32nfdicf = um;

+ /*
+ * Add a cookie as a reminder that the hangul syllable
+ * decompositions must not be stored in the generated
+ * trie.
+ */
+ unicode_data[unichar].utf8nfdi = malloc(2);
+ unicode_data[unichar].utf8nfdi[0] = HANGUL;
+ unicode_data[unichar].utf8nfdi[1] = '\0';
+
if (verbose > 1)
print_utf32nfdi(unichar);

@@ -2493,6 +2602,99 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *);
int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
int utf8byte(struct utf8cursor *);

+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -2501,7 +2703,8 @@ int utf8byte(struct utf8cursor *);
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
+static utf8leaf_t *utf8nlookup(struct tree *tree, unsigned char *hangul,
+ const char *s, size_t len)
{
utf8trie_t *trie = utf8data + tree->index;
int offlen;
@@ -2557,6 +2760,14 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -2566,9 +2777,10 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
*
* Forwards to trie_nlookup().
*/
-static utf8leaf_t *utf8lookup(struct tree *tree, const char *s)
+static utf8leaf_t *utf8lookup(struct tree *tree, unsigned char *hangul,
+ const char *s)
{
- return utf8nlookup(tree, s, (size_t)-1);
+ return utf8nlookup(tree, hangul, s, (size_t)-1);
}

/*
@@ -2592,11 +2804,14 @@ int utf8agemax(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2616,12 +2831,14 @@ int utf8agemin(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2640,11 +2857,14 @@ int utf8nagemax(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2664,12 +2884,14 @@ int utf8nagemin(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2690,11 +2912,13 @@ ssize_t utf8len(struct tree *tree, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2715,11 +2939,13 @@ ssize_t utf8nlen(struct tree *tree, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2747,6 +2973,7 @@ struct utf8cursor {
short int ccc;
short int nccc;
unsigned int unichar;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
@@ -2854,10 +3081,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->tree, u8c->s);
- else
- leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->tree, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -2877,7 +3106,7 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->tree, u8c->s);
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
ccc = LEAF_CCC(leaf);
}
u8c->unichar = utf8decode(u8c->s);
--
2.20.1

2019-03-18 20:28:25

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 06/11] unicode: Introduce test module for normalized utf8 implementation

From: Gabriel Krisman Bertazi <[email protected]>

This implements a in-kernel sanity test module for the utf8
normalization core. At probe time, it will run basic sequences through
the utf8n core, to identify problems will equivalent sequences and
normalization/casefold code. This is supposed to be useful for
regression testing when adding support for a new version of utf8 to
linux.

Changes since v4:
- integrate with fs/unicode

Changes since RFC v1:
- Include comparison tests for matching strings with different lengths.
- Include tests for characters included in unicode 8.0.0, 9.0.0 and 10.0.0.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/unicode/Kconfig | 5 +
fs/unicode/Makefile | 1 +
fs/unicode/utf8-selftest.c | 320 +++++++++++++++++++++++++++++++++++++
3 files changed, 326 insertions(+)
create mode 100644 fs/unicode/utf8-selftest.c

diff --git a/fs/unicode/Kconfig b/fs/unicode/Kconfig
index f41520f57dff..b560a879edf7 100644
--- a/fs/unicode/Kconfig
+++ b/fs/unicode/Kconfig
@@ -6,3 +6,8 @@ config UNICODE
help
Say Y here to enable UTF-8 NFD normalization and NFD+CF casefolding
support.
+
+config UNICODE_NORMALIZATION_SELFTEST
+ tristate "Test UTF-8 normalization support"
+ depends on UNICODE
+ default n
diff --git a/fs/unicode/Makefile b/fs/unicode/Makefile
index 9a9836fcf38b..b1560c4107bf 100644
--- a/fs/unicode/Makefile
+++ b/fs/unicode/Makefile
@@ -3,6 +3,7 @@
UNICODE_VERSION=11.0.0

obj-$(CONFIG_UNICODE) += unicode.o
+obj-$(CONFIG_UNICODE_NORMALIZATION_SELFTEST) += utf8-selftest.o

unicode-y := utf8-norm.o utf8-core.o

diff --git a/fs/unicode/utf8-selftest.c b/fs/unicode/utf8-selftest.c
new file mode 100644
index 000000000000..492d934d5c1c
--- /dev/null
+++ b/fs/unicode/utf8-selftest.c
@@ -0,0 +1,320 @@
+/*
+ * Kernel module for testing utf-8 support.
+ *
+ * Copyright 2017 Collabora Ltd.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/unicode.h>
+#include <linux/dcache.h>
+
+#include "utf8n.h"
+
+unsigned int failed_tests;
+unsigned int total_tests;
+
+/* Tests will be based on this version. */
+#define latest_maj 11
+#define latest_min 0
+#define latest_rev 0
+
+#define _test(cond, func, line, fmt, ...) do { \
+ total_tests++; \
+ if (!cond) { \
+ failed_tests++; \
+ pr_err("test %s:%d Failed: %s%s", \
+ func, line, #cond, (fmt?":":".")); \
+ if (fmt) \
+ pr_err(fmt, ##__VA_ARGS__); \
+ } \
+ } while (0)
+#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define test(cond) _test(cond, __func__, __LINE__, "")
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[10];
+ unsigned char dec[10];
+} nfdi_test_data[] = {
+ /* Trivial sequence */
+ {
+ /* "ABba" decomposes to itself */
+ .str = "aBba",
+ .dec = "aBba",
+ },
+ /* Simple equivalent sequences */
+ {
+ /* 'VULGAR FRACTION ONE QUARTER' cannot decompose to
+ 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' on
+ canonical decomposition */
+ .str = {0xc2, 0xbc, 0x00},
+ .dec = {0xc2, 0xbc, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to
+ 'LETTER A' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0x00},
+ .dec = {0x61, 0xcc, 0x88, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER LJ' can't decompose to
+ 'LETTER L' + 'LETTER J' on canonical decomposition */
+ .str = {0xC7, 0x89, 0x00},
+ .dec = {0xC7, 0x89, 0x00},
+ },
+ {
+ /* GREEK ANO TELEIA decomposes to MIDDLE DOT */
+ .str = {0xCE, 0x87, 0x00},
+ .dec = {0xC2, 0xB7, 0x00}
+ },
+ /* Canonical ordering */
+ {
+ /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes
+ to A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */
+ .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0},
+ .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'
+ decomposes to
+ 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
+
+ .dec = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},
+ },
+
+};
+
+const static struct {
+ /* UTF-8 strings in this vector _must_ be NULL-terminated. */
+ unsigned char str[30];
+ unsigned char ncf[30];
+} nfdicf_test_data[] = {
+ /* Trivial sequences */
+ {
+ /* "ABba" folds to lowercase */
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00},
+ .ncf = {0x61, 0x62, 0x62, 0x61, 0x00},
+ },
+ {
+ /* All ASCII folds to lower-case */
+ .str = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0.1",
+ .ncf = "abcdefghijklmnopqrstuvwxyz0.1",
+ },
+ {
+ /* LATIN SMALL LETTER SHARP S folds to
+ LATIN SMALL LETTER S + LATIN SMALL LETTER S */
+ .str = {0xc3, 0x9f, 0x00},
+ .ncf = {0x73, 0x73, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to
+ LATIN SMALL LETTER A + COMBINING RING ABOVE */
+ .str = {0xC3, 0x85, 0x00},
+ .ncf = {0x61, 0xcc, 0x8a, 0x00},
+ },
+ /* Introduced by UTF-8.0.0. */
+ /* Cherokee letters are interesting test-cases because they fold
+ to upper-case. Before 8.0.0, Cherokee lowercase were
+ undefined, thus, the folding from LC is not stable between
+ 7.0.0 -> 8.0.0, but it is from UC. */
+ {
+ /* CHEROKEE SMALL LETTER A folds to CHEROKEE LETTER A */
+ .str = {0xea, 0xad, 0xb0, 0x00},
+ .ncf = {0xe1, 0x8e, 0xa0, 0x00},
+ },
+ {
+ /* CHEROKEE SMALL LETTER YE folds to CHEROKEE LETTER YE */
+ .str = {0xe1, 0x8f, 0xb8, 0x00},
+ .ncf = {0xe1, 0x8f, 0xb0, 0x00},
+ },
+ {
+ /* OLD HUNGARIAN CAPITAL LETTER AMB folds to
+ OLD HUNGARIAN SMALL LETTER AMB */
+ .str = {0xf0, 0x90, 0xb2, 0x83, 0x00},
+ .ncf = {0xf0, 0x90, 0xb3, 0x83, 0x00},
+ },
+ /* Introduced by UTF-9.0.0. */
+ {
+ /* OSAGE CAPITAL LETTER CHA folds to
+ OSAGE SMALL LETTER CHA */
+ .str = {0xf0, 0x90, 0x92, 0xb5, 0x00},
+ .ncf = {0xf0, 0x90, 0x93, 0x9d, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER SMALL CAPITAL I folds to
+ LATIN LETTER SMALL CAPITAL I */
+ .str = {0xea, 0x9e, 0xae, 0x00},
+ .ncf = {0xc9, 0xaa, 0x00},
+ },
+ /* Introduced by UTF-11.0.0. */
+ {
+ /* GEORGIAN SMALL LETTER AN folds to GEORGIAN MTAVRULI
+ CAPITAL LETTER AN */
+ .str = {0xe1, 0xb2, 0x90, 0x00},
+ .ncf = {0xe1, 0x83, 0x90, 0x00},
+ }
+};
+
+static void check_utf8_nfdi(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfdi(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) {
+ int len = strlen(nfdi_test_data[i].str);
+ int nlen = strlen(nfdi_test_data[i].dec);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfdi_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfdi_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfdi_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfdi_test_data[i].dec[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfdi_test_data[i].dec[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_nfdicf(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfdicf_test_data); i++) {
+ int len = strlen(nfdicf_test_data[i].str);
+ int nlen = strlen(nfdicf_test_data[i].ncf);
+ int j = 0;
+ unsigned char c;
+
+ test((utf8len(data, nfdicf_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfdicf_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfdicf_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfdicf_test_data[i].ncf[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfdicf_test_data[i].ncf[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_comparisons(void)
+{
+ int i;
+ struct unicode_map *table = utf8_load("11.0.0");
+
+ if (IS_ERR(table)) {
+ pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfdi_test_data); i++) {
+ const struct qstr s1 = {.name = nfdi_test_data[i].str,
+ .len = sizeof(nfdi_test_data[i].str)};
+ const struct qstr s2 = {.name = nfdi_test_data[i].dec,
+ .len = sizeof(nfdi_test_data[i].dec)};
+
+ test_f(!utf8_strncmp(table, &s1, &s2),
+ "%s %s comparison mismatch\n", s1.name, s2.name);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfdicf_test_data); i++) {
+ const struct qstr s1 = {.name = nfdicf_test_data[i].str,
+ .len = sizeof(nfdicf_test_data[i].str)};
+ const struct qstr s2 = {.name = nfdicf_test_data[i].ncf,
+ .len = sizeof(nfdicf_test_data[i].ncf)};
+
+ test_f(!utf8_strncasecmp(table, &s1, &s2),
+ "%s %s comparison mismatch\n", s1.name, s2.name);
+ }
+
+ utf8_unload(table);
+}
+
+static void check_supported_versions(void)
+{
+ /* Unicode 7.0.0 should be supported. */
+ test(utf8version_is_supported(7, 0, 0));
+
+ /* Unicode 9.0.0 should be supported. */
+ test(utf8version_is_supported(9, 0, 0));
+
+ /* Unicode 1x.0.0 (the latest version) should be supported. */
+ test(utf8version_is_supported(latest_maj, latest_min, latest_rev));
+
+ /* Next versions don't exist. */
+ test(!utf8version_is_supported(12, 0, 0));
+ test(!utf8version_is_supported(0, 0, 0));
+ test(!utf8version_is_supported(-1, -1, -1));
+}
+
+static int __init init_test_ucd(void)
+{
+ failed_tests = 0;
+ total_tests = 0;
+
+ check_supported_versions();
+ check_utf8_nfdi();
+ check_utf8_nfdicf();
+ check_utf8_comparisons();
+
+ if (!failed_tests)
+ pr_info("All %u tests passed\n", total_tests);
+ else
+ pr_err("%u out of %u tests failed\n", failed_tests,
+ total_tests);
+ return 0;
+}
+
+static void __exit exit_test_ucd(void)
+{
+}
+
+module_init(init_test_ucd);
+module_exit(exit_test_ucd);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi <[email protected]>");
+MODULE_LICENSE("GPL");
--
2.20.1

2019-03-18 20:28:27

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 07/11] MAINTAINERS: Add Unicode subsystem entry

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e17ebf70b548..e2fe062a7977 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15926,6 +15926,12 @@ F: drivers/uwb/
F: include/linux/uwb.h
F: include/linux/uwb/

+UNICODE SUBSYSTEM:
+M: Gabriel Krisman Bertazi <[email protected]>
+L: [email protected]
+S: Supported
+F: fs/unicode/
+
UNICORE32 ARCHITECTURE:
M: Guan Xuetao <[email protected]>
W: http://mprc.pku.edu.cn/~guanxuetao/linux
--
2.20.1

2019-03-18 20:28:32

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 08/11] ext4: Include encoding information in the superblock

From: Gabriel Krisman Bertazi <[email protected]>

Support for encoding is considered an incompatible feature, since it has
potential to create collisions of file names in existing filesystems.
If the feature flag is not enabled, the entire filesystem will operate
on opaque byte sequences, respecting the original behavior.

The s_encoding field stores a magic number indicating the encoding
format and version used globally by file and directory names in the
filesystem. The s_encoding_flags defines policies for using the charset
encoding, like how to handle invalid sequences. The magic number is
mapped to the exact charset table, but the mapping is specific to ext4.
Since we don't have any commitment to support old encodings, the only
encoding I am supporting right now is utf8-11.0.

The current implementation prevents the user from enabling encoding and
per-directory encryption on the same filesystem at the same time. The
incompatibility between these features lies in how we do efficient
directory searches when we cannot be sure the encryption of the user
provided fname will match the actual hash stored in the disk without
decrypting every directory entry, because of normalization cases. My
quickest solution is to simply block the concurrent use of these
features for now, and enable it later, once we have a better solution.

Changes since v4:
- Move from fs/nls to fs/unicode

Changes since v2:
- Split superblock bitfield reservation into another patch.
- Rename s_ioencoding -> s_encoding
- Remove encoding_info from in-memory superblock.

Changes since v1:
- Guard code with CONFIG_NLS.
- Use 16 bits for s_ioencoding.
- Split mount option from this patch

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 21 +++++++++++-
fs/ext4/super.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 82ffdacdc7fa..2039270be616 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1313,7 +1313,9 @@ struct ext4_super_block {
__u8 s_first_error_time_hi;
__u8 s_last_error_time_hi;
__u8 s_pad[2];
- __le32 s_reserved[96]; /* Padding to the end of the block */
+ __le16 s_encoding; /* Filename charset encoding */
+ __le16 s_encoding_flags; /* Filename charset encoding flags */
+ __le32 s_reserved[95]; /* Padding to the end of the block */
__le32 s_checksum; /* crc32c(superblock) */
};

@@ -1338,6 +1340,16 @@ struct ext4_super_block {
/* Number of quota types we support */
#define EXT4_MAXQUOTAS 3

+#define EXT4_ENC_UTF8_11_0 1
+
+/*
+ * Flags for ext4_sb_info.s_encoding_flags.
+ */
+#define EXT4_ENC_STRICT_MODE_FL (1 << 0)
+
+#define ext4_has_strict_mode(sbi) \
+ (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1387,6 +1399,10 @@ struct ext4_sb_info {
struct kobject s_kobj;
struct completion s_kobj_unregister;
struct super_block *s_sb;
+#ifdef CONFIG_UNICODE
+ struct unicode_map *s_encoding;
+ __u16 s_encoding_flags;
+#endif

/* Journaling */
struct journal_s *s_journal;
@@ -1663,6 +1679,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_INCOMPAT_LARGEDIR 0x4000 /* >2GB or 3-lvl htree */
#define EXT4_FEATURE_INCOMPAT_INLINE_DATA 0x8000 /* data in inode */
#define EXT4_FEATURE_INCOMPAT_ENCRYPT 0x10000
+#define EXT4_FEATURE_INCOMPAT_FNAME_ENCODING 0x20000

extern void ext4_update_dynamic_rev(struct super_block *sb);

@@ -1756,6 +1773,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(csum_seed, CSUM_SEED)
EXT4_FEATURE_INCOMPAT_FUNCS(largedir, LARGEDIR)
EXT4_FEATURE_INCOMPAT_FUNCS(inline_data, INLINE_DATA)
EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT)
+EXT4_FEATURE_INCOMPAT_FUNCS(fname_encoding, FNAME_ENCODING)

#define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
@@ -1783,6 +1801,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT)
EXT4_FEATURE_INCOMPAT_MMP | \
EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
EXT4_FEATURE_INCOMPAT_ENCRYPT | \
+ EXT4_FEATURE_INCOMPAT_FNAME_ENCODING | \
EXT4_FEATURE_INCOMPAT_CSUM_SEED | \
EXT4_FEATURE_INCOMPAT_LARGEDIR)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f5b828bf1299..f9a194c086e4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -42,6 +42,7 @@
#include <linux/cleancache.h>
#include <linux/uaccess.h>
#include <linux/iversion.h>
+#include <linux/unicode.h>

#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1044,6 +1045,9 @@ static void ext4_put_super(struct super_block *sb)
crypto_free_shash(sbi->s_chksum_driver);
kfree(sbi->s_blockgroup_lock);
fs_put_dax(sbi->s_daxdev);
+#ifdef CONFIG_UNICODE
+ utf8_unload(sbi->s_encoding);
+#endif
kfree(sbi);
}

@@ -1740,6 +1744,36 @@ static const struct mount_opts {
{Opt_err, 0, 0}
};

+#ifdef CONFIG_UNICODE
+static const struct ext4_sb_encodings {
+ __u16 magic;
+ char *name;
+ char *version;
+} ext4_sb_encoding_map[] = {
+ {EXT4_ENC_UTF8_11_0, "utf8", "11.0.0"},
+};
+
+static int ext4_sb_read_encoding(const struct ext4_super_block *es,
+ const struct ext4_sb_encodings **encoding,
+ __u16 *flags)
+{
+ __u16 magic = le16_to_cpu(es->s_encoding);
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++)
+ if (magic == ext4_sb_encoding_map[i].magic)
+ break;
+
+ if (i >= ARRAY_SIZE(ext4_sb_encoding_map))
+ return -EINVAL;
+
+ *encoding = &ext4_sb_encoding_map[i];
+ *flags = le16_to_cpu(es->s_encoding_flags);
+
+ return 0;
+}
+#endif
+
static int handle_mount_opt(struct super_block *sb, char *opt, int token,
substring_t *args, unsigned long *journal_devnum,
unsigned int *journal_ioprio, int is_remount)
@@ -2870,6 +2904,15 @@ static int ext4_feature_set_ok(struct super_block *sb, int readonly)
return 0;
}

+#ifndef CONFIG_UNICODE
+ if (ext4_has_feature_fname_encoding(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "Filesystem with fname_encoding feature cannot be "
+ "mounted without CONFIG_UNICODE");
+ return 0;
+ }
+#endif
+
if (readonly)
return 1;

@@ -3729,6 +3772,43 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
&journal_ioprio, 0))
goto failed_mount;

+#ifdef CONFIG_UNICODE
+ if (ext4_has_feature_fname_encoding(sb) && !sbi->s_encoding) {
+ const struct ext4_sb_encodings *encoding_info;
+ struct unicode_map *encoding;
+ __u16 encoding_flags;
+
+ if (ext4_has_feature_encrypt(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "Can't mount with encoding and encryption");
+ goto failed_mount;
+ }
+
+ if (ext4_sb_read_encoding(es, &encoding_info,
+ &encoding_flags)) {
+ ext4_msg(sb, KERN_ERR,
+ "Encoding requested by superblock is unknown");
+ goto failed_mount;
+ }
+
+ encoding = utf8_load(encoding_info->version);
+ if (IS_ERR(encoding)) {
+ ext4_msg(sb, KERN_ERR,
+ "can't mount with superblock charset: %s-%s "
+ "not supported by the kernel. flags: 0x%x.",
+ encoding_info->name, encoding_info->version,
+ encoding_flags);
+ goto failed_mount;
+ }
+ ext4_msg(sb, KERN_INFO,"Using encoding defined by superblock: "
+ "%s-%s with flags 0x%hx", encoding_info->name,
+ encoding_info->version?:"\b", encoding_flags);
+
+ sbi->s_encoding = encoding;
+ sbi->s_encoding_flags = encoding_flags;
+ }
+#endif
+
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
"with data=journal disables delayed "
@@ -4568,6 +4648,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount:
if (sbi->s_chksum_driver)
crypto_free_shash(sbi->s_chksum_driver);
+
+#ifdef CONFIG_UNICODE
+ utf8_unload(sbi->s_encoding);
+#endif
+
#ifdef CONFIG_QUOTA
for (i = 0; i < EXT4_MAXQUOTAS; i++)
kfree(sbi->s_qf_names[i]);
--
2.20.1

2019-03-18 20:28:35

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 09/11] ext4: Support encoding-aware file name lookups

From: Gabriel Krisman Bertazi <[email protected]>

This patch implements the actual support for encoding-aware file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the fname_encoding feature set is able to find
files even if the name used by userspace is not a byte per byte match,
but is an equivalent unicode string. This operation will be called and
inexact-match name search.

* dcache handling:

Ext4 only stores the first equivalent name dentry used in the
dcache. This is done to prevent unintentional duplication of dentries in
the dcache, while also allowing the VFS code to quickly find the right
entry in the cache despite what equivalent string was used without
resorting to ->lookup().

d_hash() is implemented as the hash of the normalized string, such that
we always have a well-known bucket for all the equivalencies of the same
string. d_compare uses the nls_strncmp() infrastructure, which should
handle the comparison of equivalent names as well. If the filesystem's
normalization type is PLAIN, though, we can just reuse the VFS hash.

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

* on-disk data:

Despite using a specific version of the name as the internal
representation within the dcache, the name stored and fetch from the
disk is a byte-per-byte match with what the user requested, making this
implementation Name-preserving. i.e. no actual information is lost when
writing to the storage.

DX is supported by modifying the hashes to make them encoding-aware.
The new disk hashes are also calculated as the hash of the normalized
string, instead of the string directly. This allows us to efficiently
search for file names in the htree without requiring the user to provide
the exact name.

* Dealing with invalid sequences:

By default, when a invalid UTF-8 sequence is identified, ext4 will treat
it as an opaque byte sequence, ignoring the encoding and reverting to
the old behavior for that unique file. This means that inexact-match
lookups and (future) casefold will not work for that file only. An
optional bit can be set in the superblock telling the filesystem code
and userspace tools to enforce the encoding. When that flag is set, any
attempt to create a file name using an invalid UTF-8 sequence will fail.

* Normalization algorithm:

The UTF-8 algorithm used to lookup and compare strings in ext4 is
implemented by fs/unicode, and is based on a previous developed by SGI.
It implements the Canonical decomposition (NFD) algorithm described by
the Unicode specification 11.0, or higher, combined with the elimination
of ignorable code points (NFDi) as documented in fs/unicode/utf8_norm.c.

NFD seems to be the best normalization method because:

- It has a lower cost than NFC/NFKC (which requires
decomposing to NFD as an intermediary step)
- It doesn't eliminate important semantic meaning like NFKD.

But:

- It ignores valid equivalent sequences in some languages, which
would be invalid on anothers: the ff ligature, in o_ff_i_c_e, and the
usual spelling o_f_f_i_c_e, for instance, will no longer match. On
the other hand, the ash grapheme (formed by a ligature of AE )
can be represented as a different letter than A plus E, which is
correct in Danish, Norwegian, Icelandic, and Faroese.

- This implementation is not linguistic accurate, because different
languages have conflicting rules, which would require the
specialization of the filesystem to a given locale, which brings all
sorts of problems for removable media and for users who use more
than one language.

Changes since v4:
- Migrate NFKD -> NFD

Changes since v2:
- Don't use d_add_ci.
- Squash the dcache hooks into this patch.
- Rename sbi->encoding -> sbi->s_encoding.

Changes since v1:
- Support normalized htree hashes.
- Guard code with CONFIG_NLS.
- Use qstr->len instead of strlen in dcache hookups.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 40 +++++++++++++++++++
fs/ext4/ext4.h | 12 +++++-
fs/ext4/hash.c | 35 ++++++++++++++++-
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 2 +-
fs/ext4/namei.c | 100 +++++++++++++++++++++++++++++++++++++++++------
fs/ext4/super.c | 6 +++
7 files changed, 181 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 0ccd51f72048..ab1bc1c43063 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -26,6 +26,7 @@
#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/iversion.h>
+#include <linux/unicode.h>
#include "ext4.h"
#include "xattr.h"

@@ -660,3 +661,42 @@ const struct file_operations ext4_dir_operations = {
.open = ext4_dir_open,
.release = ext4_release_dir,
};
+
+#ifdef CONFIG_UNICODE
+static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ struct qstr qstr = {.name = str, .len = len };
+
+ return ext4_encoding_cmp(dentry->d_parent->d_inode, name, &qstr);
+}
+
+static int ext4_d_hash(const struct dentry *dentry, struct qstr *str)
+{
+ const struct ext4_sb_info *sbi = EXT4_SB(dentry->d_sb);
+ const struct unicode_map *um = sbi->s_encoding;
+ unsigned char *norm;
+ int len, ret = 0;
+
+ norm = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!norm)
+ return -ENOMEM;
+
+ len = utf8_normalize(um, str, norm, PATH_MAX);
+
+ if (len < 0) {
+ if (ext4_has_strict_mode(sbi))
+ ret = -EINVAL;
+ goto out;
+ }
+ str->hash = full_name_hash(dentry, norm, len);
+out:
+ kfree(norm);
+ return ret;
+}
+
+const struct dentry_operations ext4_dentry_ops = {
+ .d_hash = ext4_d_hash,
+ .d_compare = ext4_d_compare,
+};
+#endif
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2039270be616..18035350b28f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2396,8 +2396,8 @@ extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh,
extern int ext4_sync_file(struct file *, loff_t, loff_t, int);

/* hash.c */
-extern int ext4fs_dirhash(const char *name, int len, struct
- dx_hash_info *hinfo);
+extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
+ struct dx_hash_info *hinfo);

/* ialloc.c */
extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
@@ -2993,6 +2993,10 @@ static inline void ext4_unlock_group(struct super_block *sb,
/* dir.c */
extern const struct file_operations ext4_dir_operations;

+#ifdef CONFIG_UNICODE
+extern const struct dentry_operations ext4_dentry_ops;
+#endif
+
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
@@ -3085,6 +3089,10 @@ extern void initialize_dirent_tail(struct ext4_dir_entry_tail *t,
extern int ext4_handle_dirty_dirent_node(handle_t *handle,
struct inode *inode,
struct buffer_head *bh);
+extern int ext4_encoding_cmp(const struct inode *parent,
+ const struct qstr *name,
+ const struct qstr *entry);
+
#define S_SHIFT 12
static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
[S_IFREG >> S_SHIFT] = EXT4_FT_REG_FILE,
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 46b24da33a28..5a33c72efc56 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -6,6 +6,7 @@
*/

#include <linux/fs.h>
+#include <linux/unicode.h>
#include <linux/compiler.h>
#include <linux/bitops.h>
#include "ext4.h"
@@ -196,7 +197,8 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
* represented, and whether or not the returned hash is 32 bits or 64
* bits. 32 bit hashes will return 0 for the minor hash.
*/
-int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
+static int __ext4fs_dirhash(const char *name, int len,
+ struct dx_hash_info *hinfo)
{
__u32 hash;
__u32 minor_hash = 0;
@@ -268,3 +270,34 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
hinfo->minor_hash = minor_hash;
return 0;
}
+
+int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
+ struct dx_hash_info *hinfo)
+{
+#ifdef CONFIG_UNICODE
+ const struct unicode_map *um = EXT4_SB(dir->i_sb)->s_encoding;
+ int r, dlen;
+ unsigned char *buff;
+ struct qstr qstr = {.name = name, .len = len };
+
+ if (len && um) {
+ buff = kzalloc(sizeof(char) * PATH_MAX, GFP_KERNEL);
+ if (!buff)
+ return -1;
+
+ dlen = utf8_normalize(um, &qstr, buff, PATH_MAX);
+
+ if (dlen < 0) {
+ kfree(buff);
+ goto opaque_seq;
+ }
+
+ r = __ext4fs_dirhash(buff, dlen, hinfo);
+
+ kfree(buff);
+ return r;
+ }
+opaque_seq:
+#endif
+ return __ext4fs_dirhash(name, len, hinfo);
+}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f3e17a8c84b4..764ff4c56233 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -455,7 +455,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
if (qstr) {
hinfo.hash_version = DX_HASH_HALF_MD4;
hinfo.seed = sbi->s_hash_seed;
- ext4fs_dirhash(qstr->name, qstr->len, &hinfo);
+ ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo);
grp = hinfo.hash;
} else
grp = prandom_u32();
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 56f6e1782d5f..f73bc3925282 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1407,7 +1407,7 @@ int htree_inlinedir_to_tree(struct file *dir_file,
}
}

- ext4fs_dirhash(de->name, de->name_len, hinfo);
+ ext4fs_dirhash(dir, de->name, de->name_len, hinfo);
if ((hinfo->hash < start_hash) ||
((hinfo->hash == start_hash) &&
(hinfo->minor_hash < start_minor_hash)))
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 980166a8122a..0caaa210be8c 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/bio.h>
#include <linux/iversion.h>
+#include <linux/unicode.h>
#include "ext4.h"
#include "ext4_jbd2.h"

@@ -629,7 +630,7 @@ static struct stats dx_show_leaf(struct inode *dir,
}
if (!fscrypt_has_encryption_key(dir)) {
/* Directory is not encrypted */
- ext4fs_dirhash(de->name,
+ ext4fs_dirhash(dir, de->name,
de->name_len, &h);
printk("%*.s:(U)%x.%u ", len,
name, h.hash,
@@ -662,8 +663,8 @@ static struct stats dx_show_leaf(struct inode *dir,
name = fname_crypto_str.name;
len = fname_crypto_str.len;
}
- ext4fs_dirhash(de->name, de->name_len,
- &h);
+ ext4fs_dirhash(dir, de->name,
+ de->name_len, &h);
printk("%*.s:(E)%x.%u ", len, name,
h.hash, (unsigned) ((char *) de
- base));
@@ -673,7 +674,7 @@ static struct stats dx_show_leaf(struct inode *dir,
#else
int len = de->name_len;
char *name = de->name;
- ext4fs_dirhash(de->name, de->name_len, &h);
+ ext4fs_dirhash(dir, de->name, de->name_len, &h);
printk("%*.s:%x.%u ", len, name, h.hash,
(unsigned) ((char *) de - base));
#endif
@@ -762,7 +763,7 @@ dx_probe(struct ext4_filename *fname, struct inode *dir,
hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed;
if (fname && fname_name(fname))
- ext4fs_dirhash(fname_name(fname), fname_len(fname), hinfo);
+ ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo);
hash = hinfo->hash;

if (root->info.unused_flags & 1) {
@@ -1008,7 +1009,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
/* silently ignore the rest of the block */
break;
}
- ext4fs_dirhash(de->name, de->name_len, hinfo);
+ ext4fs_dirhash(dir, de->name, de->name_len, hinfo);
if ((hinfo->hash < start_hash) ||
((hinfo->hash == start_hash) &&
(hinfo->minor_hash < start_minor_hash)))
@@ -1197,7 +1198,7 @@ static int dx_make_map(struct inode *dir, struct ext4_dir_entry_2 *de,

while ((char *) de < base + blocksize) {
if (de->name_len && de->inode) {
- ext4fs_dirhash(de->name, de->name_len, &h);
+ ext4fs_dirhash(dir, de->name, de->name_len, &h);
map_tail--;
map_tail->hash = h.hash;
map_tail->offs = ((char *) de - base)>>2;
@@ -1252,15 +1253,45 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block)
dx_set_count(entries, count + 1);
}

+#ifdef CONFIG_UNICODE
+int ext4_encoding_cmp(const struct inode *parent, const struct qstr *name,
+ const struct qstr *entry)
+{
+ const struct ext4_sb_info *sbi = EXT4_SB(parent->i_sb);
+ const struct unicode_map *um = sbi->s_encoding;
+ int ret;
+
+ ret = utf8_strncmp(um, name, entry);
+ if (ret < 0) {
+ /* Handle invalid character sequence as either an error
+ * or as an opaque byte sequence.
+ */
+ if (ext4_has_strict_mode(sbi))
+ return -EINVAL;
+
+ if (name->len != entry->len)
+ return 1;
+
+ return !!memcmp(name->name, entry->name, name->len);
+ }
+
+ return ret;
+}
+#endif
+
/*
* Test whether a directory entry matches the filename being searched for.
*
* Return: %true if the directory entry matches, otherwise %false.
*/
-static inline bool ext4_match(const struct ext4_filename *fname,
+static inline bool ext4_match(const struct inode *parent,
+ const struct ext4_filename *fname,
const struct ext4_dir_entry_2 *de)
{
struct fscrypt_name f;
+#ifdef CONFIG_UNICODE
+ const struct qstr entry = {.name = de->name, .len = de->name_len};
+#endif

if (!de->inode)
return false;
@@ -1270,6 +1301,12 @@ static inline bool ext4_match(const struct ext4_filename *fname,
#ifdef CONFIG_FS_ENCRYPTION
f.crypto_buf = fname->crypto_buf;
#endif
+
+#ifdef CONFIG_UNICODE
+ if (EXT4_SB(parent->i_sb)->s_encoding)
+ return !ext4_encoding_cmp(parent, fname->usr_fname, &entry);
+#endif
+
return fscrypt_match_name(&f, de->name, de->name_len);
}

@@ -1290,7 +1327,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
if ((char *) de + de->name_len <= dlimit &&
- ext4_match(fname, de)) {
+ ext4_match(dir, fname, de)) {
/* found a match - just to be sure, do
* a full check */
if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
@@ -1588,6 +1625,17 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
return ERR_PTR(-EPERM);
}
}
+
+#ifdef CONFIG_UNICODE
+ if (EXT4_SB(dir->i_sb)->s_encoding && !inode) {
+ /* Eventually we want to call d_add_ci(dentry, NULL)
+ * for negative dentries in the encoding case as
+ * well. For now, prevent the negative dentry
+ * from being cached.
+ */
+ return NULL;
+ }
+#endif
return d_splice_alias(inode, dentry);
}

@@ -1798,7 +1846,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
if (ext4_check_dir_entry(dir, NULL, de, bh,
buf, buf_size, offset))
return -EFSCORRUPTED;
- if (ext4_match(fname, de))
+ if (ext4_match(dir, fname, de))
return -EEXIST;
nlen = EXT4_DIR_REC_LEN(de->name_len);
rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
@@ -1983,7 +2031,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
if (fname->hinfo.hash_version <= DX_HASH_TEA)
fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed;
- ext4fs_dirhash(fname_name(fname), fname_len(fname), &fname->hinfo);
+ ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), &fname->hinfo);

memset(frames, 0, sizeof(frames));
frame = frames;
@@ -2036,6 +2084,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
struct ext4_dir_entry_2 *de;
struct ext4_dir_entry_tail *t;
struct super_block *sb;
+ struct ext4_sb_info *sbi;
struct ext4_filename fname;
int retval;
int dx_fallback=0;
@@ -2047,10 +2096,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
csum_size = sizeof(struct ext4_dir_entry_tail);

sb = dir->i_sb;
+ sbi = EXT4_SB(sb);
blocksize = sb->s_blocksize;
if (!dentry->d_name.len)
return -EINVAL;

+#ifdef CONFIG_UNICODE
+ if (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL &&
+ utf8_validate(sbi->s_encoding, &dentry->d_name))
+ return -EINVAL;
+#endif
+
retval = ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname);
if (retval)
return retval;
@@ -2975,6 +3031,17 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
ext4_update_dx_flag(dir);
ext4_mark_inode_dirty(handle, dir);

+#ifdef CONFIG_UNICODE
+ /* VFS negative dentries are incompatible with Encoding and
+ * Case-insensitiveness. Eventually we'll want avoid
+ * invalidating the dentries here, alongside with returning the
+ * negative dentries at ext4_lookup(), when it is better
+ * supported by the VFS for the CI case.
+ */
+ if (EXT4_SB(dir->i_sb)->s_encoding)
+ d_invalidate(dentry);
+#endif
+
end_rmdir:
brelse(bh);
if (handle)
@@ -3044,6 +3111,17 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
inode->i_ctime = current_time(inode);
ext4_mark_inode_dirty(handle, inode);

+#ifdef CONFIG_UNICODE
+ /* VFS negative dentries are incompatible with Encoding and
+ * Case-insensitiveness. Eventually we'll want avoid
+ * invalidating the dentries here, alongside with returning the
+ * negative dentries at ext4_lookup(), when it is better
+ * supported by the VFS for the CI case.
+ */
+ if (EXT4_SB(dir->i_sb)->s_encoding)
+ d_invalidate(dentry);
+#endif
+
end_unlink:
brelse(bh);
if (handle)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f9a194c086e4..29f06f4fb0c7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4443,6 +4443,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
iput(root);
goto failed_mount4;
}
+
+#ifdef CONFIG_UNICODE
+ if (sbi->s_encoding)
+ sb->s_d_op = &ext4_dentry_ops;
+#endif
+
sb->s_root = d_make_root(root);
if (!sb->s_root) {
ext4_msg(sb, KERN_ERR, "get root dentry failed");
--
2.20.1

2019-03-18 20:28:38

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 10/11] ext4: Implement EXT4_CASEFOLD_FL flag

From: Gabriel Krisman Bertazi <[email protected]>

Casefold is implemented as a special case of the fname_encoding feature,
thus requiring it to be enabled. This design mechanism works
semantically fine, because we consider that without an specified
encoding the casefold operation cannot be defined.

Despite the common core implementation, the Casefold setting can be
applied to directories individually, making entire subtrees of the
filesystem case-insensitive or not. The flag is set as an inode
attribute applied to directories and inherited by its children which
states that the directory requires case-insensitive searches. This flag
can only be enabled on empty directories for filesystems that support
the encoding feature, thus preventing collision of file names that only
differ by case.

Changes since v2:
- Rename sbi->encoding -> sbi->s_encoding.

Changes since v1:
- Moved the CASEFOLD_FL to prevent collision with reserved verity flag.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 5 ++++-
fs/ext4/ext4.h | 9 +++++----
fs/ext4/hash.c | 5 ++++-
fs/ext4/inode.c | 4 +++-
fs/ext4/ioctl.c | 18 ++++++++++++++++++
fs/ext4/namei.c | 6 +++++-
include/linux/fs.h | 2 ++
7 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index ab1bc1c43063..5118723a5d9a 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -682,7 +682,10 @@ static int ext4_d_hash(const struct dentry *dentry, struct qstr *str)
if (!norm)
return -ENOMEM;

- len = utf8_normalize(um, str, norm, PATH_MAX);
+ if (!IS_CASEFOLDED(dentry->d_inode))
+ len = utf8_normalize(um, str, norm, PATH_MAX);
+ else
+ len = utf8_casefold(um, str, norm, PATH_MAX);

if (len < 0) {
if (ext4_has_strict_mode(sbi))
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 18035350b28f..21ad8d71008c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -399,10 +399,11 @@ struct flex_groups {
#define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
#define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */
#define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
+#define EXT4_CASEFOLD_FL 0x40000000 /* Casefolded file */
#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */

-#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE 0x704BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE 0x604BC0FF /* User modifiable flags */

/* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
#define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \
@@ -417,10 +418,10 @@ struct flex_groups {
EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
- EXT4_PROJINHERIT_FL)
+ EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)

/* Flags that are appropriate for regular files (all but dir-specific ones). */
-#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
+#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL | EXT4_CASEFOLD_FL))

/* Flags that are appropriate for non-directories/regular files. */
#define EXT4_OTHER_FLMASK (EXT4_NODUMP_FL | EXT4_NOATIME_FL)
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 5a33c72efc56..f1a30e1353f6 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -285,7 +285,10 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
if (!buff)
return -1;

- dlen = utf8_normalize(um, &qstr, buff, PATH_MAX);
+ if (!IS_CASEFOLDED(dir))
+ dlen = utf8_normalize(um, &qstr, buff, PATH_MAX);
+ else
+ dlen = utf8_casefold(um, &qstr, buff, PATH_MAX);

if (dlen < 0) {
kfree(buff);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b54b261ded36..9a990f5557b5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4738,9 +4738,11 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_DAX;
if (flags & EXT4_ENCRYPT_FL)
new_fl |= S_ENCRYPTED;
+ if (flags & EXT4_CASEFOLD_FL)
+ new_fl |= S_CASEFOLD;
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX|
- S_ENCRYPTED);
+ S_ENCRYPTED|S_CASEFOLD);
}

static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 3c4f8bb59f8a..d61312d9236d 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -278,6 +278,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
struct ext4_iloc iloc;
unsigned int oldflags, mask, i;
unsigned int jflag;
+ struct super_block *sb = inode->i_sb;

/* Is it quota file? Do not allow user to mess with it */
if (ext4_is_quota_file(inode))
@@ -322,6 +323,23 @@ static int ext4_ioctl_setflags(struct inode *inode,
goto flags_out;
}

+ if ((flags ^ oldflags) & EXT4_CASEFOLD_FL) {
+ if (!ext4_has_feature_fname_encoding(sb)) {
+ err = -EOPNOTSUPP;
+ goto flags_out;
+ }
+
+ if (!S_ISDIR(inode->i_mode)) {
+ err = -ENOTDIR;
+ goto flags_out;
+ }
+
+ if (!ext4_empty_dir(inode)) {
+ err = -ENOTEMPTY;
+ goto flags_out;
+ }
+ }
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 0caaa210be8c..173f45321162 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1261,7 +1261,11 @@ int ext4_encoding_cmp(const struct inode *parent, const struct qstr *name,
const struct unicode_map *um = sbi->s_encoding;
int ret;

- ret = utf8_strncmp(um, name, entry);
+ if (!IS_CASEFOLDED(parent))
+ ret = utf8_strncmp(um, name, entry);
+ else
+ ret = utf8_strncasecmp(um, name, entry);
+
if (ret < 0) {
/* Handle invalid character sequence as either an error
* or as an opaque byte sequence.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b42df09b04c..6261090e605b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1953,6 +1953,7 @@ struct super_operations {
#define S_DAX 0 /* Make all the DAX code disappear */
#endif
#define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */
+#define S_CASEFOLD 32768 /* Casefolded file */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1993,6 +1994,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
+#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)

#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
--
2.20.1

2019-03-18 20:28:41

by Gabriel Krisman Bertazi

[permalink] [raw]

Subject: [PATCH RFC v6 11/11] docs: ext4.rst: Document encoding and case-insensitive

From: Gabriel Krisman Bertazi <[email protected]>

Introduces the encoding-awareness and case-insensitive features on ext4
for system administrators. Explain the minimum of design decisions that
are important for sysadmins wanting to enable this feature.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Documentation/admin-guide/ext4.rst | 41 ++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index e506d3dae510..4e08d0309f1e 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -91,10 +91,51 @@ Currently Available
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
the ordering)
+* Encoding aware file names
+* Case insensitive file name lookups

[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.

+Encoding-aware file names and case-insensitive lookups
+======================================================
+
+Ext4 optionally supports filesystem-wide charset knowledge when handling
+file names, which allows the user to perform file system lookups using
+charset equivalent versions of the same file name, and optionally ensure
+that no invalid names are held by the filesystem. charset encoding
+awareness is also essential for performing case-insensitive lookups,
+because it is what defines the casefold operation.
+
+The case-insensitive file name lookup feature is supported in a smaller
+granularity, on a per-directory basis, allowing the user to mix
+case-insensitive and case-sensitive directories in the same filesystem.
+It is enabled by flipping a file attribute on an empty directory. For
+the reason stated above, the filesystem must have encoding enabled to
+use this feature.
+
+Both encoding-awareness and case-awareness are name-preserving on the
+disk, meaning that the file name provided by userspace is a
+byte-per-byte match to what is actually written in the disk. The
+Unicode normalization format used by the kernel is thus an internal
+representation, and not exposed to the userspace nor to the disk, with
+the important exception of disk hashes, used on large directories with
+DX feature. On DX directories, the hash must be calculated using the
+normalized version of the filename, meaning that the normalization
+format used actually has an impact on where the directory entry is
+stored.
+
+When we change from viewing filenames as opaque byte sequences to seeing
+them as encoded strings we need to address what happens when a program
+tries to create a file with an invalid name. The Unicode subsystem
+within the kernel leaves the decision of what to do in this case to the
+filesystem, which select its preferred behavior by enabling/disabling
+the strict mode. When Ext4 encounters one of those strings and the
+filesystem did not require strict mode, it falls back to considering the
+entire string as an opaque byte sequence, which still allows the user to
+operate on that file but the case-insensitive and equivalent sequence
+lookups won't work.
+
Options
=======

--
2.20.1

2019-03-21 22:30:43

by Randy Dunlap

[permalink] [raw]

Subject: Re: [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support

On 3/18/19 1:27 PM, Gabriel Krisman Bertazi wrote:
> Hi,
>
> This version pretty much the same as v5. I am resending cause as the
> previous version didn't grab much discussion on the main topic of moving
> from KD to D.
>
> Same as version 5, at a first glance, you will notice the series got a
> lot smaller, with the separation of unicode code from the NLS subsystem,
> as Linus requested. The ext4 parts are pretty much the same, with only
> the addition of a verification in ext4_feature_set_ok() to fail encoding
> mounts when without CONFIG_UNICODE on newer kernels.
>
> The main change presented here is a proposal to migrate the
> normalization method from NFKD to NFD. After our discussions, and
> reviewing other operating systems and languages aspects, I am more
> convinced that canonical decomposition is more viable solution than
> compatibility decomposition, because it doesn't ignore eliminate any
> semantic meaning, like the definitive case of superscript numbers. NFD
> is also the documented method used by HFS+ and APFS, so there is
> precedent. Notice however, that as far as my research goes, APFS doesn't
> completely follows NFD, and in some cases, like <compat> flags, it
> actually does NFKD, but not in others (<fraction>), where it applies the
> canonical form. We take a more consistent approach and always do plain NFD.
>
> This RFC, therefore, aims to resume/start conversation with some
> stalkeholders that may have something to say regarding the normalization
> method used. I added people from SMB, NFS and FS development who
> might be interested on this.
>
> Regarding Casefold, I am unsure whether Casefold Common + Full still
> makes sense after migrating from the compatibility to the canonical
> form. While Casefold Full, by definition, addresses cases where the
> casefolding grows in size, like the casefold of the german eszett to SS,
> it also is responsible for folding smallcase ligatures without a
> corresponding uppercase to their compatible counterpart. Which means
> that on -F directories, o_f_f_i_c_e and o_ff_i_c_e will differ, while on
> +F directories they will match. This seems unaceptable to me,
> suggesting that we should start to use Common + Simple instead of Common
> + Full, but I would like more input on what seems more reasonable to
> you.
>
> After we decide on this, I will be sending new patches to update
> e2fsprogs to the agreed method and remove the normalization/casefold
> type flags (EXT4_UTF8_NORMALIZATION_TYPE_NFKD,
> EXT4_UTF8_CASEFOLD_TYPE_NFKDCF), before actually proposing the current
> patch series for inclusion in the kernel.
>
> For the record, I am aware that unicode 12 was released 2 weeks ago. The
> world can't live without a new set of emojis every 6 months. I will
> withold updating the unicode version until we get something
> upstreamable, then I will update to the latest version and send a new
> version. This way I avoid having to update versions that will never
> actually be used.
>
> Practical things, w.r.t. this patch series:
>
> - As usual, the UCD files are not part of the series, because they
> would cause the email to bounce. To test this one would need to fetch
> the files as explained in the commit message.
>
> - If you prefer, you can checkout from
> https://gitlab.collabora.com/krisman/linux -b ext4-ci-directory-no-nls
>
> - More details on the design decisions restricted to ext4 are
> available in the corresponding commit messages.
>
> Thanks!
>

Hi,
I briefly scanned but did not look terribly closely:

Does this patch series ignore ext3 filesystems that are being handled
by the ext4fs code?

Thanks.

>
> Gabriel Krisman Bertazi (7):
> unicode: Implement higher level API for string handling
> unicode: Introduce test module for normalized utf8 implementation
> MAINTAINERS: Add Unicode subsystem entry
> ext4: Include encoding information in the superblock
> ext4: Support encoding-aware file name lookups
> ext4: Implement EXT4_CASEFOLD_FL flag
> docs: ext4.rst: Document encoding and case-insensitive
>
> Olaf Weber (4):
> unicode: Add unicode character database files
> scripts: add trie generator for UTF-8
> unicode: Introduce code for UTF-8 normalization
> unicode: reduce the size of utf8data[]
>
> Documentation/admin-guide/ext4.rst | 41 +
> MAINTAINERS | 6 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/ext4/dir.c | 43 +
> fs/ext4/ext4.h | 42 +-
> fs/ext4/hash.c | 38 +-
> fs/ext4/ialloc.c | 2 +-
> fs/ext4/inline.c | 2 +-
> fs/ext4/inode.c | 4 +-
> fs/ext4/ioctl.c | 18 +
> fs/ext4/namei.c | 104 +-
> fs/ext4/super.c | 91 +
> fs/unicode/Kconfig | 13 +
> fs/unicode/Makefile | 22 +
> fs/unicode/ucd/README | 33 +
> fs/unicode/utf8-core.c | 183 ++
> fs/unicode/utf8-norm.c | 797 +++++++
> fs/unicode/utf8-selftest.c | 320 +++
> fs/unicode/utf8n.h | 117 +
> include/linux/fs.h | 2 +
> include/linux/unicode.h | 30 +
> scripts/Makefile | 1 +
> scripts/mkutf8data.c | 3418 ++++++++++++++++++++++++++++
> 24 files changed, 5307 insertions(+), 22 deletions(-)
> create mode 100644 fs/unicode/Kconfig
> create mode 100644 fs/unicode/Makefile
> create mode 100644 fs/unicode/ucd/README
> create mode 100644 fs/unicode/utf8-core.c
> create mode 100644 fs/unicode/utf8-norm.c
> create mode 100644 fs/unicode/utf8-selftest.c
> create mode 100644 fs/unicode/utf8n.h
> create mode 100644 include/linux/unicode.h
> create mode 100644 scripts/mkutf8data.c
>

--
~Randy

2019-03-23 00:01:59

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH RFC v6 00/11] Ext4 Encoding and Case-insensitive support

On Thu, Mar 21, 2019 at 03:30:35PM -0700, Randy Dunlap wrote:
> I briefly scanned but did not look terribly closely:
>
> Does this patch series ignore ext3 filesystems that are being handled
> by the ext4fs code?

Like all ext2/3/4 features, new functionality is gated by a feature
bit in the compat, rocompat, or incompat bitmasks. A file system
implementation (e.g., in the Linux kernel, NetBSD/FreeBSD, Grub,
et. al) which sees a feature bit it doesn't recognize in an rocompat
bitmask will only allow the file system to be mounted read-only. If
it sees a bit in the incompat bitmask it doesn't recognize, it won't
allow the file system to be mounted at all. The kernel doesn't care
if it finds a bit it doesn't recognize in the compat bitmask; however,
e2fsck will not touch a file system that has a compat bit which it
doesn't know how to deal with.

Strictly speaking, there is no such thing as an "ext3 file system".
There are just file systems with a set of feature bits set. And the
set of features supported by NetBSD or the Grub installer may not fall
neatly into the "ext2", "ext3", or "ext4" buckets.

If by "ext3 file system" you mean a file system with those featire
bits set by mke2fs -t ext3 by default (e.g., not overridden by the -O
option, or by the system administrator editing /etc/mke2fs.conf), then
yes, this patch series will not result in any substantive change for
"ext3 file systems", since it by default "mke2fs -t ext3" or
"mkfs.ext3" will not enable the feature flag that turns on
case-insensitive support.

For more information, see the ext4 and mke2fs.conf manual pages.

Cheers,

- Ted

2019-04-06 20:47:40

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH RFC v6 04/11] unicode: reduce the size of utf8data[]

On Mon, Mar 18, 2019 at 04:27:38PM -0400, Gabriel Krisman Bertazi wrote:
> From: Olaf Weber <[email protected]>
>
> Remove the Hangul decompositions from the utf8data trie, and do
> algorithmic decomposition to calculate them on the fly. To store
> the decomposition the caller of utf8lookup()/utf8nlookup() must
> provide a 12-byte buffer, which is used to synthesize a leaf with
> the decomposition. Trie size is reduced from 245kB to 90kB.

I'm seeing sizes much smaller; the actual utf8data[] array is 63,584.
And size utf8-norm.o reports:

text data bss dec hex filename
68752 96 0 68848 10cf0 fs/unicode/utf8-norm.o

Were you measuring the size of the utf8-norm.o file? That will vary
in size depending on whether debugging symbols are enabled, etc.

- Ted

2019-04-08 12:03:34

by Weber, Olaf (HPC Data Management & Storage)

[permalink] [raw]

Subject: RE: [PATCH RFC v6 04/11] unicode: reduce the size of utf8data[]

From: Theodore Ts'o
> On Mon, Mar 18, 2019 at 04:27:38PM -0400, Gabriel Krisman Bertazi wrote:
> > From: Olaf Weber <[email protected]>
> >
> > Remove the Hangul decompositions from the utf8data trie, and do
> > algorithmic decomposition to calculate them on the fly. To store
> > the decomposition the caller of utf8lookup()/utf8nlookup() must
> > provide a 12-byte buffer, which is used to synthesize a leaf with
> > the decomposition. Trie size is reduced from 245kB to 90kB.
>
> I'm seeing sizes much smaller; the actual utf8data[] array is 63,584.
> And size utf8-norm.o reports:
>
> text data bss dec hex filename
> 68752 96 0 68848 10cf0 fs/unicode/utf8-norm.o
>
> Were you measuring the size of the utf8-norm.o file? That will vary
> in size depending on whether debugging symbols are enabled, etc.
>
> - Ted

These numbers came from the size of the array reported in utf8data.h,
and were correct for the NFKDI + NFKDICF normalizations for Unicode 9.
The switch to NFDI + NFDICF reduced the size, and it looks like the commit
message was not updated to account for this.

Olaf