2018-12-06 22:04:41

by Gabriel Krisman Bertazi

Subject: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

Hi,

Following the e2fsprogs changes, these are the corresponding kernel-side
modifications to support the fname_encoding feature.

The patches are split into two parts. The first 14 patches are refactoring
and improvements to the NLS code, including the utf8 normalization
support. The final patches implement the fname_encoding feature in ext4.

To test this feature, you need to use the tip of the e2fsprogs branch, which
already includes support for enabling this feature.

As usual, the ucd files are not included in this email because they are
too large, and would actually cause the email message to bounce.

There are two test files for this in a private xfstests branch, which I
plan to submit upstream once this series is merged:

https://gitlab.collabora.com/krisman/xfstests.git -b encoding_v4

I also tested this with the xfstests smoke tests using two scenarios:
(1) a non-encoding TEST_DEV; (2) a utf8-enabled TEST_DEV. In both
cases, no unrelated regressions were observed. With my xfstests branch
above, which fixes some related tests, I didn't observe any
regressions.

Gabriel Krisman Bertazi (19):
nls: Wrap uni2char/char2uni callers
nls: Wrap charset field access
nls: Wrap charset hooks in ops structure
nls: Split default charset from NLS core
nls: Split struct nls_charset from struct nls_table
nls: Add support for multiple versions of an encoding
nls: Implement NLS_STRICT_MODE flag
nls: Let charsets define the behavior of tolower/toupper
nls: Add new interface for string comparisons
nls: Add optional normalization and casefold hooks
nls: ascii: Support validation and normalization operations
nls: utf8: Move nls-utf8{,-core}.c
nls: utf8: Integrate utf8 normalization code with utf8 charset
nls: utf8: Introduce test module for normalized utf8 implementation
ext4: Reserve superblock fields for encoding information
ext4: Include encoding information in the superblock
ext4: Support encoding-aware file name lookups
ext4: Implement EXT4_CASEFOLD_FL flag
docs: ext4.rst: Document encoding and case-insensitive

Olaf Weber (4):
nls: utf8: Add unicode character database files
scripts: add trie generator for UTF-8
nls: utf8: Introduce code for UTF-8 normalization
nls: utf8n: reduce the size of utf8data[]

Documentation/admin-guide/ext4.rst | 29 +
fs/befs/linuxvfs.c | 8 +-
fs/cifs/cifs_unicode.c | 15 +-
fs/cifs/cifsfs.c | 2 +-
fs/cifs/connect.c | 2 +-
fs/cifs/dir.c | 7 +-
fs/ext4/dir.c | 59 +
fs/ext4/ext4.h | 33 +-
fs/ext4/hash.c | 38 +-
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 2 +-
fs/ext4/inode.c | 4 +-
fs/ext4/ioctl.c | 18 +
fs/ext4/namei.c | 85 +-
fs/ext4/super.c | 83 +
fs/fat/dir.c | 13 +-
fs/fat/inode.c | 6 +-
fs/fat/namei_vfat.c | 6 +-
fs/hfs/super.c | 6 +-
fs/hfs/trans.c | 9 +-
fs/hfsplus/options.c | 2 +-
fs/hfsplus/unicode.c | 6 +-
fs/isofs/inode.c | 5 +-
fs/isofs/joliet.c | 3 +-
fs/jfs/jfs_unicode.c | 9 +-
fs/jfs/super.c | 3 +-
fs/nls/Kconfig | 15 +
fs/nls/Makefile | 20 +
fs/nls/mac-celtic.c | 34 +-
fs/nls/mac-centeuro.c | 34 +-
fs/nls/mac-croatian.c | 34 +-
fs/nls/mac-cyrillic.c | 34 +-
fs/nls/mac-gaelic.c | 34 +-
fs/nls/mac-greek.c | 34 +-
fs/nls/mac-iceland.c | 34 +-
fs/nls/mac-inuit.c | 34 +-
fs/nls/mac-roman.c | 34 +-
fs/nls/mac-romanian.c | 34 +-
fs/nls/mac-turkish.c | 34 +-
fs/nls/nls_ascii.c | 84 +-
fs/nls/nls_core.c | 163 ++
fs/nls/nls_cp1250.c | 34 +-
fs/nls/nls_cp1251.c | 34 +-
fs/nls/nls_cp1255.c | 36 +-
fs/nls/nls_cp437.c | 34 +-
fs/nls/nls_cp737.c | 34 +-
fs/nls/nls_cp775.c | 34 +-
fs/nls/nls_cp850.c | 34 +-
fs/nls/nls_cp852.c | 34 +-
fs/nls/nls_cp855.c | 34 +-
fs/nls/nls_cp857.c | 34 +-
fs/nls/nls_cp860.c | 34 +-
fs/nls/nls_cp861.c | 34 +-
fs/nls/nls_cp862.c | 34 +-
fs/nls/nls_cp863.c | 34 +-
fs/nls/nls_cp864.c | 34 +-
fs/nls/nls_cp865.c | 34 +-
fs/nls/nls_cp866.c | 34 +-
fs/nls/nls_cp869.c | 34 +-
fs/nls/nls_cp874.c | 36 +-
fs/nls/nls_cp932.c | 36 +-
fs/nls/nls_cp936.c | 36 +-
fs/nls/nls_cp949.c | 36 +-
fs/nls/nls_cp950.c | 36 +-
fs/nls/{nls_base.c => nls_default.c} | 124 +-
fs/nls/nls_euc-jp.c | 29 +-
fs/nls/nls_iso8859-1.c | 34 +-
fs/nls/nls_iso8859-13.c | 34 +-
fs/nls/nls_iso8859-14.c | 34 +-
fs/nls/nls_iso8859-15.c | 34 +-
fs/nls/nls_iso8859-2.c | 34 +-
fs/nls/nls_iso8859-3.c | 34 +-
fs/nls/nls_iso8859-4.c | 34 +-
fs/nls/nls_iso8859-5.c | 34 +-
fs/nls/nls_iso8859-6.c | 34 +-
fs/nls/nls_iso8859-7.c | 34 +-
fs/nls/nls_iso8859-9.c | 34 +-
fs/nls/nls_koi8-r.c | 34 +-
fs/nls/nls_koi8-ru.c | 30 +-
fs/nls/nls_koi8-u.c | 34 +-
fs/nls/nls_utf8-core.c | 328 +++
fs/nls/nls_utf8-norm.c | 797 ++++++
fs/nls/nls_utf8-selftest.c | 316 +++
fs/nls/nls_utf8.c | 67 -
fs/nls/ucd/README | 34 +
fs/nls/utf8n.h | 117 +
fs/ntfs/inode.c | 2 +-
fs/ntfs/super.c | 6 +-
fs/ntfs/unistr.c | 13 +-
fs/udf/super.c | 3 +-
fs/udf/unicode.c | 4 +-
include/linux/fs.h | 2 +
include/linux/nls.h | 293 ++-
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3392 ++++++++++++++++++++++++++
95 files changed, 7287 insertions(+), 618 deletions(-)
create mode 100644 fs/nls/nls_core.c
rename fs/nls/{nls_base.c => nls_default.c} (89%)
create mode 100644 fs/nls/nls_utf8-core.c
create mode 100644 fs/nls/nls_utf8-norm.c
create mode 100644 fs/nls/nls_utf8-selftest.c
delete mode 100644 fs/nls/nls_utf8.c
create mode 100644 fs/nls/ucd/README
create mode 100644 fs/nls/utf8n.h
create mode 100644 scripts/mkutf8data.c

--
2.20.0.rc2


2018-12-06 22:05:42

by Gabriel Krisman Bertazi

Subject: [PATCH v4 16/23] nls: utf8n: reduce the size of utf8data[]

From: Olaf Weber <[email protected]>

Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store
the decomposition the caller of utf8lookup()/utf8nlookup() must
provide a 12-byte buffer, which is used to synthesize a leaf with
the decomposition. Trie size is reduced from 245kB to 90kB.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix checkpatch errors]
[Extract robustness fixes and merge back to original mkutf8data.c
patch]
---
fs/nls/nls_utf8-norm.c | 191 +++++++++++++++++++++++---
fs/nls/utf8n.h | 4 +
scripts/mkutf8data.c | 298 ++++++++++++++++++++++++++++++++++++-----
3 files changed, 436 insertions(+), 57 deletions(-)

diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c
index ca0bbf644b49..64c3cc74a2ca 100644
--- a/fs/nls/nls_utf8-norm.c
+++ b/fs/nls/nls_utf8-norm.c
@@ -98,6 +98,38 @@ static inline int utf8clen(const char *s)
return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

+/*
+ * Decode a 3-byte UTF-8 sequence.
+ */
+static unsigned int
+utf8decode3(const char *str)
+{
+ unsigned int uc;
+
+ uc = *str++ & 0x0F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+ uc <<= 6;
+ uc |= *str++ & 0x3F;
+
+ return uc;
+}
+
+/*
+ * Encode a 3-byte UTF-8 sequence.
+ */
+static int
+utf8encode3(char *str, unsigned int val)
+{
+ str[2] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[1] = (val & 0x3F) | 0x80;
+ val >>= 6;
+ str[0] = val | 0xE0;
+
+ return 3;
+}
+
/*
* utf8trie_t
*
@@ -159,7 +191,8 @@ typedef const unsigned char utf8trie_t;
* characters with the Default_Ignorable_Code_Point property.
* These do affect normalization, as they all have CCC 0.
*
- * The decompositions in the trie have been fully expanded.
+ * The decompositions in the trie have been fully expanded, with the
+ * exception of Hangul syllables, which are decomposed algorithmically.
*
* Casefolding, if applicable, is also done using decompositions.
*
@@ -179,6 +212,105 @@ typedef const unsigned char utf8leaf_t;
#define STOPPER (0)
#define DECOMPOSE (255)

+/* Marker for hangul syllable decomposition. */
+#define HANGUL ((char)(255))
+/* Size of the synthesized leaf used for Hangul syllable decomposition. */
+#define UTF8HANGULLEAF (12)
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *
+utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode3(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode3((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode3((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -187,8 +319,8 @@ typedef const unsigned char utf8leaf_t;
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
- size_t len)
+static utf8leaf_t *utf8nlookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s, size_t len)
{
utf8trie_t *trie = utf8data + data->offset;
int offlen;
@@ -226,8 +358,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie++;
} else {
/* No right node. */
- node = 0;
- trie = NULL;
+ return NULL;
}
} else {
/* Left leg */
@@ -237,8 +368,7 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
trie += offlen + 1;
} else if (*trie & RIGHTPATH) {
/* No left node. */
- node = 0;
- trie = NULL;
+ return NULL;
} else {
/* Left node after this node */
node = (*trie & TRIENODE);
@@ -246,6 +376,14 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -255,9 +393,10 @@ static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
*
* Forwards to utf8nlookup().
*/
-static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+static utf8leaf_t *utf8lookup(const struct utf8data *data,
+ unsigned char *hangul, const char *s)
{
- return utf8nlookup(data, s, (size_t)-1);
+ return utf8nlookup(data, hangul, s, (size_t)-1);
}

/*
@@ -270,11 +409,13 @@ int utf8agemax(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;

@@ -297,12 +438,13 @@ int utf8agemin(const struct utf8data *data, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -323,11 +465,13 @@ int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
+
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -349,12 +493,13 @@ int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
age = data->maxage;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
leaf_age = utf8agetab[LEAF_GEN(leaf)];
@@ -377,11 +522,12 @@ ssize_t utf8len(const struct utf8data *data, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (*s) {
- leaf = utf8lookup(data, s);
+ leaf = utf8lookup(data, hangul, s);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -404,11 +550,12 @@ ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!data)
return -1;
while (len && *s) {
- leaf = utf8nlookup(data, s, len);
+ leaf = utf8nlookup(data, hangul, s, len);
if (!leaf)
return -1;
if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
@@ -531,10 +678,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->data, u8c->s);
- else
- leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->data, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -555,7 +704,9 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->data, u8c->s);
+
+ leaf = utf8lookup(u8c->data, u8c->hangul, u8c->s);
+ ccc = LEAF_CCC(leaf);
}

/*
diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h
index 0f5fc14d4fd2..f60827663503 100644
--- a/fs/nls/utf8n.h
+++ b/fs/nls/utf8n.h
@@ -76,6 +76,9 @@ extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
extern ssize_t utf8len(const struct utf8data *data, const char *s);
extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);

+/* Needed in struct utf8cursor below. */
+#define UTF8HANGULLEAF (12)
+
/*
* Cursor structure used by the normalizer.
*/
@@ -89,6 +92,7 @@ struct utf8cursor {
unsigned int slen;
short int ccc;
short int nccc;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
index 26794053d0d4..49bb0e16669b 100644
--- a/scripts/mkutf8data.c
+++ b/scripts/mkutf8data.c
@@ -180,10 +180,14 @@ typedef unsigned char utf8leaf_t;
#define MAXCCC (254)
#define STOPPER (0)
#define DECOMPOSE (255)
+#define HANGUL ((char)(255))
+
+#define UTF8HANGULLEAF (12)

struct tree;
-static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
-static utf8leaf_t *utf8lookup(struct tree *, const char *);
+static utf8leaf_t *utf8nlookup(struct tree *, unsigned char *,
+ const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, unsigned char *, const char *);

unsigned char *utf8data;
size_t utf8data_size;
@@ -331,6 +335,8 @@ static int utf32valid(unsigned int unichar)
return unichar < 0x110000;
}

+#define HANGUL_SYLLABLE(U) ((U) >= 0xAC00 && (U) <= 0xD7A3)
+
#define NODE 1
#define LEAF 0

@@ -461,7 +467,7 @@ static void tree_walk(struct tree *tree)
indent+1);
leaves += 1;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -855,7 +861,7 @@ static void mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
continue;
}
@@ -907,7 +913,7 @@ static void mark_nodes(struct tree *tree)
}
}
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
node = node->right;
if (!node->mark && node->parent->mark &&
!node->parent->left) {
@@ -990,7 +996,7 @@ static int index_nodes(struct tree *tree, int index)
index += tree->leaf_size(node->right);
count++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1011,6 +1017,25 @@ static int index_nodes(struct tree *tree, int index)
return index;
}

+/*
+ * Mark the nodes in a subtree, helper for size_nodes().
+ */
+static int mark_subtree(struct node *node)
+{
+ int changed;
+
+ if (!node || node->mark)
+ return 0;
+ node->mark = 1;
+ node->index = node->parent->index;
+ changed = 1;
+ if (node->leftnode == NODE)
+ changed += mark_subtree(node->left);
+ if (node->rightnode == NODE)
+ changed += mark_subtree(node->right);
+ return changed;
+}
+
/*
* Compute the size of nodes and leaves. We start by assuming that
* each node needs to store a three-byte offset. The indexes of the
@@ -1029,6 +1054,7 @@ static int size_nodes(struct tree *tree)
unsigned int bitmask;
unsigned int pathbits;
unsigned int pathmask;
+ unsigned int nbit;
int changed;
int offset;
int size;
@@ -1056,22 +1082,40 @@ static int size_nodes(struct tree *tree)
size = 1;
} else {
if (node->rightnode == NODE) {
+ /*
+ * If the right node is not marked,
+ * look for a corresponding node in
+ * the next tree. Such a node need
+ * not exist.
+ */
right = node->right;
next = tree->next;
while (!right->mark) {
assert(next);
n = next->root;
while (n->bitnum != node->bitnum) {
- if (pathbits & (1<<n->bitnum))
+ nbit = 1 << n->bitnum;
+ if (!(pathmask & nbit))
+ break;
+ if (pathbits & nbit) {
+ if (n->rightnode == LEAF)
+ break;
n = n->right;
- else
+ } else {
+ if (n->leftnode == LEAF)
+ break;
n = n->left;
+ }
}
+ if (n->bitnum != node->bitnum)
+ break;
n = n->right;
- assert(right->bitnum == n->bitnum);
right = n;
next = next->next;
}
+ /* Make sure the right node is marked. */
+ if (!right->mark)
+ changed += mark_subtree(right);
offset = right->index - node->index;
} else {
offset = *tree->leaf_index(tree, node->right);
@@ -1113,7 +1157,7 @@ static int size_nodes(struct tree *tree)
if (node->rightnode == LEAF) {
assert(node->right);
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1146,8 +1190,15 @@ static void emit(struct tree *tree, unsigned char *data)
int offset;
int index;
int indent;
+ int size;
+ int bytes;
+ int leaves;
+ int nodes[4];
unsigned char byte;

+ nodes[0] = nodes[1] = nodes[2] = nodes[3] = 0;
+ leaves = 0;
+ bytes = 0;
index = tree->index;
data += index;
indent = 1;
@@ -1156,7 +1207,10 @@ static void emit(struct tree *tree, unsigned char *data)
if (tree->childnode == LEAF) {
assert(tree->root);
tree->leaf_emit(tree->root, data);
- return;
+ size = tree->leaf_size(tree->root);
+ index += size;
+ leaves++;
+ goto done;
}

assert(tree->childnode == NODE);
@@ -1183,6 +1237,7 @@ static void emit(struct tree *tree, unsigned char *data)
offlen = 2;
else
offlen = 3;
+ nodes[offlen]++;
offset = node->offset;
byte |= offlen << OFFLEN_SHIFT;
*data++ = byte;
@@ -1195,12 +1250,14 @@ static void emit(struct tree *tree, unsigned char *data)
} else if (node->left) {
if (node->leftnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else if (node->right) {
byte |= RIGHTNODE;
if (node->rightnode == NODE)
byte |= TRIENODE;
+ nodes[0]++;
*data++ = byte;
index++;
} else {
@@ -1215,7 +1272,10 @@ static void emit(struct tree *tree, unsigned char *data)
assert(node->left);
data = tree->leaf_emit(node->left,
data);
- index += tree->leaf_size(node->left);
+ size = tree->leaf_size(node->left);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->left) {
assert(node->leftnode == NODE);
indent += 1;
@@ -1229,9 +1289,12 @@ static void emit(struct tree *tree, unsigned char *data)
assert(node->right);
data = tree->leaf_emit(node->right,
data);
- index += tree->leaf_size(node->right);
+ size = tree->leaf_size(node->right);
+ index += size;
+ bytes += size;
+ leaves++;
} else if (node->right) {
- assert(node->rightnode==NODE);
+ assert(node->rightnode == NODE);
indent += 1;
node = node->right;
break;
@@ -1243,6 +1306,15 @@ static void emit(struct tree *tree, unsigned char *data)
indent -= 1;
}
}
+done:
+ if (verbose > 0) {
+ printf("Emitted %d (%d) leaves",
+ leaves, bytes);
+ printf(" %d (%d+%d+%d+%d) nodes",
+ nodes[0] + nodes[1] + nodes[2] + nodes[3],
+ nodes[0], nodes[1], nodes[2], nodes[3]);
+ printf(" %d total\n", index - tree->index);
+ }
}

/* ------------------------------------------------------------------ */
@@ -1344,7 +1416,9 @@ static void nfkdi_print(void *l, int indent)

printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
leaf->code, leaf->ccc, leaf->gen);
- if (leaf->utf8nfkdi)
+ if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
+ else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
}
@@ -1357,6 +1431,8 @@ static void nfkdicf_print(void *l, int indent)
leaf->code, leaf->ccc, leaf->gen);
if (leaf->utf8nfkdicf)
printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi && leaf->utf8nfkdi[0] == HANGUL)
+ printf(" nfkdi \"%s\"", "HANGUL SYLLABLE");
else if (leaf->utf8nfkdi)
printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
printf("\n");
@@ -1388,7 +1464,9 @@ static int nfkdi_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdi)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
return size;
}
@@ -1398,7 +1476,9 @@ static int nfkdicf_size(void *l)
struct unicode_data *leaf = l;

int size = 2;
- if (leaf->utf8nfkdicf)
+ if (HANGUL_SYLLABLE(leaf->code))
+ size += 1;
+ else if (leaf->utf8nfkdicf)
size += strlen(leaf->utf8nfkdicf) + 1;
else if (leaf->utf8nfkdi)
size += strlen(leaf->utf8nfkdi) + 1;
@@ -1425,7 +1505,10 @@ static unsigned char *nfkdi_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdi) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdi) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdi;
while ((*data++ = *s++) != 0)
@@ -1442,7 +1525,10 @@ static unsigned char *nfkdicf_emit(void *l, unsigned char *data)
unsigned char *s;

*data++ = leaf->gen;
- if (leaf->utf8nfkdicf) {
+ if (HANGUL_SYLLABLE(leaf->code)) {
+ *data++ = DECOMPOSE;
+ *data++ = HANGUL;
+ } else if (leaf->utf8nfkdicf) {
*data++ = DECOMPOSE;
s = (unsigned char*)leaf->utf8nfkdicf;
while ((*data++ = *s++) != 0)
@@ -1465,6 +1551,11 @@ static void utf8_create(struct unicode_data *data)
unsigned int *um;
int i;

+ if (data->utf8nfkdi) {
+ assert(data->utf8nfkdi[0] == HANGUL);
+ return;
+ }
+
u = utf;
um = data->utf32nfkdi;
if (um) {
@@ -1650,6 +1741,7 @@ static void verify(struct tree *tree)
utf8leaf_t *leaf;
unsigned int unichar;
char key[4];
+ unsigned char hangul[UTF8HANGULLEAF];
int report;
int nocf;

@@ -1663,7 +1755,8 @@ static void verify(struct tree *tree)
if (data->correction <= tree->maxage)
data = &unicode_data[unichar];
utf8encode(key,unichar);
- leaf = utf8lookup(tree, key);
+ leaf = utf8lookup(tree, hangul, key);
+
if (!leaf) {
if (data->gen != -1)
report++;
@@ -1677,7 +1770,10 @@ static void verify(struct tree *tree)
if (data->gen != LEAF_GEN(leaf))
report++;
if (LEAF_CCC(leaf) == DECOMPOSE) {
- if (nocf) {
+ if (HANGUL_SYLLABLE(data->code)) {
+ if (data->utf8nfkdi[0] != HANGUL)
+ report++;
+ } else if (nocf) {
if (!data->utf8nfkdi) {
report++;
} else if (strcmp(data->utf8nfkdi,
@@ -2302,8 +2398,7 @@ static void corrections_init(void)
*
*/

-static void
-hangul_decompose(void)
+static void hangul_decompose(void)
{
unsigned int sb = 0xAC00;
unsigned int lb = 0x1100;
@@ -2347,6 +2442,15 @@ hangul_decompose(void)
memcpy(um, mapping, i * sizeof(unsigned int));
unicode_data[unichar].utf32nfkdicf = um;

+ /*
+ * Add a cookie as a reminder that the hangul syllable
+ * decompositions must not be stored in the generated
+ * trie.
+ */
+ unicode_data[unichar].utf8nfkdi = malloc(2);
+ unicode_data[unichar].utf8nfkdi[0] = HANGUL;
+ unicode_data[unichar].utf8nfkdi[1] = '\0';
+
if (verbose > 1)
print_utf32nfkdi(unichar);

@@ -2472,6 +2576,99 @@ int utf8cursor(struct utf8cursor *, struct tree *, const char *);
int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
int utf8byte(struct utf8cursor *);

+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (Sindex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (Sindex % NCount) / TCount
+ * TIndex = (Sindex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ */
+
+/* Constants */
+#define SB (0xAC00)
+#define LB (0x1100)
+#define VB (0x1161)
+#define TB (0x11A7)
+#define LC (19)
+#define VC (21)
+#define TC (28)
+#define NC (VC * TC)
+#define SC (LC * NC)
+
+/* Algorithmic decomposition of hangul syllable. */
+static utf8leaf_t *utf8hangul(const char *str, unsigned char *hangul)
+{
+ unsigned int si;
+ unsigned int li;
+ unsigned int vi;
+ unsigned int ti;
+ unsigned char *h;
+
+ /* Calculate the SI, LI, VI, and TI values. */
+ si = utf8decode(str) - SB;
+ li = si / NC;
+ vi = (si % NC) / TC;
+ ti = si % TC;
+
+ /* Fill in base of leaf. */
+ h = hangul;
+ LEAF_GEN(h) = 2;
+ LEAF_CCC(h) = DECOMPOSE;
+ h += 2;
+
+ /* Add LPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, li + LB);
+
+ /* Add VPart, a 3-byte UTF-8 sequence. */
+ h += utf8encode((char *)h, vi + VB);
+
+ /* Add TPart if required, also a 3-byte UTF-8 sequence. */
+ if (ti)
+ h += utf8encode((char *)h, ti + TB);
+
+ /* Terminate string. */
+ h[0] = '\0';
+
+ return hangul;
+}
+
/*
* Use trie to scan s, touching at most len bytes.
* Returns the leaf if one exists, NULL otherwise.
@@ -2480,7 +2677,8 @@ int utf8byte(struct utf8cursor *);
* is well-formed and corresponds to a known unicode code point. The
* shorthand for this will be "is valid UTF-8 unicode".
*/
-static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
+static utf8leaf_t *utf8nlookup(struct tree *tree, unsigned char *hangul,
+ const char *s, size_t len)
{
utf8trie_t *trie = utf8data + tree->index;
int offlen;
@@ -2536,6 +2734,14 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
}
}
}
+ /*
+ * Hangul decomposition is done algorithmically. These are the
+ * codepoints >= 0xAC00 and <= 0xD7A3. Their UTF-8 encoding is
+ * always 3 bytes long, so s has been advanced twice, and the
+ * start of the sequence is at s-2.
+ */
+ if (LEAF_CCC(trie) == DECOMPOSE && LEAF_STR(trie)[0] == HANGUL)
+ trie = utf8hangul(s - 2, hangul);
return trie;
}

@@ -2545,9 +2751,10 @@ static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
*
* Forwards to trie_nlookup().
*/
-static utf8leaf_t *utf8lookup(struct tree *tree, const char *s)
+static utf8leaf_t *utf8lookup(struct tree *tree, unsigned char *hangul,
+ const char *s)
{
- return utf8nlookup(tree, s, (size_t)-1);
+ return utf8nlookup(tree, hangul, s, (size_t)-1);
}

/*
@@ -2571,11 +2778,14 @@ int utf8agemax(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2595,12 +2805,14 @@ int utf8agemin(struct tree *tree, const char *s)
utf8leaf_t *leaf;
int age;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2619,11 +2831,14 @@ int utf8nagemax(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int age = 0;
int leaf_age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
+
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age > age)
@@ -2643,12 +2858,14 @@ int utf8nagemin(struct tree *tree, const char *s, size_t len)
utf8leaf_t *leaf;
int leaf_age;
int age;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
age = tree->maxage;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
leaf_age = ages[LEAF_GEN(leaf)];
if (leaf_age <= tree->maxage && leaf_age < age)
@@ -2669,11 +2886,13 @@ ssize_t utf8len(struct tree *tree, const char *s)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (*s) {
- if (!(leaf = utf8lookup(tree, s)))
+ leaf = utf8lookup(tree, hangul, s);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2694,11 +2913,13 @@ ssize_t utf8nlen(struct tree *tree, const char *s, size_t len)
{
utf8leaf_t *leaf;
size_t ret = 0;
+ unsigned char hangul[UTF8HANGULLEAF];

if (!tree)
return -1;
while (len && *s) {
- if (!(leaf = utf8nlookup(tree, s, len)))
+ leaf = utf8nlookup(tree, hangul, s, len);
+ if (!leaf)
return -1;
if (ages[LEAF_GEN(leaf)] > tree->maxage)
ret += utf8clen(s);
@@ -2726,6 +2947,7 @@ struct utf8cursor {
short int ccc;
short int nccc;
unsigned int unichar;
+ unsigned char hangul[UTF8HANGULLEAF];
};

/*
@@ -2833,10 +3055,12 @@ int utf8byte(struct utf8cursor *u8c)
}

/* Look up the data for the current character. */
- if (u8c->p)
- leaf = utf8lookup(u8c->tree, u8c->s);
- else
- leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+ if (u8c->p) {
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
+ } else {
+ leaf = utf8nlookup(u8c->tree, u8c->hangul,
+ u8c->s, u8c->len);
+ }

/* No leaf found implies that the input is a binary blob. */
if (!leaf)
@@ -2856,7 +3080,7 @@ int utf8byte(struct utf8cursor *u8c)
ccc = STOPPER;
goto ccc_mismatch;
}
- leaf = utf8lookup(u8c->tree, u8c->s);
+ leaf = utf8lookup(u8c->tree, u8c->hangul, u8c->s);
ccc = LEAF_CCC(leaf);
}
u8c->unichar = utf8decode(u8c->s);
--
2.20.0.rc2

2018-12-06 22:05:25

by Gabriel Krisman Bertazi

Subject: [PATCH v4 12/23] nls: utf8: Add unicode character database files

From: Olaf Weber <[email protected]>

Add files from the Unicode Character Database, version 11.0, to the
source. A helper program that generates a trie used for normalization
from these files is part of a separate commit.

- Notes on the update from 8.0.0 to 11.0.0:

The structure of the ucd files and their special cases did not change
between versions 8.0.0 and 11.0.0. 8.0.0 saw the addition of Cherokee LC
characters, which is an interesting case for case-folding. The update is
accompanied by new tests in the test_ucd module to catch specific cases.
No changes to the mkutf8data script were required for the update.

The actual files are not part of the commit submitted to the list
because they are too big and would bounce. Still, they can be obtained
with the following script:

FILES="CaseFolding.txt DerivedAge.txt extracted/DerivedCombiningClass.txt
DerivedCoreProperties.txt NormalizationCorrections.txt
NormalizationTest.txt UnicodeData.txt"
VERSION=11.0.0
BASE=http://www.unicode.org/Public/${VERSION}/ucd

for i in ${FILES} ; do
wget "${BASE}/$i" -O fs/nls/ucd/$(basename ${i} .txt)-${VERSION}.txt
done

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Move ucd directory to fs/nls/]
[Update to Unicode 11.0.0]
---
fs/nls/ucd/README | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
create mode 100644 fs/nls/ucd/README

diff --git a/fs/nls/ucd/README b/fs/nls/ucd/README
new file mode 100644
index 000000000000..553ce7e4c224
--- /dev/null
+++ b/fs/nls/ucd/README
@@ -0,0 +1,34 @@
+The files in this directory are part of the Unicode Character Database
+for version 11.0.0 of the Unicode standard.
+
+The full set of files can be found here:
+
+ http://www.unicode.org/Public/11.0.0/ucd/
+
+The latest released version of the UCD can be found here:
+
+ http://www.unicode.org/Public/UCD/latest/
+
+The files in this directory are identical, except that they have been
+renamed with a suffix indicating the unicode version.
+
+Individual source links:
+
+ http://www.unicode.org/Public/11.0.0/ucd/CaseFolding.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedAge.txt
+ http://www.unicode.org/Public/11.0.0/ucd/extracted/DerivedCombiningClass.txt
+ http://www.unicode.org/Public/11.0.0/ucd/DerivedCoreProperties.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationCorrections.txt
+ http://www.unicode.org/Public/11.0.0/ucd/NormalizationTest.txt
+ http://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt
+
+md5sums
+
+ 414436796cf097df55f798e1585448ee CaseFolding-11.0.0.txt
+ 6032a595fbb782694456491d86eecfac DerivedAge-11.0.0.txt
+ 3240997d671297ac754ab0d27577acf7 DerivedCombiningClass-11.0.0.txt
+ d41d8cd98f00b204e9800998ecf8427e DerivedCombiningClass.txt
+ 2a4fe257d9d8184518e036194d2248ec DerivedCoreProperties-11.0.0.txt
+ 4e7d383fa0dd3cd9d49d64e5b7b7c9e0 NormalizationCorrections-11.0.0.txt
+ c9500c5b8b88e584469f056023ecc3f2 NormalizationTest-11.0.0.txt
+ acc291106c3758d2025f8d7bd5518bee UnicodeData-11.0.0.txt
--
2.20.0.rc2

2018-12-08 23:00:07

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds
<[email protected]> wrote:
>
> I'm hoping you are at least doing it per-directory. That makes at
> least the "oh, the whole filesystem needs to do this wrong" issue a
> bit less bad.

So for example, if you do it per-directory, the rules could be something like:

- new directories (ie "mkdir()") inherit the icase/folding semantics
of the parent directory

- empty directories can have their case/folding rules changed with
some well-defined interface

and even from just those simple rules, now some icase behavior could
be useful to testing.
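
As a rough illustration of those two rules, here is a minimal userspace
model. It is only a sketch: struct dir, dir_mkdir() and dir_set_casefold()
are hypothetical names invented for this example, and none of them is an
existing kernel or ext4 interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical in-memory model of a directory; not a kernel structure. */
struct dir {
	bool casefold;        /* icase/folding semantics used for lookups */
	unsigned int entries; /* number of entries currently in the directory */
};

/* Rule 1: a new directory inherits the folding semantics of its parent. */
static void dir_mkdir(struct dir *parent, struct dir *child)
{
	assert(parent && child);
	child->casefold = parent->casefold;
	child->entries = 0;
	parent->entries++;
}

/* Rule 2: the semantics may only be changed while the directory is empty. */
static int dir_set_casefold(struct dir *d, bool casefold)
{
	if (d->entries)
		return -ENOTEMPTY;
	d->casefold = casefold;
	return 0;
}
```

With this model, a test directory can be switched to icase right after it
is created, every subdirectory made below it picks up the same behavior,
and a later attempt to flip the flag on a populated directory is refused.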

Not just filesystem testing (although that would be a thing - think
fsstress), but for doing app development in a test directory.

Apps like git (and GNU fileutils) could use it for having test suites
for FAT etc filesystems.

And cross-platform apps could use it as a "I want to check that I do
the right thing" if you do development on Linux, but might have a
portable app for other platforms.

If the whole filesystem is that way, nobody is going to do it. Sure,
they could do it on a FAT filesystem using a USB disk, but nobody
really does that. But if you can trivially just run your tests in a
test subdirectory, it's another thing entirely.

So this is the kind of thing I mean when I think icase behavior for a
major Linux filesystem should have a real _design_. It's really quite
fundamentally different from the "oh, I need FAT to be icase" hack
that we have now.

(We might also be able to make the dcache better at handling
well-defined icase/folding rules, as opposed to the current "just give
up, let the filesystem hash it" behavior).

Linus

2018-12-06 23:09:39

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

Dave Chinner <[email protected]> writes:

> On Thu, Dec 06, 2018 at 05:04:06PM -0500, Gabriel Krisman Bertazi wrote:
>> Hi,
>>
>> Following the e2fsprogs changes, these are the corresponding kernel-side
>> modifications to support the fname_encoding feature.
>>
>> The patches are split in two parts. The first 14 patches are refactoring
>> and improvements to the NLS code, including the utf8 normalization
>> support. The final patches implement the fname_encoding feature in ext4.
>
> Please repost this all to [email protected]. You're
> changing a significant amount of non-ext4 filesystem code, as well
> as adding core filesystem infrastructure so it needs to have wider
> visibility and review than just the ext4 list.

Thanks. I've submitted it again with a cc to linux-fsdevel.

--
Gabriel Krisman Bertazi

2018-12-09 00:46:22

by Andreas Dilger

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Dec 8, 2018, at 3:59 PM, Linus Torvalds <[email protected]> wrote:
>
> On Sat, Dec 8, 2018 at 1:58 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> I'm hoping you are at least doing it per-directory. That makes at
>> least the "oh, the whole filesystem needs to do this wrong" issue a
>> bit less bad.
>
> So for example, if you do it per-directory, the rules could be something like:
>
> - new directories (ie "mkdir()") inherit the icase/folding semantics
> of the parent directory
>
> - empty directories can have their case/folding rules changed with
> some well-defined interface
>
> and even from just those simple rules, now some icase behavior could
> be useful to testing.
>
> Not just filesystem testing (although that would be a thing - think
> fsstress), but for doing app development in a test directory.
>
> Apps like git (and GNU fileutils) could use it for having test suites
> for FAT etc filesystems.
>
> And cross-platform apps could use it as a "I want to check that I do
> the right thing" if you do development on Linux, but might have a
> portable app for other platforms.
>
> If the whole filesystem is that way, nobody is going to do it. Sure,
> they could do it on a FAT filesystem using a USB disk, but nobody
> really does that. But if you can trivially just run your tests in a
> test subdirectory, it's another thing entirely.
>
> So this is the kind of thing I mean when I think icase behavior for a
> major Linux filesystem should have a real _design_. It's really quite
> fundamentally different from the "oh, I need FAT to be icase" hack
> that we have now.
>
> (We might also be able to make the dcache better at handling
> well-defined icase/folding rules, as opposed to the current "just give
> up, let the filesystem hash it" behavior).

In theory, we could store the encoding on a per-entry basis if we
wanted, using the dir_data feature (this would consume 2-3 bytes per
entry, depending on how rich an encoding type we wanted). The tricky
part is how does the kernel know what the filename encoding is? How
do we communicate the encoding type back to userspace?

Cheers, Andreas

2018-12-06 22:05:45

by Gabriel Krisman Bertazi

Subject: [PATCH v4 17/23] nls: utf8: Integrate utf8 normalization code with utf8 charset

From: Gabriel Krisman Bertazi <[email protected]>

This patch integrates the utf8n patches with the NLS utf8 charset by
implementing the nls_ops operations and the nls_charset table.
Normalization is done with NFKD, and casefolding uses the NFKD+CF
algorithm, as implemented by Olaf Weber and SGI. The high-level strncmp
and strncasecmp operations are implemented on top of the same utf8 code.

Utf-8 with normalization is exposed as optional on top of the existing
utf8 charset, and disabled by default, to avoid changing the behavior of
existing nls_utf8 users. To enable normalization, the specific
normalization type must be set at load_table() time.
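
The comparison semantics can be sketched in userspace roughly as follows.
This is not the code from this patch: sketch_strncasecmp() and
seq_is_valid() are hypothetical names, ASCII tolower() stands in for the
NFKD+CF trie walk, and "any byte >= 0x80 is invalid" stands in for real
UTF-8 validation. Only the return convention and the strict-mode versus
binary-blob fallback are meant to mirror the behavior described here.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in validity check (assumption): only 7-bit bytes are "valid". */
static bool seq_is_valid(const unsigned char *s, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (s[i] >= 0x80)
			return false;
	return true;
}

/*
 * Returns 0 if the names match, 1 if they differ, and -EINVAL if either
 * name is invalid and strict mode is requested.
 */
static int sketch_strncasecmp(bool strict,
			      const unsigned char *s1, size_t len1,
			      const unsigned char *s2, size_t len2)
{
	assert(s1 && s2);

	if (!seq_is_valid(s1, len1) || !seq_is_valid(s2, len2)) {
		if (strict)
			return -EINVAL;
		/* Non-strict: treat both names as opaque binary blobs. */
		if (len1 != len2)
			return 1;
		return !!memcmp(s1, s2, len1);
	}

	/* ASCII folding never changes the length, unlike real NFKD+CF. */
	if (len1 != len2)
		return 1;
	for (size_t i = 0; i < len1; i++)
		if (tolower(s1[i]) != tolower(s2[i]))
			return 1;
	return 0;
}
```

Under these assumptions, "README" and "readme" compare equal, while a name
containing an invalid byte either fails with -EINVAL (strict) or only
matches a byte-for-byte identical name (non-strict).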

Changes since RFC v2:
- Integrate with NLS
- Merge utf8n with nls_utf8.

Changes since RFC v1:
- Change error return code from EIO to EINVAL. (Olaf Weber)
- Fix issues with strncmp/strcmp. (Olaf Weber)
- Remove stack buffer in normalization/casefold. (Olaf Weber)
- Include length parameter for second string on comparison functions.
- Change length type to size_t.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/nls_utf8-core.c | 269 ++++++++++++++++++++++++++++++++++++++---
fs/nls/nls_utf8-norm.c | 6 +
fs/nls/utf8n.h | 1 +
include/linux/nls.h | 8 ++
4 files changed, 270 insertions(+), 14 deletions(-)

diff --git a/fs/nls/nls_utf8-core.c b/fs/nls/nls_utf8-core.c
index fe1ac5efaa37..1b7320bd9c34 100644
--- a/fs/nls/nls_utf8-core.c
+++ b/fs/nls/nls_utf8-core.c
@@ -6,10 +6,15 @@
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/parser.h>
#include <linux/nls.h>
#include <linux/errno.h>

+#include "utf8n.h"
+
static unsigned char identity[256];
+static struct nls_charset utf8_info;

static int uni2char(wchar_t uni, unsigned char *out, int boundlen)
{
@@ -50,22 +55,257 @@ static unsigned char charset_toupper(const struct nls_table *table,
return identity[c];
}

-static const struct nls_ops charset_ops = {
- .lowercase = charset_toupper,
- .uppercase = charset_tolower,
- .uni2char = uni2char,
- .char2uni = char2uni,
-};
+#ifdef CONFIG_NLS_UTF8_NORMALIZATION
+
+static int utf8_validate(const struct nls_table *charset,
+ const unsigned char *str, size_t len)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+
+ if (utf8nlen(data, str, len) < 0)
+ return -1;
+ return 0;
+}
+
+static int utf8_strncmp(const struct nls_table *charset,
+ const unsigned char *str1, size_t len1,
+ const unsigned char *str2, size_t len2)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur1, cur2;
+ int c1, c2;
+
+ if (utf8ncursor(&cur1, data, str1, len1) < 0)
+ goto invalid_seq;
+
+ if (utf8ncursor(&cur2, data, str2, len2) < 0)
+ goto invalid_seq;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ goto invalid_seq;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+
+invalid_seq:
+ if (IS_STRICT_MODE(charset))
+ return -EINVAL;
+
+ /* Treat the sequence as a binary blob. */
+ if (len1 != len2)
+ return 1;
+
+ return !!memcmp(str1, str2, len1);
+}
+
+static int utf8_strncasecmp(const struct nls_table *charset,
+ const unsigned char *str1, size_t len1,
+ const unsigned char *str2, size_t len2)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur1, cur2;
+ int c1, c2;
+
+ if (utf8ncursor(&cur1, data, str1, len1) < 0)
+ goto invalid_seq;
+
+ if (utf8ncursor(&cur2, data, str2, len2) < 0)
+ goto invalid_seq;
+
+ do {
+ c1 = utf8byte(&cur1);
+ c2 = utf8byte(&cur2);
+
+ if (c1 < 0 || c2 < 0)
+ goto invalid_seq;
+ if (c1 != c2)
+ return 1;
+ } while (c1);
+
+ return 0;
+
+invalid_seq:
+ if (IS_STRICT_MODE(charset))
+ return -EINVAL;
+
+ /* Treat the sequence as a binary blob. */
+ if (len1 != len2)
+ return 1;
+
+ return !!memcmp(str1, str2, len1);
+}
+
+static int utf8_casefold_nfkdcf(const struct nls_table *charset,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen)
+{
+ const struct utf8data *data = utf8nfkdicf(charset->version);
+ struct utf8cursor cur;
+ size_t nlen = 0;
+
+ if (utf8ncursor(&cur, data, str, len) < 0)
+ goto invalid_seq;
+
+ for (nlen = 0; nlen < dlen; nlen++) {
+ /* Keep the result in an int: dest[] is unsigned and can't hold -1. */
+ int c = utf8byte(&cur);
+
+ dest[nlen] = c;
+ if (!c)
+ return nlen;
+ if (c == -1)
+ break;
+ }
+
+invalid_seq:
+ if (IS_STRICT_MODE(charset))
+ return -EINVAL;
+
+ /* Treat the sequence as a binary blob. */
+ memcpy(dest, str, len);
+ return len;
+}
+
+static int utf8_normalize_nfkd(const struct nls_table *charset,
+ const unsigned char *str,
+ size_t len, unsigned char *dest, size_t dlen)
+{
+ const struct utf8data *data = utf8nfkdi(charset->version);
+ struct utf8cursor cur;
+ ssize_t nlen = 0;
+
+ if (utf8ncursor(&cur, data, str, len) < 0)
+ goto invalid_seq;

-static struct nls_charset nls_charset;
-static struct nls_table table = {
- .charset = &nls_charset,
- .ops = &charset_ops,
+ for (nlen = 0; nlen < dlen; nlen++) {
+ /* Keep the result in an int: dest[] is unsigned and can't hold -1. */
+ int c = utf8byte(&cur);
+
+ dest[nlen] = c;
+ if (!c)
+ return nlen;
+ if (c == -1)
+ break;
+ }
+
+invalid_seq:
+ if (IS_STRICT_MODE(charset))
+ return -EINVAL;
+
+ /* Treat the sequence as a binary blob. */
+ memcpy(dest, str, len);
+ return len;
+}
+
+static int utf8_parse_version(const char *version, unsigned int *maj,
+ unsigned int *min, unsigned int *rev)
+{
+ substring_t args[3];
+ char version_string[12];
+ const struct match_token token[] = {
+ {1, "%d.%d.%d"},
+ {0, NULL}
+ };
+
+ strscpy(version_string, version, sizeof(version_string));
+
+ if (match_token(version_string, token, args) != 1)
+ return -EINVAL;
+
+ if (match_int(&args[0], maj) || match_int(&args[1], min) ||
+ match_int(&args[2], rev))
+ return -EINVAL;
+
+ return 0;
+}
+#endif
+
+struct utf8_table {
+ struct nls_table tbl;
+ struct nls_ops ops;
};

-static struct nls_charset nls_charset = {
+static void utf8_set_ops(struct utf8_table *utbl)
+{
+ utbl->ops.lowercase = charset_tolower;
+ utbl->ops.uppercase = charset_toupper;
+ utbl->ops.uni2char = uni2char;
+ utbl->ops.char2uni = char2uni;
+
+#ifdef CONFIG_NLS_UTF8_NORMALIZATION
+ utbl->ops.validate = utf8_validate;
+
+ if (IS_NORMALIZATION_TYPE_UTF8_NFKD(&utbl->tbl)) {
+ utbl->ops.normalize = utf8_normalize_nfkd;
+ utbl->ops.strncmp = utf8_strncmp;
+ }
+
+ if (IS_CASEFOLD_TYPE_UTF8_NFKDCF(&utbl->tbl)) {
+ utbl->ops.casefold = utf8_casefold_nfkdcf;
+ utbl->ops.strncasecmp = utf8_strncasecmp;
+ }
+#endif
+
+ utbl->tbl.ops = &utbl->ops;
+}
+
+static struct nls_table *utf8_load_table(const char *version, unsigned int flags)
+{
+ struct utf8_table *utbl = NULL;
+ unsigned int nls_version;
+
+#ifdef CONFIG_NLS_UTF8_NORMALIZATION
+ if (version) {
+ unsigned int maj, min, rev;
+
+ if (utf8_parse_version(version, &maj, &min, &rev) < 0)
+ return ERR_PTR(-EINVAL);
+
+ if (!utf8version_is_supported(maj, min, rev))
+ return ERR_PTR(-EINVAL);
+
+ nls_version = UNICODE_AGE(maj, min, rev);
+ } else {
+ nls_version = utf8version_latest();
+ printk(KERN_WARNING "UTF-8 version not specified. "
+ "Assuming latest supported version (%u.%u.%u).\n",
+ (nls_version >> 16) & 0xff, (nls_version >> 8) & 0xff,
+ (nls_version & 0xff));
+ }
+#else
+ nls_version = 0;
+#endif
+
+ utbl = kzalloc(sizeof(struct utf8_table), GFP_KERNEL);
+ if (!utbl)
+ return ERR_PTR(-ENOMEM);
+
+ utbl->tbl.charset = &utf8_info;
+ utbl->tbl.version = nls_version;
+ utbl->tbl.flags = flags;
+ utf8_set_ops(utbl);
+
+ utbl->tbl.next = utf8_info.tables;
+ utf8_info.tables = &utbl->tbl;
+
+ return &utbl->tbl;
+}
+
+static void utf8_cleanup_tables(void)
+{
+ struct nls_table *tmp, *tbl = utf8_info.tables;
+
+ while (tbl) {
+ tmp = tbl;
+ tbl = tbl->next;
+ kfree(tmp);
+ }
+ utf8_info.tables = NULL;
+}
+
+static struct nls_charset utf8_info = {
.charset = "utf8",
- .tables = &table,
+ .load_table = utf8_load_table,
};

static int __init init_nls_utf8(void)
@@ -74,12 +314,13 @@ static int __init init_nls_utf8(void)
for (i=0; i<256; i++)
identity[i] = i;

- return register_nls(&nls_charset);
+ return register_nls(&utf8_info);
}

static void __exit exit_nls_utf8(void)
{
- unregister_nls(&nls_charset);
+ unregister_nls(&utf8_info);
+ utf8_cleanup_tables();
}

module_init(init_nls_utf8)
diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c
index 64c3cc74a2ca..abee8b376a87 100644
--- a/fs/nls/nls_utf8-norm.c
+++ b/fs/nls/nls_utf8-norm.c
@@ -38,6 +38,12 @@ int utf8version_is_supported(u8 maj, u8 min, u8 rev)
}
EXPORT_SYMBOL(utf8version_is_supported);

+int utf8version_latest(void)
+{
+ return utf8vers;
+}
+EXPORT_SYMBOL(utf8version_latest);
+
/*
* UTF-8 valid ranges.
*
diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h
index f60827663503..b4697f9bfbab 100644
--- a/fs/nls/utf8n.h
+++ b/fs/nls/utf8n.h
@@ -32,6 +32,7 @@

/* Highest unicode version supported by the data tables. */
extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
+extern int utf8version_latest(void);

/*
* Look for the correct const struct utf8data for a unicode version.
diff --git a/include/linux/nls.h b/include/linux/nls.h
index aab60d4858ee..aee5cbfc07c6 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -186,6 +186,14 @@ NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER)
NLS_CASEFOLD_FUNCS(ASCII, TOUPPER, NLS_ASCII_CASEFOLD_TOUPPER)
NLS_CASEFOLD_FUNCS(ASCII, TOLOWER, NLS_ASCII_CASEFOLD_TOLOWER)

+/* UTF-8 */
+
+#define NLS_UTF8_NORMALIZATION_TYPE_NFKD NLS_NORMALIZATION_TYPE(1)
+#define NLS_UTF8_CASEFOLD_TYPE_NFKDCF NLS_CASEFOLD_TYPE(1)
+
+NLS_NORMALIZATION_FUNCS(UTF8, NFKD, NLS_UTF8_NORMALIZATION_TYPE_NFKD)
+NLS_CASEFOLD_FUNCS(UTF8, NFKDCF, NLS_UTF8_CASEFOLD_TYPE_NFKDCF)
+
/* nls_base.c */
extern int __register_nls(struct nls_charset *, struct module *);
extern int unregister_nls(struct nls_charset *);
--
2.20.0.rc2

2018-12-06 22:05:03

by Gabriel Krisman Bertazi

Subject: [PATCH v4 05/23] nls: Split struct nls_charset from struct nls_table

From: Gabriel Krisman Bertazi <[email protected]>

In order to support multiple versions of the same encoding, we want to
split the charset registering data, which enrolls it with the NLS
subsystem and is valid for all versions of the encoding, from the
version specific data, that is actually going to be returned when the
caller loads an NLS charset.
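
A simplified userspace sketch of the resulting layout is shown below. The
sketch_* names and the reduced set of fields are assumptions made for
illustration; the real definitions are the struct nls_charset and struct
nls_table changes to include/linux/nls.h in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_charset;

/* Version-specific data: this is what a caller gets back from a load. */
struct sketch_table {
	const struct sketch_charset *charset; /* back-pointer to shared data */
	unsigned int version;
	struct sketch_table *next;            /* next version, same charset */
};

/* Registration data: shared by every version of one encoding. */
struct sketch_charset {
	const char *name;
	struct sketch_table *tables;          /* list of available versions */
	struct sketch_charset *next;          /* next registered charset */
};

static struct sketch_charset *registered;

/* Enroll a charset; only this structure is known to the registry. */
static void sketch_register(struct sketch_charset *cs)
{
	assert(cs && cs->name);
	cs->next = registered;
	registered = cs;
}

/* Look up a charset by name and hand back its first (default) table. */
static struct sketch_table *sketch_load(const char *name)
{
	struct sketch_charset *cs;

	for (cs = registered; cs; cs = cs->next)
		if (!strcmp(cs->name, name))
			return cs->tables;
	return NULL;
}
```

The point of the split is that registration and lookup only touch the
charset, while everything a filesystem actually uses hangs off the table,
which leaves room for several tables per charset later in the series.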

With the exception of the following files (the core declarations and the
default charset), which were edited by hand, all other files were
generated with the Coccinelle patch below.

Files edited by hand:
- fs/nls/nls_core.c
- include/linux/nls.h
- fs/nls/nls_default.c

<smpl>
@nlstable@
identifier p;
expression charset_str;
@@

static struct nls_table p = {
- .charset = charset_str,
};

@createops@
identifier nlstable.p;
expression nlstable.charset_str;

@@

+static struct nls_charset nls_charset;

static struct nls_table p = {
+ .charset = &nls_charset,
};
+
+static struct nls_charset nls_charset = {
+ .charset = charset_str,
+ .tables = &p,
+};

@@
expression A;
@@
? return
- register_nls(A);
+ register_nls(&nls_charset);

@@
expression A;
@@

- unregister_nls(A);
+ unregister_nls(&nls_charset);

@mvalias@
identifier p;
expression alias_str;
@@
static struct nls_table p = {
- .alias = alias_str,
};

@@
expression mvalias.alias_str;
@@
static struct nls_charset nls_charset = {
+ .alias = alias_str,
};

</smpl>

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/mac-celtic.c | 12 ++++++++---
fs/nls/mac-centeuro.c | 12 ++++++++---
fs/nls/mac-croatian.c | 12 ++++++++---
fs/nls/mac-cyrillic.c | 12 ++++++++---
fs/nls/mac-gaelic.c | 12 ++++++++---
fs/nls/mac-greek.c | 12 ++++++++---
fs/nls/mac-iceland.c | 12 ++++++++---
fs/nls/mac-inuit.c | 12 ++++++++---
fs/nls/mac-roman.c | 12 ++++++++---
fs/nls/mac-romanian.c | 12 ++++++++---
fs/nls/mac-turkish.c | 12 ++++++++---
fs/nls/nls_ascii.c | 12 ++++++++---
fs/nls/nls_core.c | 48 +++++++++++++++++++++++++++--------------
fs/nls/nls_cp1250.c | 12 ++++++++---
fs/nls/nls_cp1251.c | 12 ++++++++---
fs/nls/nls_cp1255.c | 14 ++++++++----
fs/nls/nls_cp437.c | 12 ++++++++---
fs/nls/nls_cp737.c | 12 ++++++++---
fs/nls/nls_cp775.c | 12 ++++++++---
fs/nls/nls_cp850.c | 12 ++++++++---
fs/nls/nls_cp852.c | 12 ++++++++---
fs/nls/nls_cp855.c | 12 ++++++++---
fs/nls/nls_cp857.c | 12 ++++++++---
fs/nls/nls_cp860.c | 12 ++++++++---
fs/nls/nls_cp861.c | 12 ++++++++---
fs/nls/nls_cp862.c | 12 ++++++++---
fs/nls/nls_cp863.c | 12 ++++++++---
fs/nls/nls_cp864.c | 12 ++++++++---
fs/nls/nls_cp865.c | 12 ++++++++---
fs/nls/nls_cp866.c | 12 ++++++++---
fs/nls/nls_cp869.c | 12 ++++++++---
fs/nls/nls_cp874.c | 14 ++++++++----
fs/nls/nls_cp932.c | 14 ++++++++----
fs/nls/nls_cp936.c | 14 ++++++++----
fs/nls/nls_cp949.c | 14 ++++++++----
fs/nls/nls_cp950.c | 14 ++++++++----
fs/nls/nls_default.c | 9 ++++++--
fs/nls/nls_euc-jp.c | 12 ++++++++---
fs/nls/nls_iso8859-1.c | 12 ++++++++---
fs/nls/nls_iso8859-13.c | 12 ++++++++---
fs/nls/nls_iso8859-14.c | 12 ++++++++---
fs/nls/nls_iso8859-15.c | 12 ++++++++---
fs/nls/nls_iso8859-2.c | 12 ++++++++---
fs/nls/nls_iso8859-3.c | 12 ++++++++---
fs/nls/nls_iso8859-4.c | 12 ++++++++---
fs/nls/nls_iso8859-5.c | 12 ++++++++---
fs/nls/nls_iso8859-6.c | 12 ++++++++---
fs/nls/nls_iso8859-7.c | 12 ++++++++---
fs/nls/nls_iso8859-9.c | 12 ++++++++---
fs/nls/nls_koi8-r.c | 12 ++++++++---
fs/nls/nls_koi8-ru.c | 12 ++++++++---
fs/nls/nls_koi8-u.c | 12 ++++++++---
fs/nls/nls_utf8.c | 12 ++++++++---
include/linux/nls.h | 18 ++++++++++------
54 files changed, 516 insertions(+), 183 deletions(-)

diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c
index 1b59b04f26f2..4fe7347c55d6 100644
--- a/fs/nls/mac-celtic.c
+++ b/fs/nls/mac-celtic.c
@@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macceltic",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macceltic",
+ .tables = &table,
+};
+
static int __init init_nls_macceltic(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macceltic(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macceltic)
diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c
index d5b8f38f97b6..2d115aae4240 100644
--- a/fs/nls/mac-centeuro.c
+++ b/fs/nls/mac-centeuro.c
@@ -512,21 +512,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "maccenteuro",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "maccenteuro",
+ .tables = &table,
+};
+
static int __init init_nls_maccenteuro(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_maccenteuro(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_maccenteuro)
diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c
index 32de6accd526..b496b85fcde1 100644
--- a/fs/nls/mac-croatian.c
+++ b/fs/nls/mac-croatian.c
@@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "maccroatian",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "maccroatian",
+ .tables = &table,
+};
+
static int __init init_nls_maccroatian(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_maccroatian(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_maccroatian)
diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c
index 34d5c1c05ff1..18c9e0eb8e58 100644
--- a/fs/nls/mac-cyrillic.c
+++ b/fs/nls/mac-cyrillic.c
@@ -477,21 +477,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "maccyrillic",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "maccyrillic",
+ .tables = &table,
+};
+
static int __init init_nls_maccyrillic(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_maccyrillic(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_maccyrillic)
diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c
index 2aabf5213176..8f8d6ae20f02 100644
--- a/fs/nls/mac-gaelic.c
+++ b/fs/nls/mac-gaelic.c
@@ -547,21 +547,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macgaelic",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macgaelic",
+ .tables = &table,
+};
+
static int __init init_nls_macgaelic(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macgaelic(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macgaelic)
diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c
index df62909ef57e..0e2c12fe3447 100644
--- a/fs/nls/mac-greek.c
+++ b/fs/nls/mac-greek.c
@@ -477,21 +477,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macgreek",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macgreek",
+ .tables = &table,
+};
+
static int __init init_nls_macgreek(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macgreek(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macgreek)
diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c
index 8daa68b995bc..414767fa47a4 100644
--- a/fs/nls/mac-iceland.c
+++ b/fs/nls/mac-iceland.c
@@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "maciceland",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "maciceland",
+ .tables = &table,
+};
+
static int __init init_nls_maciceland(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_maciceland(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_maciceland)
diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c
index b0799693502a..0e06fd3a0c8f 100644
--- a/fs/nls/mac-inuit.c
+++ b/fs/nls/mac-inuit.c
@@ -512,21 +512,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macinuit",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macinuit",
+ .tables = &table,
+};
+
static int __init init_nls_macinuit(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macinuit(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macinuit)
diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c
index ba358b864b05..fcfd387cfaa8 100644
--- a/fs/nls/mac-roman.c
+++ b/fs/nls/mac-roman.c
@@ -617,21 +617,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macroman",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macroman",
+ .tables = &table,
+};
+
static int __init init_nls_macroman(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macroman(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macroman)
diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c
index 7a8a7f9a0bbc..74027022a135 100644
--- a/fs/nls/mac-romanian.c
+++ b/fs/nls/mac-romanian.c
@@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macromanian",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macromanian",
+ .tables = &table,
+};
+
static int __init init_nls_macromanian(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macromanian(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macromanian)
diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c
index eb3c5e53ec88..0edc0f8b1f4d 100644
--- a/fs/nls/mac-turkish.c
+++ b/fs/nls/mac-turkish.c
@@ -582,21 +582,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "macturkish",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "macturkish",
+ .tables = &table,
+};
+
static int __init init_nls_macturkish(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_macturkish(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_macturkish)
diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c
index 6bad3e779284..3c3ee908d1ed 100644
--- a/fs/nls/nls_ascii.c
+++ b/fs/nls/nls_ascii.c
@@ -147,21 +147,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "ascii",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "ascii",
+ .tables = &table,
+};
+
static int __init init_nls_ascii(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_ascii(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_ascii)
diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
index 3f7de8f4c5b2..200a7f8165e6 100644
--- a/fs/nls/nls_core.c
+++ b/fs/nls/nls_core.c
@@ -16,13 +16,18 @@
#include <linux/kmod.h>
#include <linux/spinlock.h>

-static struct nls_table default_table;
-static struct nls_table *tables = &default_table;
+extern struct nls_charset default_charset;
+static struct nls_charset *charsets = &default_charset;
static DEFINE_SPINLOCK(nls_lock);
+static struct nls_table *nls_load_table(struct nls_charset *charset)
+{
+ /* For now, return the default table, which is the first one found. */
+ return charset->tables;
+}

-int __register_nls(struct nls_table *nls, struct module *owner)
+int __register_nls(struct nls_charset *nls, struct module *owner)
{
- struct nls_table ** tmp = &tables;
+ struct nls_charset **tmp = &charsets;

if (nls->next)
return -EBUSY;
@@ -36,16 +41,16 @@ int __register_nls(struct nls_table *nls, struct module *owner)
}
tmp = &(*tmp)->next;
}
- nls->next = tables;
- tables = nls;
+ nls->next = charsets;
+ charsets = nls;
spin_unlock(&nls_lock);
return 0;
}
EXPORT_SYMBOL(__register_nls);

-int unregister_nls(struct nls_table * nls)
+int unregister_nls(struct nls_charset * nls)
{
- struct nls_table ** tmp = &tables;
+ struct nls_charset **tmp = &charsets;

spin_lock(&nls_lock);
while (*tmp) {
@@ -60,31 +65,42 @@ int unregister_nls(struct nls_table * nls)
return -EINVAL;
}

-static struct nls_table *find_nls(char *charset)
+static struct nls_charset *find_nls(const char *charset)
{
- struct nls_table *nls;
+ struct nls_charset *nls;
spin_lock(&nls_lock);
- for (nls = tables; nls; nls = nls->next) {
- if (!strcmp(nls_charset_name(nls), charset))
+ for (nls = charsets; nls; nls = nls->next) {
+ if (!strcmp(nls->charset, charset))
break;
if (nls->alias && !strcmp(nls->alias, charset))
break;
}
- if (nls && !try_module_get(nls->owner))
- nls = NULL;
+
+ if (!nls)
+ nls = ERR_PTR(-EINVAL);
+ else if (!try_module_get(nls->owner))
+ nls = ERR_PTR(-EBUSY);
+
spin_unlock(&nls_lock);
return nls;
}

struct nls_table *load_nls(char *charset)
{
- return try_then_request_module(find_nls(charset), "nls_%s", charset);
+ struct nls_charset *nls_charset;
+
+ nls_charset = try_then_request_module(find_nls(charset),
+ "nls_%s", charset);
+ if (IS_ERR(nls_charset))
+ return NULL;
+
+ return nls_load_table(nls_charset);
}

void unload_nls(struct nls_table *nls)
{
if (nls)
- module_put(nls->owner);
+ module_put(nls->charset->owner);
}

EXPORT_SYMBOL(unregister_nls);
diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c
index 08902e86fc8e..080717694405 100644
--- a/fs/nls/nls_cp1250.c
+++ b/fs/nls/nls_cp1250.c
@@ -328,20 +328,26 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp1250",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp1250",
+ .tables = &table,
+};
+
static int __init init_nls_cp1250(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}
static void __exit exit_nls_cp1250(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp1250)
diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c
index 2bb88c8cc5bf..2fba498ab289 100644
--- a/fs/nls/nls_cp1251.c
+++ b/fs/nls/nls_cp1251.c
@@ -282,21 +282,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp1251",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp1251",
+ .tables = &table,
+};
+
static int __init init_nls_cp1251(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp1251(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp1251)
diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c
index c6bf8d575c5b..c268e8d8c038 100644
--- a/fs/nls/nls_cp1255.c
+++ b/fs/nls/nls_cp1255.c
@@ -363,22 +363,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp1255",
- .alias = "iso8859-8",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "iso8859-8",
+ .charset = "cp1255",
+ .tables = &table,
+};
+
static int __init init_nls_cp1255(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp1255(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp1255)
diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c
index 0f3f8bdbb62b..f24f8691e720 100644
--- a/fs/nls/nls_cp437.c
+++ b/fs/nls/nls_cp437.c
@@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp437",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp437",
+ .tables = &table,
+};
+
static int __init init_nls_cp437(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp437(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp437)
diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c
index 9383359ca25f..f5a8b9e88165 100644
--- a/fs/nls/nls_cp737.c
+++ b/fs/nls/nls_cp737.c
@@ -331,21 +331,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp737",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp737",
+ .tables = &table,
+};
+
static int __init init_nls_cp737(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp737(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp737)
diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c
index 6c787b9079ed..d268bfb873e4 100644
--- a/fs/nls/nls_cp775.c
+++ b/fs/nls/nls_cp775.c
@@ -300,21 +300,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp775",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp775",
+ .tables = &table,
+};
+
static int __init init_nls_cp775(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp775(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp775)
diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c
index 50a57138a571..b698b0df65e3 100644
--- a/fs/nls/nls_cp850.c
+++ b/fs/nls/nls_cp850.c
@@ -296,21 +296,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp850",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp850",
+ .tables = &table,
+};
+
static int __init init_nls_cp850(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp850(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp850)
diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c
index 0cbb199f1cd5..738e95346b34 100644
--- a/fs/nls/nls_cp852.c
+++ b/fs/nls/nls_cp852.c
@@ -318,21 +318,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp852",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp852",
+ .tables = &table,
+};
+
static int __init init_nls_cp852(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp852(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp852)
diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c
index 530b77c86363..9a1c4e307cb1 100644
--- a/fs/nls/nls_cp855.c
+++ b/fs/nls/nls_cp855.c
@@ -280,21 +280,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp855",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp855",
+ .tables = &table,
+};
+
static int __init init_nls_cp855(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp855(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp855)
diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c
index 0db642ec6f45..782e31cb9f5a 100644
--- a/fs/nls/nls_cp857.c
+++ b/fs/nls/nls_cp857.c
@@ -282,21 +282,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp857",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp857",
+ .tables = &table,
+};
+
static int __init init_nls_cp857(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp857(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp857)
diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c
index 44a40dac26bd..2ad1954b84e6 100644
--- a/fs/nls/nls_cp860.c
+++ b/fs/nls/nls_cp860.c
@@ -345,21 +345,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp860",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp860",
+ .tables = &table,
+};
+
static int __init init_nls_cp860(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp860(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp860)
diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c
index 50e08174fc48..5930b0e6e8f1 100644
--- a/fs/nls/nls_cp861.c
+++ b/fs/nls/nls_cp861.c
@@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp861",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp861",
+ .tables = &table,
+};
+
static int __init init_nls_cp861(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp861(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp861)
diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c
index 3505f3437972..63c27b24a011 100644
--- a/fs/nls/nls_cp862.c
+++ b/fs/nls/nls_cp862.c
@@ -402,21 +402,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp862",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp862",
+ .tables = &table,
+};
+
static int __init init_nls_cp862(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp862(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp862)
diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c
index e3489cdc0c04..aa815cdc7481 100644
--- a/fs/nls/nls_cp863.c
+++ b/fs/nls/nls_cp863.c
@@ -362,21 +362,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp863",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp863",
+ .tables = &table,
+};
+
static int __init init_nls_cp863(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp863(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp863)
diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c
index d4185bc7f1bf..a20725f661e9 100644
--- a/fs/nls/nls_cp864.c
+++ b/fs/nls/nls_cp864.c
@@ -388,21 +388,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp864",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp864",
+ .tables = &table,
+};
+
static int __init init_nls_cp864(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp864(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp864)
diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c
index 9f468944e577..3d22ec2bd7af 100644
--- a/fs/nls/nls_cp865.c
+++ b/fs/nls/nls_cp865.c
@@ -368,21 +368,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp865",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp865",
+ .tables = &table,
+};
+
static int __init init_nls_cp865(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp865(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp865)
diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c
index ee46fd5a76b1..35dc7b2f023a 100644
--- a/fs/nls/nls_cp866.c
+++ b/fs/nls/nls_cp866.c
@@ -286,21 +286,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp866",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp866",
+ .tables = &table,
+};
+
static int __init init_nls_cp866(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp866(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp866)
diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c
index da29a4a53e1d..56504ab0f405 100644
--- a/fs/nls/nls_cp869.c
+++ b/fs/nls/nls_cp869.c
@@ -296,21 +296,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp869",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "cp869",
+ .tables = &table,
+};
+
static int __init init_nls_cp869(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp869(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp869)
diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c
index 642659b9ed89..41394620d000 100644
--- a/fs/nls/nls_cp874.c
+++ b/fs/nls/nls_cp874.c
@@ -254,22 +254,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp874",
- .alias = "tis-620",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "tis-620",
+ .charset = "cp874",
+ .tables = &table,
+};
+
static int __init init_nls_cp874(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp874(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp874)
diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c
index 3e7bdefdca90..25fe26fb2603 100644
--- a/fs/nls/nls_cp932.c
+++ b/fs/nls/nls_cp932.c
@@ -7912,22 +7912,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp932",
- .alias = "sjis",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "sjis",
+ .charset = "cp932",
+ .tables = &table,
+};
+
static int __init init_nls_cp932(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp932(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp932)
diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c
index b1fa2918992b..766f86b53a7b 100644
--- a/fs/nls/nls_cp936.c
+++ b/fs/nls/nls_cp936.c
@@ -11090,22 +11090,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp936",
- .alias = "gb2312",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "gb2312",
+ .charset = "cp936",
+ .tables = &table,
+};
+
static int __init init_nls_cp936(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp936(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp936)
diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c
index 1d334095d86c..138eec74bb3f 100644
--- a/fs/nls/nls_cp949.c
+++ b/fs/nls/nls_cp949.c
@@ -13925,22 +13925,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp949",
- .alias = "euc-kr",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "euc-kr",
+ .charset = "cp949",
+ .tables = &table,
+};
+
static int __init init_nls_cp949(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp949(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp949)
diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c
index d936160a48f9..899da09fe0d7 100644
--- a/fs/nls/nls_cp950.c
+++ b/fs/nls/nls_cp950.c
@@ -9461,22 +9461,28 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "cp950",
- .alias = "big5",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .alias = "big5",
+ .charset = "cp950",
+ .tables = &table,
+};
+
static int __init init_nls_cp950(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_cp950(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_cp950)
diff --git a/fs/nls/nls_default.c b/fs/nls/nls_default.c
index c5d7e8391b22..ef8c0efb8a3c 100644
--- a/fs/nls/nls_default.c
+++ b/fs/nls/nls_default.c
@@ -17,7 +17,7 @@
#include <asm/byteorder.h>
#include <linux/nls.h>

-static struct nls_table default_table;
+struct nls_charset default_charset;

struct utf8_table {
int cmask;
@@ -453,12 +453,17 @@ static const struct nls_ops charset_ops = {
};

static struct nls_table default_table = {
- .charset = "default",
+ .charset = &default_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+struct nls_charset default_charset = {
+ .charset = "default",
+ .tables = &default_table,
+};
+
/* Returns a simple default translation table */
struct nls_table *load_nls_default(void)
{
diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c
index 0af73982738b..8bc5d9991452 100644
--- a/fs/nls/nls_euc-jp.c
+++ b/fs/nls/nls_euc-jp.c
@@ -554,11 +554,17 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "euc-jp",
+ .charset = &nls_charset,
.ops = &charset_ops,
};

+static struct nls_charset nls_charset = {
+ .charset = "euc-jp",
+ .tables = &table,
+};
+
static int __init init_nls_euc_jp(void)
{
p_nls = load_nls("cp932");
@@ -566,7 +572,7 @@ static int __init init_nls_euc_jp(void)
if (p_nls) {
table.charset2upper = p_nls->charset2upper;
table.charset2lower = p_nls->charset2lower;
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

return -EINVAL;
@@ -574,7 +580,7 @@ static int __init init_nls_euc_jp(void)

static void __exit exit_nls_euc_jp(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
unload_nls(p_nls);
}

diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c
index 6212b2925fa0..78e9c0169f69 100644
--- a/fs/nls/nls_iso8859-1.c
+++ b/fs/nls/nls_iso8859-1.c
@@ -238,21 +238,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-1",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-1",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_1(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_1(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_1)
diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c
index 8f0a23109207..eb8665629e0f 100644
--- a/fs/nls/nls_iso8859-13.c
+++ b/fs/nls/nls_iso8859-13.c
@@ -266,21 +266,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-13",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-13",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_13(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_13(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_13)
diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c
index 80ab77f37480..c8d5a48f869c 100644
--- a/fs/nls/nls_iso8859-14.c
+++ b/fs/nls/nls_iso8859-14.c
@@ -322,21 +322,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-14",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-14",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_14(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_14(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_14)
diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c
index 5c02f93e7b20..0611c6cb56b4 100644
--- a/fs/nls/nls_iso8859-15.c
+++ b/fs/nls/nls_iso8859-15.c
@@ -288,21 +288,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-15",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-15",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_15(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_15(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_15)
diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c
index 97afc1233da1..5255d92a25eb 100644
--- a/fs/nls/nls_iso8859-2.c
+++ b/fs/nls/nls_iso8859-2.c
@@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-2",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-2",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_2(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_2(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_2)
diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c
index f835fcec3aae..ad1b84f3e102 100644
--- a/fs/nls/nls_iso8859-3.c
+++ b/fs/nls/nls_iso8859-3.c
@@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-3",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-3",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_3(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_3(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_3)
diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c
index 14acb68fb013..82469deee0ba 100644
--- a/fs/nls/nls_iso8859-4.c
+++ b/fs/nls/nls_iso8859-4.c
@@ -289,21 +289,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-4",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-4",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_4(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_4(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_4)
diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c
index f559bbb25045..3f3cd0c28797 100644
--- a/fs/nls/nls_iso8859-5.c
+++ b/fs/nls/nls_iso8859-5.c
@@ -253,21 +253,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-5",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-5",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_5(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_5(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_5)
diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c
index e3d7e28363b8..43e6675998bc 100644
--- a/fs/nls/nls_iso8859-6.c
+++ b/fs/nls/nls_iso8859-6.c
@@ -244,21 +244,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-6",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-6",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_6(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_6(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_6)
diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c
index 49fd2b24e492..83893e487f82 100644
--- a/fs/nls/nls_iso8859-7.c
+++ b/fs/nls/nls_iso8859-7.c
@@ -298,21 +298,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-7",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-7",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_7(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_7(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_7)
diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c
index 876696f89626..df03f97cd9d1 100644
--- a/fs/nls/nls_iso8859-9.c
+++ b/fs/nls/nls_iso8859-9.c
@@ -253,21 +253,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "iso8859-9",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "iso8859-9",
+ .tables = &table,
+};
+
static int __init init_nls_iso8859_9(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_iso8859_9(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_iso8859_9)
diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c
index 6a85211402a8..22918e154dbe 100644
--- a/fs/nls/nls_koi8-r.c
+++ b/fs/nls/nls_koi8-r.c
@@ -304,21 +304,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "koi8-r",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "koi8-r",
+ .tables = &table,
+};
+
static int __init init_nls_koi8_r(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_koi8_r(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_koi8_r)
diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c
index c4e382fd0f13..f4edbc313706 100644
--- a/fs/nls/nls_koi8-ru.c
+++ b/fs/nls/nls_koi8-ru.c
@@ -56,11 +56,17 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "koi8-ru",
+ .charset = &nls_charset,
.ops = &charset_ops,
};

+static struct nls_charset nls_charset = {
+ .charset = "koi8-ru",
+ .tables = &table,
+};
+
static int __init init_nls_koi8_ru(void)
{
p_nls = load_nls("koi8-u");
@@ -68,7 +74,7 @@ static int __init init_nls_koi8_ru(void)
if (p_nls) {
table.charset2upper = p_nls->charset2upper;
table.charset2lower = p_nls->charset2lower;
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

return -EINVAL;
@@ -76,7 +82,7 @@ static int __init init_nls_koi8_ru(void)

static void __exit exit_nls_koi8_ru(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
unload_nls(p_nls);
}

diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c
index 5f91e9cdb165..b2421625e98b 100644
--- a/fs/nls/nls_koi8-u.c
+++ b/fs/nls/nls_koi8-u.c
@@ -311,21 +311,27 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "koi8-u",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};

+static struct nls_charset nls_charset = {
+ .charset = "koi8-u",
+ .tables = &table,
+};
+
static int __init init_nls_koi8_u(void)
{
- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_koi8_u(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_koi8_u)
diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c
index 6988fffd5cf6..aecf460827ac 100644
--- a/fs/nls/nls_utf8.c
+++ b/fs/nls/nls_utf8.c
@@ -45,25 +45,31 @@ static const struct nls_ops charset_ops = {
.char2uni = char2uni,
};

+static struct nls_charset nls_charset;
static struct nls_table table = {
- .charset = "utf8",
+ .charset = &nls_charset,
.ops = &charset_ops,
.charset2lower = identity, /* no conversion */
.charset2upper = identity,
};

+static struct nls_charset nls_charset = {
+ .charset = "utf8",
+ .tables = &table,
+};
+
static int __init init_nls_utf8(void)
{
int i;
for (i=0; i<256; i++)
identity[i] = i;

- return register_nls(&table);
+ return register_nls(&nls_charset);
}

static void __exit exit_nls_utf8(void)
{
- unregister_nls(&table);
+ unregister_nls(&nls_charset);
}

module_init(init_nls_utf8)
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 5d63fe6aa55e..cdc95cd9e5d4 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -29,15 +29,21 @@ struct nls_ops {
};

struct nls_table {
- const char *charset;
- const char *alias;
+ const struct nls_charset *charset;
const struct nls_ops *ops;
const unsigned char *charset2lower;
const unsigned char *charset2upper;
- struct module *owner;
struct nls_table *next;
};

+struct nls_charset {
+ const char *charset;
+ const char *alias;
+ struct module *owner;
+ struct nls_table *tables;
+ struct nls_charset *next;
+};
+
/* this value hold the maximum octet of charset */
#define NLS_MAX_CHARSET_SIZE 6 /* for UTF-8 */

@@ -49,8 +55,8 @@ enum utf16_endian {
};

/* nls_base.c */
-extern int __register_nls(struct nls_table *, struct module *);
-extern int unregister_nls(struct nls_table *);
+extern int __register_nls(struct nls_charset *, struct module *);
+extern int unregister_nls(struct nls_charset *);
extern struct nls_table *load_nls(char *);
extern void unload_nls(struct nls_table *);
extern struct nls_table *load_nls_default(void);
@@ -78,7 +84,7 @@ static inline int nls_char2uni(const struct nls_table *table,

static inline const char *nls_charset_name(const struct nls_table *table)
{
- return table->charset;
+ return table->charset->charset;
}

static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c)
--
2.20.0.rc2

2018-12-06 22:05:55

by Gabriel Krisman Bertazi

Subject: [PATCH v4 20/23] ext4: Include encoding information in the superblock

From: Gabriel Krisman Bertazi <[email protected]>

Support for encoding is considered an incompatible feature, since it has
potential to create collisions of file names in existing filesystems.
If the feature flag is not enabled, the entire filesystem will operate
on opaque byte sequences, respecting the original behavior.

The charset data is encoded in a new field in the superblock using a
magic number specific to ext4. This is the easiest way I found to avoid
writing the name of the charset in the superblock. The magic number is
mapped to the exact NLS table, but the mapping is specific to ext4.
Since we don't have any commitment to support old encodings, the only
encodings I am supporting right now are utf8-11.0 and ascii, both
using the NLS abstraction.
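
The magic-number mapping described above can be sketched in plain userspace C. This is only an illustration, assuming stand-in names (`ENC_ASCII`, `ENC_UTF8_11_0`, `struct sb_encoding`, `lookup_encoding`) that are hypothetical and not the identifiers used by the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the magic numbers stored in the superblock. */
enum { ENC_ASCII = 0, ENC_UTF8_11_0 = 1 };

struct sb_encoding {
	uint16_t magic;      /* value read from the superblock field */
	const char *name;    /* charset name handed to the NLS layer */
	const char *version; /* NULL when the charset is unversioned */
};

static const struct sb_encoding encoding_map[] = {
	{ ENC_ASCII, "ascii", NULL },
	{ ENC_UTF8_11_0, "utf8", "11.0.0" },
};

/* Return the map entry for a magic number, or NULL if it is unknown. */
static const struct sb_encoding *lookup_encoding(uint16_t magic)
{
	size_t i;

	for (i = 0; i < sizeof(encoding_map) / sizeof(encoding_map[0]); i++)
		if (encoding_map[i].magic == magic)
			return &encoding_map[i];
	return NULL;
}
```

An unknown magic yields NULL here, which corresponds to refusing the mount rather than guessing a charset.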

The current implementation prevents the user from enabling encoding and
per-directory encryption on the same filesystem at the same time. The
incompatibility lies in directory lookups: because of normalization, we
cannot be sure that the encrypted form of the user-provided fname will
match the hash stored on disk, so an efficient search is not possible
without decrypting every directory entry. My quickest solution is to
simply block the concurrent use of these features for now, and enable
it later, once we have a better solution.

Changes since v2:
- Split superblock bitfield reservation into another patch.
- Rename s_ioencoding -> s_encoding
- Remove encoding_info from in-memory superblock.

Changes since v1:
- Guard code with CONFIG_NLS.
- Use 16 bits for s_ioencoding.
- Split mount option from this patch

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 7 +++++
fs/ext4/super.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 84 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 52c9e8b948a0..c21717a19106 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1338,6 +1338,9 @@ struct ext4_super_block {
/* Number of quota types we support */
#define EXT4_MAXQUOTAS 3

+#define EXT4_ENC_ASCII 0
+#define EXT4_ENC_UTF8_11_0 1
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1387,6 +1390,10 @@ struct ext4_sb_info {
struct kobject s_kobj;
struct completion s_kobj_unregister;
struct super_block *s_sb;
+#ifdef CONFIG_NLS
+ struct nls_table *s_encoding;
+ __u16 s_encoding_flags;
+#endif

/* Journaling */
struct journal_s *s_journal;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 53ff6c2a26ed..e64a9ed2ca12 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -42,6 +42,7 @@
#include <linux/cleancache.h>
#include <linux/uaccess.h>
#include <linux/iversion.h>
+#include <linux/nls.h>

#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1022,6 +1023,9 @@ static void ext4_put_super(struct super_block *sb)
crypto_free_shash(sbi->s_chksum_driver);
kfree(sbi->s_blockgroup_lock);
fs_put_dax(sbi->s_daxdev);
+#ifdef CONFIG_NLS
+ unload_nls(sbi->s_encoding);
+#endif
kfree(sbi);
}

@@ -1716,6 +1720,37 @@ static const struct mount_opts {
{Opt_err, 0, 0}
};

+#ifdef CONFIG_NLS
+static const struct ext4_sb_encodings {
+ __u16 magic;
+ char *name;
+ char *version;
+} ext4_sb_encoding_map[] = {
+ {EXT4_ENC_ASCII, "ascii", NULL},
+ {EXT4_ENC_UTF8_11_0, "utf8", "11.0.0"},
+};
+
+static int ext4_sb_read_encoding(const struct ext4_super_block *es,
+ const struct ext4_sb_encodings **encoding,
+ __u16 *flags)
+{
+ __u16 magic = le16_to_cpu(es->s_encoding);
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(ext4_sb_encoding_map); i++)
+ if (magic == ext4_sb_encoding_map[i].magic)
+ break;
+
+ if (i >= ARRAY_SIZE(ext4_sb_encoding_map))
+ return -EINVAL;
+
+ *encoding = &ext4_sb_encoding_map[i];
+ *flags = le16_to_cpu(es->s_encoding_flags);
+
+ return 0;
+}
+#endif
+
static int handle_mount_opt(struct super_block *sb, char *opt, int token,
substring_t *args, unsigned long *journal_devnum,
unsigned int *journal_ioprio, int is_remount)
@@ -3534,6 +3569,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
int err = 0;
unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
ext4_group_t first_not_zeroed;
+#ifdef CONFIG_NLS
+ struct nls_table *encoding;
+ const struct ext4_sb_encodings *encoding_info;
+ __u16 nls_flags;
+#endif

if ((data && !orig_data) || !sbi)
goto out_free_base;
@@ -3706,6 +3746,38 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
&journal_ioprio, 0))
goto failed_mount;

+#ifdef CONFIG_NLS
+ if (ext4_has_feature_fname_encoding(sb) && !sbi->s_encoding) {
+ if (ext4_has_feature_encrypt(sb)) {
+ ext4_msg(sb, KERN_ERR,
+ "Can't mount with both encoding and encryption");
+ goto failed_mount;
+ }
+
+ if (ext4_sb_read_encoding(es, &encoding_info, &nls_flags)) {
+ ext4_msg(sb, KERN_ERR,
+ "Encoding requested by superblock is unknown");
+ goto failed_mount;
+ }
+
+ encoding = load_nls_version(encoding_info->name,
+ encoding_info->version, nls_flags);
+ if (IS_ERR(encoding)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with superblock charset: "
+ "%s-%s not supported by the kernel. flags: 0x%x",
+ encoding_info->name, encoding_info->version,
+ nls_flags);
+ goto failed_mount;
+ }
+ ext4_msg(sb, KERN_INFO,"Using encoding defined by superblock: "
+ "%s-%s with flags 0x%hx", encoding_info->name,
+ encoding_info->version?:"\b", nls_flags);
+
+ sbi->s_encoding = encoding;
+ sbi->s_encoding_flags = nls_flags;
+ }
+#endif
+
if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
printk_once(KERN_WARNING "EXT4-fs: Warning: mounting "
"with data=journal disables delayed "
@@ -4547,6 +4619,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
failed_mount:
if (sbi->s_chksum_driver)
crypto_free_shash(sbi->s_chksum_driver);
+
+#ifdef CONFIG_NLS
+ unload_nls(sbi->s_encoding);
+#endif
+
#ifdef CONFIG_QUOTA
for (i = 0; i < EXT4_MAXQUOTAS; i++)
kfree(sbi->s_qf_names[i]);
--
2.20.0.rc2

2018-12-06 22:50:12

by Dave Chinner

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Thu, Dec 06, 2018 at 05:04:06PM -0500, Gabriel Krisman Bertazi wrote:
> Hi,
>
> Following the e2fsprogs changes, these are the corresponding kernel-side
> modifications to support the fname_encoding feature.
>
> The patches are split in two parts. The fist 14 patches are refactoring
> and improvements to the NLS code, including the utf8 normalization
> support. The final patches implement the fname_encoding feature in ext4.

Please repost this all to [email protected]. You're
changing a significant amount of non-ext4 filesystem code, as well
as adding core filesystem infrastructure so it needs to have wider
visibility and review than just the ext4 list.

Thanks!

-Dave.
--
Dave Chinner
[email protected]

2018-12-06 22:05:23

by Gabriel Krisman Bertazi

Subject: [PATCH v4 11/23] nls: ascii: Support validation and normalization operations

From: Gabriel Krisman Bertazi <[email protected]>

Validation is trivial: any byte that has the MSB set is an invalid
sequence.

Casefold can be implemented with uppercase or lowercase, and we have no
specification on that. Callers should be safe using either of them, as
long as it doesn't change.
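
As a rough userspace sketch of the two operations described above (the helper names `is_valid_ascii` and `ascii_fold_lower` are hypothetical, and lowercase is picked arbitrarily as the fold direction):

```c
#include <stddef.h>

/* A byte is valid ASCII when its most significant bit is clear. */
static int is_valid_ascii(const unsigned char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] & 0x80)
			return 0;
	return 1;
}

/*
 * Fold src into dst using lowercase. Returns the number of bytes
 * written, or -1 if dst is too small. The output is not
 * NUL-terminated, so dlen == len is sufficient.
 */
static int ascii_fold_lower(const unsigned char *src, size_t len,
			    unsigned char *dst, size_t dlen)
{
	size_t i;

	if (dlen < len)
		return -1;

	for (i = 0; i < len; i++) {
		unsigned char c = src[i];

		dst[i] = (c >= 'A' && c <= 'Z') ?
			 (unsigned char)(c + ('a' - 'A')) : c;
	}
	return (int)len;
}
```

Either fold direction gives the same equivalence classes for ASCII; what matters for on-disk hashes is that the choice never changes for a given filesystem.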

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/nls_ascii.c | 50 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/nls.h | 8 ++++++++
2 files changed, 58 insertions(+)

diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c
index 2f4826478d3d..079a1574c19d 100644
--- a/fs/nls/nls_ascii.c
+++ b/fs/nls/nls_ascii.c
@@ -12,6 +12,7 @@
#include <linux/string.h>
#include <linux/nls.h>
#include <linux/errno.h>
+#include <linux/slab.h>

static const wchar_t charset2uni[256] = {
/* 0x00*/
@@ -117,6 +118,8 @@ static const unsigned char charset2upper[256] = {
0x58, 0x59, 0x5a, 0x7b, 0x7c, 0x7d, 0x7e, 0x7f, /* 0x78-0x7f */
};

+#define VALID_ASCII(c) (c < 128)
+
static int uni2char(wchar_t uni, unsigned char *out, int boundlen)
{
const unsigned char *uni2charset;
@@ -142,6 +145,16 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static int ascii_validate(const struct nls_table *table,
+ const unsigned char *str, size_t len)
+{
+ int i;
+ for (i = 0; i < len && str[i]; i++)
+ if (!VALID_ASCII(str[i]))
+ return -1;
+ return 0;
+}
+
static unsigned char charset_tolower(const struct nls_table *table,
unsigned int c){
return charset2lower[c];
@@ -152,11 +165,36 @@ static unsigned char charset_toupper(const struct nls_table *table,
return charset2upper[c];
}

+static int ascii_casefold(const struct nls_table *charset,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen)
+{
+ unsigned int i;
+
+ if (dlen < len)
+ return -EINVAL;
+
+ for (i = 0; i < len; i++) {
+ if (IS_STRICT_MODE(charset) && !VALID_ASCII(str[i]))
+ return -EINVAL;
+
+ if (IS_CASEFOLD_TYPE_ASCII_TOLOWER(charset))
+ dest[i] = charset_tolower(charset, str[i]);
+ else
+ dest[i] = charset_toupper(charset, str[i]);
+ }
+ dest[len] = '\0';
+
+ return len;
+}
+
static const struct nls_ops charset_ops = {
+ .validate = ascii_validate,
.lowercase = charset_toupper,
.uppercase = charset_tolower,
.uni2char = uni2char,
.char2uni = char2uni,
+ .casefold = ascii_casefold,
};

static struct nls_charset nls_charset;
@@ -165,9 +203,21 @@ static struct nls_table table = {
.ops = &charset_ops,
};

+struct nls_table *ascii_load_table(const char *version, unsigned int flags)
+{
+ if (flags & ~(NLS_STRICT_MODE) ||
+ (flags & NLS_NORMALIZATION_TYPE_MASK) != NLS_NORMALIZATION_TYPE_PLAIN)
+ return ERR_PTR(-EINVAL);
+
+ table.flags = flags;
+ return &table;
+}
+
+
static struct nls_charset nls_charset = {
.charset = "ascii",
.tables = &table,
+ .load_table = ascii_load_table,
};

static int __init init_nls_ascii(void)
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 44a06a9c69e7..aab60d4858ee 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -178,6 +178,14 @@ IS_CASEFOLD_TYPE_##charset##_##type(const struct nls_table *c) \
NLS_NORMALIZATION_FUNCS(ALL, PLAIN, NLS_NORMALIZATION_TYPE_PLAIN)
NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER)

+/* ASCII */
+
+#define NLS_ASCII_CASEFOLD_TOUPPER NLS_CASEFOLD_TYPE_TOUPPER
+#define NLS_ASCII_CASEFOLD_TOLOWER NLS_CASEFOLD_TYPE(1)
+
+NLS_CASEFOLD_FUNCS(ASCII, TOUPPER, NLS_ASCII_CASEFOLD_TOUPPER)
+NLS_CASEFOLD_FUNCS(ASCII, TOLOWER, NLS_ASCII_CASEFOLD_TOLOWER)
+
/* nls_base.c */
extern int __register_nls(struct nls_charset *, struct module *);
extern int unregister_nls(struct nls_charset *);
--
2.20.0.rc2

2018-12-09 21:05:48

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sun, Dec 9, 2018 at 12:53 PM Gabriel Krisman Bertazi
<[email protected]> wrote:
>
> As Ted mentioned the SMB case, in my understanding, we might have more
> users for in-kernel ut8 normalization/casefold comparison functions than
> just ext4 in the future.

Crossed emails.

See my note about how there really is not a single case-folding
library. It's simply not physically possible, because there are so
many different ideas about what case-folding actually means.

That's still true even if "everything is utf-8", sadly.

So how do you handle locale issues and things like "we have ten
different tables for utf-8 comparisons, and that's _ignoring_ the
issue of whether we combine or decompose characters"?

And there's no way you can use the existing nls interfaces for
upper/lower case, for example, since they are all limited to 256-byte
tables and direct accesses to said tables, afaik.
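
To illustrate the limitation, here is a small userspace sketch, assuming an ASCII-only per-byte mapping (the helpers `byte_toupper` and `bytewise_caseless_equal` are hypothetical, not existing nls code). A byte-at-a-time table cannot relate "é" (UTF-8 bytes 0xC3 0xA9) to "É" (0xC3 0x89), because the right mapping for a continuation byte depends on the lead byte before it:

```c
#include <stddef.h>

/* Per-byte uppercase mapping: only ASCII letters change. */
static unsigned char byte_toupper(unsigned char c)
{
	return (c >= 'a' && c <= 'z') ?
	       (unsigned char)(c - ('a' - 'A')) : c;
}

/* Compare two equal-length byte strings after per-byte folding. */
static int bytewise_caseless_equal(const unsigned char *a,
				   const unsigned char *b, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (byte_toupper(a[i]) != byte_toupper(b[i]))
			return 0;
	return 1;
}
```

With this mapping "abc" and "ABC" compare equal, while the two-byte sequences for "é" and "É" do not.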

And if that is where the extensions were, and that is why you changed
other filesystems, this all matters.

My *guess* is that what you really want is not really about unicode at
all, but specifically about just the NTFS rules. Which, yes, might
find generic sharing interest between cifs/ext4/etc, but my gut feel
is that they'd be specifically about some NTFS interoperability
library.

Because even then I think you might have issues like "NTFS-5.1" vs
"NTFS-4.0" etc.

Maybe you don't care, and you're picking just *one* version. And I
haven't seen the code.

Basically, I would not be surprised if the sanest model is simply to
make a "ntfs" library. Because I'm really fairly sure that OS X rules
are very different indeed, even if it too is "unicode".

Linus

2018-12-06 22:04:57

by Gabriel Krisman Bertazi

Subject: [PATCH v4 04/23] nls: Split default charset from NLS core

From: Gabriel Krisman Bertazi <[email protected]>

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/Makefile | 1 +
fs/nls/nls_core.c | 94 ++++++++++++++++++++++++++++
fs/nls/{nls_base.c => nls_default.c} | 93 +++------------------------
3 files changed, 102 insertions(+), 86 deletions(-)
create mode 100644 fs/nls/nls_core.c
rename fs/nls/{nls_base.c => nls_default.c} (90%)

diff --git a/fs/nls/Makefile b/fs/nls/Makefile
index ac54db297128..5f42ceff9d15 100644
--- a/fs/nls/Makefile
+++ b/fs/nls/Makefile
@@ -3,6 +3,7 @@
# Makefile for native language support
#

+nls_base-y := nls_core.o nls_default.o
obj-$(CONFIG_NLS) += nls_base.o

obj-$(CONFIG_NLS_CODEPAGE_437) += nls_cp437.o
diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
new file mode 100644
index 000000000000..3f7de8f4c5b2
--- /dev/null
+++ b/fs/nls/nls_core.c
@@ -0,0 +1,94 @@
+/*
+ * linux/fs/nls/nls_core.c
+ *
+ * Native language support--charsets and unicode translations.
+ * By Gordon Chaffee 1996, 1997
+ *
+ * Unicode based case conversion 1999 by Wolfram Pienkoss
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/string.h>
+#include <linux/nls.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/kmod.h>
+#include <linux/spinlock.h>
+
+static struct nls_table default_table;
+static struct nls_table *tables = &default_table;
+static DEFINE_SPINLOCK(nls_lock);
+
+int __register_nls(struct nls_table *nls, struct module *owner)
+{
+ struct nls_table ** tmp = &tables;
+
+ if (nls->next)
+ return -EBUSY;
+
+ nls->owner = owner;
+ spin_lock(&nls_lock);
+ while (*tmp) {
+ if (nls == *tmp) {
+ spin_unlock(&nls_lock);
+ return -EBUSY;
+ }
+ tmp = &(*tmp)->next;
+ }
+ nls->next = tables;
+ tables = nls;
+ spin_unlock(&nls_lock);
+ return 0;
+}
+EXPORT_SYMBOL(__register_nls);
+
+int unregister_nls(struct nls_table * nls)
+{
+ struct nls_table ** tmp = &tables;
+
+ spin_lock(&nls_lock);
+ while (*tmp) {
+ if (nls == *tmp) {
+ *tmp = nls->next;
+ spin_unlock(&nls_lock);
+ return 0;
+ }
+ tmp = &(*tmp)->next;
+ }
+ spin_unlock(&nls_lock);
+ return -EINVAL;
+}
+
+static struct nls_table *find_nls(char *charset)
+{
+ struct nls_table *nls;
+ spin_lock(&nls_lock);
+ for (nls = tables; nls; nls = nls->next) {
+ if (!strcmp(nls_charset_name(nls), charset))
+ break;
+ if (nls->alias && !strcmp(nls->alias, charset))
+ break;
+ }
+ if (nls && !try_module_get(nls->owner))
+ nls = NULL;
+ spin_unlock(&nls_lock);
+ return nls;
+}
+
+struct nls_table *load_nls(char *charset)
+{
+ return try_then_request_module(find_nls(charset), "nls_%s", charset);
+}
+
+void unload_nls(struct nls_table *nls)
+{
+ if (nls)
+ module_put(nls->owner);
+}
+
+EXPORT_SYMBOL(unregister_nls);
+EXPORT_SYMBOL(unload_nls);
+EXPORT_SYMBOL(load_nls);
+
+MODULE_LICENSE("Dual BSD/GPL");
diff --git a/fs/nls/nls_base.c b/fs/nls/nls_default.c
similarity index 90%
rename from fs/nls/nls_base.c
rename to fs/nls/nls_default.c
index 0bb0acf6893f..c5d7e8391b22 100644
--- a/fs/nls/nls_base.c
+++ b/fs/nls/nls_default.c
@@ -1,5 +1,5 @@
/*
- * linux/fs/nls/nls_base.c
+ * linux/fs/nls/nls_default.c
*
* Native language support--charsets and unicode translations.
* By Gordon Chaffee 1996, 1997
@@ -8,23 +8,17 @@
*
*/

+/*
+ * Sample implementation from Unicode home page.
+ * http://www.stonehand.com/unicode/standard/fss-utf.html
+ */
+
#include <linux/module.h>
-#include <linux/string.h>
-#include <linux/nls.h>
-#include <linux/kernel.h>
-#include <linux/errno.h>
-#include <linux/kmod.h>
-#include <linux/spinlock.h>
#include <asm/byteorder.h>
+#include <linux/nls.h>

static struct nls_table default_table;
-static struct nls_table *tables = &default_table;
-static DEFINE_SPINLOCK(nls_lock);

-/*
- * Sample implementation from Unicode home page.
- * http://www.stonehand.com/unicode/standard/fss-utf.html
- */
struct utf8_table {
int cmask;
int cval;
@@ -232,73 +226,6 @@ int utf16s_to_utf8s(const wchar_t *pwcs, int inlen, enum utf16_endian endian,
}
EXPORT_SYMBOL(utf16s_to_utf8s);

-int __register_nls(struct nls_table *nls, struct module *owner)
-{
- struct nls_table ** tmp = &tables;
-
- if (nls->next)
- return -EBUSY;
-
- nls->owner = owner;
- spin_lock(&nls_lock);
- while (*tmp) {
- if (nls == *tmp) {
- spin_unlock(&nls_lock);
- return -EBUSY;
- }
- tmp = &(*tmp)->next;
- }
- nls->next = tables;
- tables = nls;
- spin_unlock(&nls_lock);
- return 0;
-}
-EXPORT_SYMBOL(__register_nls);
-
-int unregister_nls(struct nls_table * nls)
-{
- struct nls_table ** tmp = &tables;
-
- spin_lock(&nls_lock);
- while (*tmp) {
- if (nls == *tmp) {
- *tmp = nls->next;
- spin_unlock(&nls_lock);
- return 0;
- }
- tmp = &(*tmp)->next;
- }
- spin_unlock(&nls_lock);
- return -EINVAL;
-}
-
-static struct nls_table *find_nls(char *charset)
-{
- struct nls_table *nls;
- spin_lock(&nls_lock);
- for (nls = tables; nls; nls = nls->next) {
- if (!strcmp(nls_charset_name(nls), charset))
- break;
- if (nls->alias && !strcmp(nls->alias, charset))
- break;
- }
- if (nls && !try_module_get(nls->owner))
- nls = NULL;
- spin_unlock(&nls_lock);
- return nls;
-}
-
-struct nls_table *load_nls(char *charset)
-{
- return try_then_request_module(find_nls(charset), "nls_%s", charset);
-}
-
-void unload_nls(struct nls_table *nls)
-{
- if (nls)
- module_put(nls->owner);
-}
-
static const wchar_t charset2uni[256] = {
/* 0x00*/
0x0000, 0x0001, 0x0002, 0x0003,
@@ -543,10 +470,4 @@ struct nls_table *load_nls_default(void)
else
return &default_table;
}
-
-EXPORT_SYMBOL(unregister_nls);
-EXPORT_SYMBOL(unload_nls);
-EXPORT_SYMBOL(load_nls);
EXPORT_SYMBOL(load_nls_default);
-
-MODULE_LICENSE("Dual BSD/GPL");
--
2.20.0.rc2

2018-12-06 22:05:04

by Gabriel Krisman Bertazi

Subject: [PATCH v4 06/23] nls: Add support for multiple versions of an encoding

From: Gabriel Krisman Bertazi <[email protected]>

This allows a user to request a specific version of an encoding, like
the version 10.0.0 of Unicode encoded in utf8.

Supporting specific versions of encodings is important to ensure
stability of names in filesystems, especially when doing transformations
like casefold and normalization. Even for unicode, where defined
code-points are stable, there is instability for code points that
weren't defined on a previous version, so the user might want to use an
older version of the encoding to ensure the encoding is exact.

Not every NLS charset supports this feature. It doesn't make sense for
many of them, like ASCII. Others just don't implement it yet, and may
never do so. In those cases, the interface allows the caller to get the
un-versioned charset, which is the same original behavior as if this
patch weren't applied. A user that is not interested in a specific
version can also ask for a versioned charset without specifying the
version, and in this case, NLS will return the latest version available
of that charset.
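
The selection rule described above (an exact version when one is requested, the latest one otherwise) might look roughly like this in userspace C. The version list and the `pick_version` helper are hypothetical and only serve the illustration:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list of supported versions, ordered oldest to newest. */
static const char *const known_versions[] = { "10.0.0", "11.0.0" };

/*
 * Return the version to load: the latest one when the caller passes
 * NULL, the exact match when one exists, and NULL otherwise.
 */
static const char *pick_version(const char *requested)
{
	size_t n = sizeof(known_versions) / sizeof(known_versions[0]);
	size_t i;

	if (!requested)
		return known_versions[n - 1];

	for (i = 0; i < n; i++)
		if (!strcmp(known_versions[i], requested))
			return known_versions[i];
	return NULL;
}
```

Requiring an exact match, instead of falling back to the nearest version, keeps a filesystem from silently being interpreted with different rules than the ones it was created with.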

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/nls_core.c | 45 ++++++++++++++++++++++++++++++++++++++-------
include/linux/nls.h | 8 ++++++++
2 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
index 200a7f8165e6..20e00a8b968c 100644
--- a/fs/nls/nls_core.c
+++ b/fs/nls/nls_core.c
@@ -19,10 +19,26 @@
extern struct nls_charset default_charset;
static struct nls_charset *charsets = &default_charset;
static DEFINE_SPINLOCK(nls_lock);
-static struct nls_table *nls_load_table(struct nls_charset *charset)
+
+static struct nls_table *nls_load_table(struct nls_charset *charset,
+ const char *version,
+ unsigned int flags)
{
- /* For now, return the default table, which is the first one found. */
- return charset->tables;
+ struct nls_table *tbl;
+
+ /* If there is no load_table hook, only 1 table is supported and
+ * it must have been loaded statically.
+ */
+ if (charset->load_table)
+ tbl = charset->load_table(version, flags);
+ else
+ tbl = charset->tables;
+
+ if (IS_ERR(tbl))
+ return tbl;
+
+ tbl->flags = flags;
+ return tbl;
}

int __register_nls(struct nls_charset *nls, struct module *owner)
@@ -85,21 +101,36 @@ static struct nls_charset *find_nls(const char *charset)
return nls;
}

-struct nls_table *load_nls(char *charset)
+struct nls_table *load_nls_version(const char *charset, const char *version,
+ unsigned int flags)
{
struct nls_charset *nls_charset;

nls_charset = try_then_request_module(find_nls(charset),
"nls_%s", charset);
- if (!IS_ERR(nls_charset))
+ if (IS_ERR(nls_charset))
+ return ERR_PTR(-EINVAL);
+
+ return nls_load_table(nls_charset, version, flags);
+}
+EXPORT_SYMBOL(load_nls_version);
+
+struct nls_table *load_nls(char *charset)
+{
+ struct nls_table *table = load_nls_version(charset, NULL, 0);
+
+ /* Pre-versioned load_nls() didn't return error pointers. Let's
+ * keep the abi for now to prevent breakage.
+ */
+ if (IS_ERR(table))
return NULL;

- return nls_load_table(nls_charset);
+ return table;
}

void unload_nls(struct nls_table *nls)
{
- if (nls)
+ if (!IS_ERR_OR_NULL(nls))
module_put(nls->charset->owner);
}

diff --git a/include/linux/nls.h b/include/linux/nls.h
index cdc95cd9e5d4..91524bb4477b 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -30,6 +30,9 @@ struct nls_ops {

struct nls_table {
const struct nls_charset *charset;
+ unsigned int version;
+ unsigned int flags;
+
const struct nls_ops *ops;
const unsigned char *charset2lower;
const unsigned char *charset2upper;
@@ -42,6 +45,8 @@ struct nls_charset {
struct module *owner;
struct nls_table *tables;
struct nls_charset *next;
+ struct nls_table *(*load_table)(const char *version,
+ unsigned int flags);
};

/* this value hold the maximum octet of charset */
@@ -58,6 +63,9 @@ enum utf16_endian {
extern int __register_nls(struct nls_charset *, struct module *);
extern int unregister_nls(struct nls_charset *);
extern struct nls_table *load_nls(char *);
+extern struct nls_table *load_nls_version(const char *charset,
+ const char *version,
+ unsigned int flags);
extern void unload_nls(struct nls_table *);
extern struct nls_table *load_nls_default(void);
#define register_nls(nls) __register_nls((nls), THIS_MODULE)
--
2.20.0.rc2

2018-12-06 22:05:18

by Gabriel Krisman Bertazi

Subject: [PATCH v4 10/23] nls: Add optional normalization and casefold hooks

From: Gabriel Krisman Bertazi <[email protected]>

The Normalization operation applies a transformation to strings to
obtain the normalization form, which allows the user to determine whether
any two strings are equivalent to each other. The NLS subsystem doesn't
impose any constraint on what it means to be equivalent, for any charset.
Unicode-based charsets, for instance, are free to support one, a few or
all kinds of Unicode equivalences.

The Casefold operation is similar to Normalization, in the sense that it
also allows the caller to identify equivalent strings, but it
disregards case, making it ideal for case-insensitive comparisons.

Default implementations are provided by the nls core, such that existing
charsets can operate on the new interface. The Normalization default
operation is the format NLS_NORMALIZATION_TYPE_PLAIN, which returns the
string unchanged, i.e. no normalization. The casefold default is
NLS_CASEFOLD_TYPE_TOUPPER, which returns the string with all
characters converted to uppercase.
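
For reference, the flag layout used by this patch packs the strict-mode bit, a 3-bit normalization type and a 3-bit casefold type into one word. The userspace sketch below copies the macro values from the diff; the accessor helpers (`casefold_type`, `normalization_type`, `is_strict`) are hypothetical additions for illustration:

```c
/* Bit 0: strict mode; bits 1-3: normalization type; bits 4-6: casefold type. */
#define NLS_STRICT_MODE			0x00000001u
#define NLS_NORMALIZATION_TYPE(i)	(((i) & 0x7u) << 1)
#define NLS_CASEFOLD_TYPE(i)		(((i) & 0x7u) << 4)
#define NLS_NORMALIZATION_TYPE_MASK	0x0000000Eu
#define NLS_CASEFOLD_TYPE_MASK		0x00000070u

/* Extract the casefold type index from a flags word. */
static unsigned int casefold_type(unsigned int flags)
{
	return (flags & NLS_CASEFOLD_TYPE_MASK) >> 4;
}

/* Extract the normalization type index from a flags word. */
static unsigned int normalization_type(unsigned int flags)
{
	return (flags & NLS_NORMALIZATION_TYPE_MASK) >> 1;
}

/* Non-zero when the strict-mode bit is set. */
static int is_strict(unsigned int flags)
{
	return (flags & NLS_STRICT_MODE) != 0;
}
```

A flags value of 0 therefore selects both defaults: plain normalization (type 0) and toupper casefold (type 0), with strict mode off.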

Changes since V1:
- Add default operations for casefold and normalization

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/nls_core.c | 11 +++++
include/linux/nls.h | 116 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 127 insertions(+)

diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
index 49a15bb2174f..c49088f36f4c 100644
--- a/fs/nls/nls_core.c
+++ b/fs/nls/nls_core.c
@@ -25,6 +25,17 @@ static int nls_validate_flags(struct nls_table *table, unsigned int flags)
if (flags & NLS_STRICT_MODE && !table->ops->validate)
return -1;

+ if ((flags & NLS_NORMALIZATION_TYPE_MASK) && !table->ops->normalize)
+ return -1;
+
+ if ((flags & NLS_CASEFOLD_TYPE_MASK) && !table->ops->casefold)
+ return -1;
+
+ /* Reject unused flags */
+ if (flags & ~(NLS_CASEFOLD_TYPE_MASK | NLS_NORMALIZATION_TYPE_MASK |
+ NLS_STRICT_MODE))
+ return -1;
+
return 0;
}

diff --git a/include/linux/nls.h b/include/linux/nls.h
index 980103d4c363..44a06a9c69e7 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -4,6 +4,7 @@

#include <linux/init.h>
#include <linux/string.h>
+#include <linux/errno.h>

/* Unicode has changed over the years. Unicode code points no longer
* fit into 16 bits; as of Unicode 5 valid code points range from 0
@@ -65,6 +66,51 @@ struct nls_ops {
int (*strncasecmp)(const struct nls_table *charset,
const unsigned char *str1, size_t len1,
const unsigned char *str2, size_t len2);
+ /**
+ * @normalize:
+ *
+ * Obtain the normalized form of a string, which can be used to
+ * determine whether any two strings are equivalent. The NLS
+ * subsystem doesn't impose any constraint on the charsets
+ * regarding what it means to be equivalent. Unicode-based
+ * charsets, for instance, are free to support one, a few or all
+ * kinds of Unicode equivalences. Different kinds of
+ * normalizations can be specified using the nls_table flags.
+ *
+ * This hook is responsible for performing string validation if
+ * the strict mode flag is set. The only case where it is not
+ * called by nls_core is when strict mode and normalization are
+ * disabled, because in this case the normalization is
+ * guaranteed to be the string identity.
+ *
+ * Not every charset implements this hook. It is only required
+ * if the charset supports strict mode or some kind of
+ * normalization.
+ *
+ * If this operation cannot be executed for this charset,
+ * -ENOTSUPP is returned. If the sequence is invalid, -EINVAL
+ * is returned. Otherwise, this function returns the size of the
+ * new string.
+ **/
+ int (*normalize)(const struct nls_table *charset,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen);
+ /**
+ * @casefold:
+ *
+ * Casefold returns a version of the string that can be used to
+ * perform case-insensitive comparisons. The kind of casefold
+ * algorithm that will be used is charset dependent, and can be
+ * configured using the nls_table flags field.
+ *
+ * If this operation cannot be executed for this charset,
+ * -ENOTSUPP is returned. If the sequence fails, -EINVAL is
+ * returned. Otherwise, this function returns the size of the
+ * new string.
+ **/
+ int (*casefold)(const struct nls_table *charset,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen);
unsigned char (*lowercase)(const struct nls_table *charset,
unsigned int c);
unsigned char (*uppercase)(const struct nls_table *charset,
@@ -101,13 +147,37 @@ enum utf16_endian {
UTF16_BIG_ENDIAN
};

+#define NLS_NORMALIZATION_TYPE(i) ((i & 0x7) << 1)
+#define NLS_CASEFOLD_TYPE(i) ((i & 0x7) << 4)
+
#define NLS_STRICT_MODE 0x00000001
+#define NLS_NORMALIZATION_TYPE_PLAIN NLS_NORMALIZATION_TYPE(0)
+#define NLS_NORMALIZATION_TYPE_MASK 0x0000000E
+#define NLS_CASEFOLD_TYPE_TOUPPER NLS_CASEFOLD_TYPE(0)
+#define NLS_CASEFOLD_TYPE_MASK 0x00000070

static inline int IS_STRICT_MODE(const struct nls_table *charset)
{
return (charset->flags & NLS_STRICT_MODE);
}

+#define NLS_NORMALIZATION_FUNCS(charset, type, i) \
+static inline int \
+IS_NORMALIZATION_TYPE_##charset##_##type(const struct nls_table *c) \
+{ \
+ return ((c->flags & NLS_NORMALIZATION_TYPE_MASK) == i); \
+}
+
+#define NLS_CASEFOLD_FUNCS(charset, type, i) \
+static inline int \
+IS_CASEFOLD_TYPE_##charset##_##type(const struct nls_table *c) \
+{ \
+ return ((c->flags & NLS_CASEFOLD_TYPE_MASK) == i); \
+}
+
+NLS_NORMALIZATION_FUNCS(ALL, PLAIN, NLS_NORMALIZATION_TYPE_PLAIN)
+NLS_CASEFOLD_FUNCS(ALL, TOUPPER, NLS_CASEFOLD_TYPE_TOUPPER)
+
/* nls_base.c */
extern int __register_nls(struct nls_charset *, struct module *);
extern int unregister_nls(struct nls_charset *);
@@ -213,6 +283,52 @@ static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1,
return nls_strncasecmp(t, s1, len, s2, len);
}

+static inline int nls_casefold(const struct nls_table *t,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen)
+{
+ int i;
+
+ if (t->ops->casefold)
+ return t->ops->casefold(t, str, len, dest, dlen);
+
+ if (!IS_CASEFOLD_TYPE_ALL_TOUPPER(t))
+ return -ENOTSUPP;
+
+ if (IS_STRICT_MODE(t) && nls_validate(t, str, len))
+ return -EINVAL;
+
+ if (len > dlen)
+ return -EINVAL;
+
+ for (i = 0 ; i < len; i++)
+ dest[i] = nls_toupper(t, str[i]);
+
+ return len;
+}
+
+static inline int nls_normalize(const struct nls_table *t,
+ const unsigned char *str, size_t len,
+ unsigned char *dest, size_t dlen)
+{
+ if (t->ops->normalize)
+ return t->ops->normalize(t, str, len, dest, dlen);
+
+ if (!IS_NORMALIZATION_TYPE_ALL_PLAIN(t))
+ return -ENOTSUPP;
+
+ if (IS_STRICT_MODE(t) && nls_validate(t, str, len))
+ return -EINVAL;
+
+ if (len > dlen)
+ return -EINVAL;
+
+ /* If normalization is disabled, normalization is the
+ * identity. */
+ strncpy(dest, str, len);
+ return len;
+}
+
/*
* nls_nullsize - return length of null character for codepage
* @codepage - codepage for which to return length of NULL terminator
--
2.20.0.rc2
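
The flag layout introduced in this patch can be checked in isolation. What follows is a minimal user-space sketch, not part of the series: the macro values are copied from the hunk above (with the macro argument parenthesized), while the three nls_flags_* helpers are hypothetical names added only to show how bit 0 holds strict mode, bits 1-3 the normalization type and bits 4-6 the casefold type.

```c
#include <stdint.h>

/* Values copied from the patch; the argument is parenthesized here. */
#define NLS_STRICT_MODE              0x00000001
#define NLS_NORMALIZATION_TYPE(i)    (((i) & 0x7) << 1)
#define NLS_NORMALIZATION_TYPE_MASK  0x0000000E
#define NLS_CASEFOLD_TYPE(i)         (((i) & 0x7) << 4)
#define NLS_CASEFOLD_TYPE_MASK       0x00000070

/* Hypothetical helper: nonzero if the strict-mode bit (bit 0) is set. */
static inline int nls_flags_strict(uint32_t flags)
{
	return (flags & NLS_STRICT_MODE) != 0;
}

/* Hypothetical helper: index stored in the normalization field (bits 1-3). */
static inline unsigned int nls_flags_normalization(uint32_t flags)
{
	return (flags & NLS_NORMALIZATION_TYPE_MASK) >> 1;
}

/* Hypothetical helper: index stored in the casefold field (bits 4-6). */
static inline unsigned int nls_flags_casefold(uint32_t flags)
{
	return (flags & NLS_CASEFOLD_TYPE_MASK) >> 4;
}
```

With this layout a flags value of 0 selects non-strict mode, the PLAIN normalization type and the TOUPPER casefold type, which matches the defaults that nls_normalize() and nls_casefold() fall back to.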

2018-12-08 21:49:15

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sat, Dec 8, 2018 at 12:22 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> There's a patch series that's been baking for a while that will likely
> go upstream either in the next upcoming merge window, or the one after
> that. Since it adds support for Unicode case-folding, it involves a
> non-trivial number of changes to fs/nls. As near as I can tell, no
> one is really maintaining fs/nls.

Christ.

Why do people want to do this? We know it's a crazy and stupid thing
to do. And we know that, exactly because people have done it, and it
has always been a mistake.

It causes actual and very subtle security issues.

It breaks things subtly even when they supposedly "know" about case
folding because different things will do it differently (ie user space
vs kernel space not having the *exact* same rules due to using
different tables, for example).

It doesn't work with locales, because people often want different
locales at the same time.

And it slows things down enormously because you can't do hashing well,
and comparisons get hugely more expensive.

And to add insult to injury, people always implement it so *horribly*
badly that it's not even funny.

For example, the usual way that people do it is to case-fold two
strings, and then compare the end results. And that's *incredibly*
stupid and slow and generates extra temporary allocations etc.

Or people do it character-by-character instead, and don't understand
utf-8 (which is literally designed to be easy to see character
boundaries *without* having to do a full decode!), and do *that*
incredibly badly instead.
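
The point about character boundaries can be made concrete with a short sketch. It is an illustration only, it assumes the input is already valid UTF-8, and the function names are invented for the example. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a boundary is simply any byte that does not match that pattern.

```c
#include <stddef.h>

/* A UTF-8 continuation byte has its top two bits set to 10. */
static inline int utf8_is_continuation(unsigned char b)
{
	return (b & 0xC0) == 0x80;
}

/*
 * Count characters in a length-delimited buffer without decoding
 * any code point: every byte that is not a continuation byte
 * starts a new character. Assumes valid UTF-8.
 */
static size_t utf8_count_chars(const unsigned char *s, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < len; i++)
		if (!utf8_is_continuation(s[i]))
			n++;
	return n;
}

/*
 * Return the offset of the first character boundary after pos,
 * or len if there is none. Assumes valid UTF-8.
 */
static size_t utf8_next_boundary(const unsigned char *s, size_t len,
				 size_t pos)
{
	if (pos < len)
		pos++;
	while (pos < len && utf8_is_continuation(s[pos]))
		pos++;
	return pos;
}
```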

And when you create a file with an ambiguous name, what does readdir
report? Does it report the name you used, some normalized thing, or
what?

Finally, people then invariably do it in ways that preclude any
concurrent sane uses.

For example, they make it a single mount-time flag for the whole
filesystem, so now if you are (for example) wanting to do emulation of
bad system decisions, you now force the *host* to buy into the whole
mistake too.

And they make it a whole-filesystem flag, instead of (for example)
allowing just the emulated environment to do case-insensitive
filesystem operations on an operation-by-operation basis, and possibly
only within a particular subdirectory structure (or bind mount).

So the first thing I want to know is who really needs it, *why* they
need it, and what the design is for.

Because I can almost guarantee that the design is horrible, and the
reasons are really really bad.

And what *are* the case insensitivity rules, and how do you co-exist
when there are two *different* folding rules at the same time? For
example, OS X has some truly horrendously bad rules, that take the
badness that Windows did to a whole different level. What if you're a
file server (or emulation environment) and you want to expose the same
filesystem to both of those environments?

Because it would quite possibly be a whole lot better to allow
per-operation flags, so that you can do

fd = openat(dir, path, O_RDONLY | O_ICASE);

so that you can allow *one* process to treat a filesystem as if it was
case insensitive (think "Wine with a ~/.wine/C directory"), without
forcing the whole filesystem to be icase.

Yes, allowing concurrent use then generates whole new "interesting"
questions, like "what happens if a case _sensitive_ user creates two
files with names that are identical to a case-insensitive user", but they
aren't necessarily any worse than the issues you face *not* allowing
that.

> Given your recent comments about not wanting to see pull requests for
> things outside of fs/xfs as part of the xfs pull, do you have any
> opinions about how to manage this feature going upstream? My
> original plan was to send them through the ext4 tree, since I very
> much doubt Al cares much about nls issues, and they will only impact
> ext4.

I really want to know what is driving this insanity, and what the
actual use-case is.

You have a diffstat, but not a git tree to look at what the heck is going on.

Seriously, case insensitivity is *such* a horrendously bad idea that
people need to think about it deeply, and nobody seems to ever do
that.

And yes, we have d_hash() and some rudimentary support for it in the
VFS layer, but that VFS layer bit was always meant purely for
interoperability filesystems that nobody really cared about as a real
filesystem for Linux. Notably FAT and its ilk.

If we have a major native filesystem doing it, I think we need to
actively think about the big picture and do it *right*. None of the
crazy "ok, you can't even look things up in the dcache directly at
all" stuff that we have as a hack to just allow _bad_ filesystems to
do their thing.

So I think this is a bigger deal than that diffstat of yours implies.
I don't think people understand just how *bad* case insensitivity is.

The old DOS/Mac people thought case insensitivity was a "helpful"
idea, and that was understandable - but wrong - even back in the 80's.
They are still living with the end result of that horrendously bad
decision decades later. They've _tried_ to fix their bad decisions,
and have never been able to (except, apparently, in iOS where somebody
finally had a glimmer of a clue).

Linus

2018-12-10 19:35:38

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sun, Dec 9, 2018 at 4:08 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> So things are much better in recent years. In the past it was kind of
> a disaster, but the world is converging enough that the latest
> versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
> way. They are both case-insensitive, case-preserving and
> normalization-preserving, normalization-insensitive with respect to
> filenames.

Oh, so APFS at least fixed *that* horrific problem with their
filesystem. Oh how I despised the exposure of NFD (which should at
most be used as an internal representation, not externally visible).
Turning basic letters (coming from Finland, åäö) into character
combinations was an absolute abomination.

> In the bad old-days, MacOS X's HFS+ was not normalization-preserving.

Oh, I'm very aware.

It's not even that it wasn't normalization-preserving, it picked the
*wrong* normalization to use.

> Now, both file systems basically say, "we don't care whether you pass
> in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
> will treat it as the same filename; but readdir(2) will return what
> you gave us."

Actually, the "on the screen it will look identical" is a horribly
incorrect thing to do too.

There are lots of things that look identical on the screen without
being at all the same thing. Sometimes it depends on font, sometimes
it's just how it is. A nonbreaking space is *not* the same as a
regular space, even if they may look identical on the screen.

I suspect (and sincerely _hope_) neither filesystem actually does
anything as stupid as taking "glyph equivalence" into account.

I'm hoping it's just "convert to NFx, then lower-case, then compare
for equality". Where the 'x' doesn't much matter as long as it is
never _exposed_ in any way outside of the comparison (ie NFD is a fine
and probably simpler model for the lower-casing, the HFS+ mistake was
to then expose the corrupted form of the filename).

> It's been a *long* time since Unicode has changed case folding rules
> for pre-existing characters. The tables have only changed with
> respect to new character sets that have been added.

But new characters _have_ been added, and some of them do have
lower-case form, so the folding tables have changed.

Happily, maybe that is over. As long as the Unicode people continue to
mainly play with their Emoji list, I guess we can consider it done.

> So how about this? We'll put the unicode handling functions in a new
> directory, fs/unicode, just to make it really clear that this will not
> be changing any of the legacy fs/nls functions which other file
> systems will use. By putting it in a separate directory, it will be
> easier for other file systems to use it, whether it's for better Samba
> or NFSv4 support.

Ok, that sounds fine.

Some of the unicode translation functions from the NLS code could well
move into that, and NLS itself could be relegated to the sad
historical thing.

And please try to make the *interfaces* sane.

For example, the interface for "let's compare with folded case" should
*not* be about "convert to NFKD and lower case into a temp buffer,
then compare the results".

You can do a lot of "let's handle the simple cases" faster even if the
"oh, I hit a complex character" case might then become one of those
"convert to a temp buffer" cases.

And it shouldn't be about C strings, since we very much have cases
where it's not a C string but a {ptr,len} tuple. Maybe even use the
"struct qstr", which is a not-horrible way to pass those around.

Even if you have a C string, you can always just do

struct qstr str = QSTR_INIT(name, strlen(name));

and then pass that qstr pointer around.

Finally, don't do the NLS thing with "descriptors" that you register
and look up. The indirection kills you. Particularly the crazy "one
character at a time" model.

Just let people explicitly say "utf8_icasecmp(qstr, qstr)" or
something like that. With the interface at least allowing for the
common simple cases (ie everything is in the ASCII subset) to be
handled basically as a specialized thing.
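
One possible shape for such an interface, written as a user-space sketch rather than kernel code: struct lstr and LSTR_INIT are hypothetical stand-ins for struct qstr and QSTR_INIT, sketch_icasecmp is an invented name, and lstr_bytecmp is only a placeholder for a real slow path that would normalize and casefold. The comparison stays on a byte-by-byte ASCII fast path and hands the whole pair to the slow path as soon as either string shows a byte with the high bit set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-in for the kernel's struct qstr. */
struct lstr {
	const unsigned char *name;
	size_t len;
};
#define LSTR_INIT(n, l) { .name = (const unsigned char *)(n), .len = (l) }

/* Slow path: returns 0 when the two strings are considered equal. */
typedef int (*lstr_cmp_fn)(const struct lstr *a, const struct lstr *b);

/* Placeholder slow path: exact byte comparison, no Unicode folding. */
static int lstr_bytecmp(const struct lstr *a, const struct lstr *b)
{
	if (a->len != b->len)
		return 1;
	return memcmp(a->name, b->name, a->len) != 0;
}

static unsigned char ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

static int has_non_ascii(const unsigned char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] & 0x80)
			return 1;
	return 0;
}

/* Returns 0 if a and b match ignoring case, nonzero otherwise. */
static int sketch_icasecmp(const struct lstr *a, const struct lstr *b,
			   lstr_cmp_fn slow)
{
	size_t min = a->len < b->len ? a->len : b->len;
	size_t i;

	for (i = 0; i < min; i++) {
		unsigned char ca = a->name[i], cb = b->name[i];

		/* Any non-ASCII byte: give up on the fast path. */
		if ((ca | cb) & 0x80)
			return slow(a, b);
		if (ascii_fold(ca) != ascii_fold(cb))
			return 1;
	}
	if (a->len == b->len)
		return 0;

	/* Common prefix matched; a non-ASCII tail still needs the slow path. */
	if (a->len > min ? has_non_ascii(a->name + min, a->len - min)
			 : has_non_ascii(b->name + min, b->len - min))
		return slow(a, b);
	return 1;
}
```

No temporary buffer is allocated on the fast path, and the {ptr,len} pair means the caller never needs a NUL-terminated copy of the name.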

Linus

2018-12-06 22:05:13

by Gabriel Krisman Bertazi

Subject: [PATCH v4 08/23] nls: Let charsets define the behavior of tolower/toupper

From: Gabriel Krisman Bertazi <[email protected]>

Instead of always reading from a table, give the charset a chance to
implement tolower() and toupper() algorithmically.

This allows us to drop a lot of tables which hardcode the identity
functions (like ASCII), and replace them with a few lines of code in the
hooks.
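
As a rough illustration of what an algorithmic hook could look like (a sketch, not code from this series): for plain ASCII the case mapping is a range check, so no 256-entry table is needed. The names ascii_tolower and ascii_toupper are made up for the example, and struct nls_table is only forward-declared so that the signatures match the hooks added in this patch.

```c
#include <stddef.h>

/* Forward declaration only; the hooks below never dereference it. */
struct nls_table;

/* Map 'A'..'Z' to 'a'..'z'; every other value is returned unchanged. */
static unsigned char ascii_tolower(const struct nls_table *table,
				   unsigned int c)
{
	(void)table;
	if (c >= 'A' && c <= 'Z')
		return (unsigned char)(c - 'A' + 'a');
	return (unsigned char)c;
}

/* Map 'a'..'z' to 'A'..'Z'; every other value is returned unchanged. */
static unsigned char ascii_toupper(const struct nls_table *table,
				   unsigned int c)
{
	(void)table;
	if (c >= 'a' && c <= 'z')
		return (unsigned char)(c - 'a' + 'A');
	return (unsigned char)c;
}
```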

This patch was created using the semantic patch below, with the
exception of the header files (hook definitions) and a fix to files that
didn't have the tables statically allocated (koi8-u and cp932).

<smpl>

@tbl@
identifier p;
expression lower_tbl;
expression upper_tbl;
@@

static struct nls_table p = {
- .charset2lower = lower_tbl,
- .charset2upper = upper_tbl,
};

@@
identifier charset_ops;
expression tbl.lower_tbl;
expression tbl.upper_tbl;
@@

+ static unsigned char charset_tolower(const struct nls_table *table, unsigned int c)
+ {
+ return lower_tbl[c];
+ }
+
+ static unsigned char charset_toupper(const struct nls_table *table, unsigned int c)
+ {
+ return upper_tbl[c];
+ }

static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
};

@@
struct nls_table *t;
expression A;
expression nc;
@@

(
- nc = t->charset2lower[A]
+ nc = nls_tolower(t, A)

|
- nc = t->charset2upper[A]
+ nc = nls_toupper(t, A)
)
<...
- if(!nc)
- nc = A;
...>

</smpl>

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/fat/dir.c | 5 +----
fs/nls/mac-celtic.c | 14 ++++++++++++--
fs/nls/mac-centeuro.c | 14 ++++++++++++--
fs/nls/mac-croatian.c | 14 ++++++++++++--
fs/nls/mac-cyrillic.c | 14 ++++++++++++--
fs/nls/mac-gaelic.c | 14 ++++++++++++--
fs/nls/mac-greek.c | 14 ++++++++++++--
fs/nls/mac-iceland.c | 14 ++++++++++++--
fs/nls/mac-inuit.c | 14 ++++++++++++--
fs/nls/mac-roman.c | 14 ++++++++++++--
fs/nls/mac-romanian.c | 14 ++++++++++++--
fs/nls/mac-turkish.c | 14 ++++++++++++--
fs/nls/nls_ascii.c | 14 ++++++++++++--
fs/nls/nls_cp1250.c | 14 ++++++++++++--
fs/nls/nls_cp1251.c | 14 ++++++++++++--
fs/nls/nls_cp1255.c | 14 ++++++++++++--
fs/nls/nls_cp437.c | 14 ++++++++++++--
fs/nls/nls_cp737.c | 14 ++++++++++++--
fs/nls/nls_cp775.c | 14 ++++++++++++--
fs/nls/nls_cp850.c | 14 ++++++++++++--
fs/nls/nls_cp852.c | 14 ++++++++++++--
fs/nls/nls_cp855.c | 14 ++++++++++++--
fs/nls/nls_cp857.c | 14 ++++++++++++--
fs/nls/nls_cp860.c | 14 ++++++++++++--
fs/nls/nls_cp861.c | 14 ++++++++++++--
fs/nls/nls_cp862.c | 14 ++++++++++++--
fs/nls/nls_cp863.c | 14 ++++++++++++--
fs/nls/nls_cp864.c | 14 ++++++++++++--
fs/nls/nls_cp865.c | 14 ++++++++++++--
fs/nls/nls_cp866.c | 14 ++++++++++++--
fs/nls/nls_cp869.c | 14 ++++++++++++--
fs/nls/nls_cp874.c | 14 ++++++++++++--
fs/nls/nls_cp932.c | 14 ++++++++++++--
fs/nls/nls_cp936.c | 14 ++++++++++++--
fs/nls/nls_cp949.c | 14 ++++++++++++--
fs/nls/nls_cp950.c | 14 ++++++++++++--
fs/nls/nls_default.c | 14 ++++++++++++--
fs/nls/nls_euc-jp.c | 7 ++++---
fs/nls/nls_iso8859-1.c | 14 ++++++++++++--
fs/nls/nls_iso8859-13.c | 14 ++++++++++++--
fs/nls/nls_iso8859-14.c | 14 ++++++++++++--
fs/nls/nls_iso8859-15.c | 14 ++++++++++++--
fs/nls/nls_iso8859-2.c | 14 ++++++++++++--
fs/nls/nls_iso8859-3.c | 14 ++++++++++++--
fs/nls/nls_iso8859-4.c | 14 ++++++++++++--
fs/nls/nls_iso8859-5.c | 14 ++++++++++++--
fs/nls/nls_iso8859-6.c | 14 ++++++++++++--
fs/nls/nls_iso8859-7.c | 14 ++++++++++++--
fs/nls/nls_iso8859-9.c | 14 ++++++++++++--
fs/nls/nls_koi8-r.c | 14 ++++++++++++--
fs/nls/nls_koi8-ru.c | 6 +++---
fs/nls/nls_koi8-u.c | 14 ++++++++++++--
fs/nls/nls_utf8.c | 14 ++++++++++++--
include/linux/nls.h | 17 +++++++++++------
54 files changed, 619 insertions(+), 116 deletions(-)

diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index d5f856651a08..6518886ee5cf 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -215,10 +215,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c,
*uni = 0x003f; /* a question mark */
charlen = 1;
} else if (charlen <= 1) {
- unsigned char nc = t->charset2lower[*c];
-
- if (!nc)
- nc = *c;
+ unsigned char nc = nls_tolower(t, *c);

charlen = nls_char2uni(t, &nc, 1, uni);
if (charlen < 0) {
diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c
index 4fe7347c55d6..7207f9a14342 100644
--- a/fs/nls/mac-celtic.c
+++ b/fs/nls/mac-celtic.c
@@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -586,8 +598,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c
index 2d115aae4240..0664408e4451 100644
--- a/fs/nls/mac-centeuro.c
+++ b/fs/nls/mac-centeuro.c
@@ -507,7 +507,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -516,8 +528,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c
index b496b85fcde1..a4b7992ef8ec 100644
--- a/fs/nls/mac-croatian.c
+++ b/fs/nls/mac-croatian.c
@@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -586,8 +598,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c
index 18c9e0eb8e58..cb60563911ea 100644
--- a/fs/nls/mac-cyrillic.c
+++ b/fs/nls/mac-cyrillic.c
@@ -472,7 +472,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -481,8 +493,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c
index 8f8d6ae20f02..e683881f4a13 100644
--- a/fs/nls/mac-gaelic.c
+++ b/fs/nls/mac-gaelic.c
@@ -542,7 +542,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -551,8 +563,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c
index 0e2c12fe3447..bd2245238512 100644
--- a/fs/nls/mac-greek.c
+++ b/fs/nls/mac-greek.c
@@ -472,7 +472,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -481,8 +493,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c
index 414767fa47a4..3ce3e27b3660 100644
--- a/fs/nls/mac-iceland.c
+++ b/fs/nls/mac-iceland.c
@@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -586,8 +598,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c
index 0e06fd3a0c8f..6f12cccccb37 100644
--- a/fs/nls/mac-inuit.c
+++ b/fs/nls/mac-inuit.c
@@ -507,7 +507,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -516,8 +528,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c
index fcfd387cfaa8..d8e411c82c69 100644
--- a/fs/nls/mac-roman.c
+++ b/fs/nls/mac-roman.c
@@ -612,7 +612,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -621,8 +633,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c
index 74027022a135..cd638dfe9d7c 100644
--- a/fs/nls/mac-romanian.c
+++ b/fs/nls/mac-romanian.c
@@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -586,8 +598,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c
index 0edc0f8b1f4d..82ba6f6b4c24 100644
--- a/fs/nls/mac-turkish.c
+++ b/fs/nls/mac-turkish.c
@@ -577,7 +577,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -586,8 +598,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c
index 3c3ee908d1ed..2f4826478d3d 100644
--- a/fs/nls/nls_ascii.c
+++ b/fs/nls/nls_ascii.c
@@ -142,7 +142,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -151,8 +163,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c
index 080717694405..1cfe65851185 100644
--- a/fs/nls/nls_cp1250.c
+++ b/fs/nls/nls_cp1250.c
@@ -323,7 +323,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -332,8 +344,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c
index 2fba498ab289..061eb23892f1 100644
--- a/fs/nls/nls_cp1251.c
+++ b/fs/nls/nls_cp1251.c
@@ -277,7 +277,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -286,8 +298,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c
index c268e8d8c038..2a71dc175c9b 100644
--- a/fs/nls/nls_cp1255.c
+++ b/fs/nls/nls_cp1255.c
@@ -358,7 +358,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -367,8 +379,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c
index f24f8691e720..4f763761b699 100644
--- a/fs/nls/nls_cp437.c
+++ b/fs/nls/nls_cp437.c
@@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -372,8 +384,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c
index f5a8b9e88165..2f2ab91340e7 100644
--- a/fs/nls/nls_cp737.c
+++ b/fs/nls/nls_cp737.c
@@ -326,7 +326,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -335,8 +347,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c
index d268bfb873e4..92f311e620f3 100644
--- a/fs/nls/nls_cp775.c
+++ b/fs/nls/nls_cp775.c
@@ -295,7 +295,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -304,8 +316,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c
index b698b0df65e3..77cdce20ced6 100644
--- a/fs/nls/nls_cp850.c
+++ b/fs/nls/nls_cp850.c
@@ -291,7 +291,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -300,8 +312,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c
index 738e95346b34..47722904e9f1 100644
--- a/fs/nls/nls_cp852.c
+++ b/fs/nls/nls_cp852.c
@@ -313,7 +313,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -322,8 +334,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c
index 9a1c4e307cb1..b52709886900 100644
--- a/fs/nls/nls_cp855.c
+++ b/fs/nls/nls_cp855.c
@@ -275,7 +275,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -284,8 +296,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c
index 782e31cb9f5a..fcdf30a540f8 100644
--- a/fs/nls/nls_cp857.c
+++ b/fs/nls/nls_cp857.c
@@ -277,7 +277,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -286,8 +298,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c
index 2ad1954b84e6..a1504424e923 100644
--- a/fs/nls/nls_cp860.c
+++ b/fs/nls/nls_cp860.c
@@ -340,7 +340,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -349,8 +361,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c
index 5930b0e6e8f1..9fa1f54cee0d 100644
--- a/fs/nls/nls_cp861.c
+++ b/fs/nls/nls_cp861.c
@@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -372,8 +384,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c
index 63c27b24a011..00474e2b2102 100644
--- a/fs/nls/nls_cp862.c
+++ b/fs/nls/nls_cp862.c
@@ -397,7 +397,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -406,8 +418,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c
index aa815cdc7481..908e573c1c42 100644
--- a/fs/nls/nls_cp863.c
+++ b/fs/nls/nls_cp863.c
@@ -357,7 +357,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -366,8 +378,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c
index a20725f661e9..6cae9e9c73aa 100644
--- a/fs/nls/nls_cp864.c
+++ b/fs/nls/nls_cp864.c
@@ -383,7 +383,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -392,8 +404,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c
index 3d22ec2bd7af..5aa6415ec357 100644
--- a/fs/nls/nls_cp865.c
+++ b/fs/nls/nls_cp865.c
@@ -363,7 +363,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -372,8 +384,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c
index 35dc7b2f023a..f24b73839680 100644
--- a/fs/nls/nls_cp866.c
+++ b/fs/nls/nls_cp866.c
@@ -281,7 +281,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -290,8 +302,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c
index 56504ab0f405..c2ba80140906 100644
--- a/fs/nls/nls_cp869.c
+++ b/fs/nls/nls_cp869.c
@@ -291,7 +291,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -300,8 +312,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c
index 41394620d000..844bb205deee 100644
--- a/fs/nls/nls_cp874.c
+++ b/fs/nls/nls_cp874.c
@@ -249,7 +249,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -258,8 +270,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c
index 25fe26fb2603..0a5db2a0a6b3 100644
--- a/fs/nls/nls_cp932.c
+++ b/fs/nls/nls_cp932.c
@@ -7907,7 +7907,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return -EINVAL;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -7916,8 +7928,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c
index 766f86b53a7b..6b0d725cdfab 100644
--- a/fs/nls/nls_cp936.c
+++ b/fs/nls/nls_cp936.c
@@ -11085,7 +11085,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -11094,8 +11106,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c
index 138eec74bb3f..292c2d02d2c2 100644
--- a/fs/nls/nls_cp949.c
+++ b/fs/nls/nls_cp949.c
@@ -13920,7 +13920,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -13929,8 +13941,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c
index 899da09fe0d7..d4e35bfd8dbd 100644
--- a/fs/nls/nls_cp950.c
+++ b/fs/nls/nls_cp950.c
@@ -9456,7 +9456,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -9465,8 +9477,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_default.c b/fs/nls/nls_default.c
index ef8c0efb8a3c..602eeec24b3d 100644
--- a/fs/nls/nls_default.c
+++ b/fs/nls/nls_default.c
@@ -447,7 +447,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -455,8 +467,6 @@ static const struct nls_ops charset_ops = {
static struct nls_table default_table = {
.charset = &default_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

struct nls_charset default_charset = {
diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c
index 8bc5d9991452..b3a81350cbea 100644
--- a/fs/nls/nls_euc-jp.c
+++ b/fs/nls/nls_euc-jp.c
@@ -549,7 +549,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return euc_offset;
}

-static const struct nls_ops charset_ops = {
+static struct nls_ops charset_ops = {
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -570,8 +570,9 @@ static int __init init_nls_euc_jp(void)
p_nls = load_nls("cp932");

if (p_nls) {
- table.charset2upper = p_nls->charset2upper;
- table.charset2lower = p_nls->charset2lower;
+
+ charset_ops.uppercase = p_nls->ops->uppercase;
+ charset_ops.lowercase = p_nls->ops->lowercase;
return register_nls(&nls_charset);
}

diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c
index 78e9c0169f69..a98298bd5de5 100644
--- a/fs/nls/nls_iso8859-1.c
+++ b/fs/nls/nls_iso8859-1.c
@@ -233,7 +233,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -242,8 +254,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c
index eb8665629e0f..811f4cf1d1a3 100644
--- a/fs/nls/nls_iso8859-13.c
+++ b/fs/nls/nls_iso8859-13.c
@@ -261,7 +261,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -270,8 +282,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c
index c8d5a48f869c..d8dafca31d26 100644
--- a/fs/nls/nls_iso8859-14.c
+++ b/fs/nls/nls_iso8859-14.c
@@ -317,7 +317,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -326,8 +338,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c
index 0611c6cb56b4..9de12c9e25a3 100644
--- a/fs/nls/nls_iso8859-15.c
+++ b/fs/nls/nls_iso8859-15.c
@@ -283,7 +283,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -292,8 +304,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c
index 5255d92a25eb..c59e2424f2b5 100644
--- a/fs/nls/nls_iso8859-2.c
+++ b/fs/nls/nls_iso8859-2.c
@@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -293,8 +305,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c
index ad1b84f3e102..4bab1b607059 100644
--- a/fs/nls/nls_iso8859-3.c
+++ b/fs/nls/nls_iso8859-3.c
@@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -293,8 +305,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c
index 82469deee0ba..1a3cf5f507f6 100644
--- a/fs/nls/nls_iso8859-4.c
+++ b/fs/nls/nls_iso8859-4.c
@@ -284,7 +284,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -293,8 +305,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c
index 3f3cd0c28797..0a26cea9d578 100644
--- a/fs/nls/nls_iso8859-5.c
+++ b/fs/nls/nls_iso8859-5.c
@@ -248,7 +248,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -257,8 +269,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c
index 43e6675998bc..d5a230888eed 100644
--- a/fs/nls/nls_iso8859-6.c
+++ b/fs/nls/nls_iso8859-6.c
@@ -239,7 +239,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -248,8 +260,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c
index 83893e487f82..a5a171849ae4 100644
--- a/fs/nls/nls_iso8859-7.c
+++ b/fs/nls/nls_iso8859-7.c
@@ -293,7 +293,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -302,8 +314,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c
index df03f97cd9d1..795093547cd6 100644
--- a/fs/nls/nls_iso8859-9.c
+++ b/fs/nls/nls_iso8859-9.c
@@ -248,7 +248,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -257,8 +269,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c
index 22918e154dbe..bbce9a608419 100644
--- a/fs/nls/nls_koi8-r.c
+++ b/fs/nls/nls_koi8-r.c
@@ -299,7 +299,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -308,8 +320,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c
index f4edbc313706..d3e946652bf6 100644
--- a/fs/nls/nls_koi8-ru.c
+++ b/fs/nls/nls_koi8-ru.c
@@ -51,7 +51,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

-static const struct nls_ops charset_ops = {
+static struct nls_ops charset_ops = {
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -72,8 +72,8 @@ static int __init init_nls_koi8_ru(void)
p_nls = load_nls("koi8-u");

if (p_nls) {
- table.charset2upper = p_nls->charset2upper;
- table.charset2lower = p_nls->charset2lower;
+ charset_ops.uppercase = p_nls->ops->uppercase;
+ charset_ops.lowercase = p_nls->ops->lowercase;
return register_nls(&nls_charset);
}

diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c
index b2421625e98b..5de52a74f0b3 100644
--- a/fs/nls/nls_koi8-u.c
+++ b/fs/nls/nls_koi8-u.c
@@ -306,7 +306,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return charset2lower[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return charset2upper[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -315,8 +327,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = charset2lower,
- .charset2upper = charset2upper,
};

static struct nls_charset nls_charset = {
diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c
index aecf460827ac..fe1ac5efaa37 100644
--- a/fs/nls/nls_utf8.c
+++ b/fs/nls/nls_utf8.c
@@ -40,7 +40,19 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return n;
}

+static unsigned char charset_tolower(const struct nls_table *table,
+ unsigned int c){
+ return identity[c];
+}
+
+static unsigned char charset_toupper(const struct nls_table *table,
+ unsigned int c) {
+ return identity[c];
+}
+
static const struct nls_ops charset_ops = {
+ .lowercase = charset_tolower,
+ .uppercase = charset_toupper,
.uni2char = uni2char,
.char2uni = char2uni,
};
@@ -49,8 +61,6 @@ static struct nls_charset nls_charset;
static struct nls_table table = {
.charset = &nls_charset,
.ops = &charset_ops,
- .charset2lower = identity, /* no conversion */
- .charset2upper = identity,
};

static struct nls_charset nls_charset = {
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 9f61015a54bf..c43746bd390e 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -38,6 +38,10 @@ struct nls_ops {
**/
int (*validate)(const struct nls_table *charset,
const unsigned char *str, size_t len);
+ unsigned char (*lowercase)(const struct nls_table *charset,
+ unsigned int c);
+ unsigned char (*uppercase)(const struct nls_table *charset,
+ unsigned int c);
};

struct nls_table {
@@ -46,9 +50,8 @@ struct nls_table {
unsigned int flags;

const struct nls_ops *ops;
- const unsigned char *charset2lower;
- const unsigned char *charset2upper;
struct nls_table *next;
+
};

struct nls_charset {
@@ -120,16 +123,18 @@ static inline const char *nls_charset_name(const struct nls_table *table)
return table->charset->charset;
}

-static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c)
+static inline unsigned char nls_tolower(const struct nls_table *t,
+ unsigned char c)
{
- unsigned char nc = t->charset2lower[c];
+ unsigned char nc = t->ops->lowercase(t, c);

return nc ? nc : c;
}

-static inline unsigned char nls_toupper(struct nls_table *t, unsigned char c)
+static inline unsigned char nls_toupper(const struct nls_table *t,
+ unsigned char c)
{
- unsigned char nc = t->charset2upper[c];
+ unsigned char nc = t->ops->uppercase(t, c);

return nc ? nc : c;
}
--
2.20.0.rc2

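[Editorial sketch, not part of the patch] The hunks above replace the
per-table charset2lower/charset2upper arrays with lowercase/uppercase hooks in
struct nls_ops, and nls_tolower()/nls_toupper() keep their "nc ? nc : c"
fallback. The following user-space illustration of that dispatch pattern may
help when reading the diff. All demo_* and ascii_* names are hypothetical
stand-ins, not kernel symbols, and the ASCII-only mapping is an assumption made
to keep the example small.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins for struct nls_table / struct nls_ops. */
struct demo_table;

struct demo_ops {
	unsigned char (*lowercase)(const struct demo_table *t, unsigned int c);
	unsigned char (*uppercase)(const struct demo_table *t, unsigned int c);
};

struct demo_table {
	const struct demo_ops *ops;
};

/*
 * ASCII-only hooks (an assumption for this sketch). Returning 0 means
 * "no mapping", which the wrappers below turn into "keep the input byte".
 */
static unsigned char ascii_tolower(const struct demo_table *t, unsigned int c)
{
	(void)t;
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : 0;
}

static unsigned char ascii_toupper(const struct demo_table *t, unsigned int c)
{
	(void)t;
	return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : 0;
}

/* Each hook is wired to the function of the same direction. */
static const struct demo_ops ascii_ops = {
	.lowercase = ascii_tolower,
	.uppercase = ascii_toupper,
};

static const struct demo_table ascii_table = {
	.ops = &ascii_ops,
};

/* Same shape as the patched nls_tolower(): dispatch, then fall back to c. */
static inline unsigned char demo_tolower(const struct demo_table *t,
					 unsigned char c)
{
	unsigned char nc = t->ops->lowercase(t, c);

	return nc ? nc : c;
}

/* Same shape as the patched nls_toupper(). */
static inline unsigned char demo_toupper(const struct demo_table *t,
					 unsigned char c)
{
	unsigned char nc = t->ops->uppercase(t, c);

	return nc ? nc : c;
}
```

With these definitions, demo_tolower(&ascii_table, 'A') yields 'a', while a
byte with no mapping such as '1' is returned unchanged by either wrapper. One
consequence of going through the ops pointer is that a charset which borrows
another table's case mapping copies two function pointers instead of two array
pointers, as the euc-jp and koi8-ru hunks do.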
2018-12-07 18:41:16

by Randy Dunlap

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On 12/6/18 3:08 PM, Gabriel Krisman Bertazi wrote:
> Hi,
>
> [Resending to include fsdevel, as requested by Dave Chinner]
>
> Following the e2fsprogs changes, these are the corresponding kernel-side
> modifications to support the fname_encoding feature.
>
> The patches are split in two parts. The first 14 patches are refactoring
> and improvements to the NLS code, including the utf8 normalization
> support. The final patches implement the fname_encoding feature in ext4.

Hi,

Please include some justification and use case(s) in the patch description.

Thanks.

> To test this feature, you need to use the tip of e2fsprogs branch, which
> already includes support for enabling this feature.
>
> As usual, the ucd files are not included in this email because they are
> too large, and would actually cause the email message to bounce.
>
> There are two test files for this in a private xfstests branch, that I
> plan to submit upstream once we get this series merged:
>
> https://gitlab.collabora.com/krisman/xfstests.git -b encoding_v4
>
> I also tested this with the xfstests smoke tests using two scenarios:
> (1) a non-encoding TEST_DEV; (2) a utf8 enabled TEST_DEV. In both
> cases, no unrelated regressions were observed. With my branch of
> xfstests above, that fixes some related tests, I didn't observe any
> regressions.
>
> Gabriel Krisman Bertazi (19):
> nls: Wrap uni2char/char2uni callers
> nls: Wrap charset field access
> nls: Wrap charset hooks in ops structure
> nls: Split default charset from NLS core
> nls: Split struct nls_charset from struct nls_table
> nls: Add support for multiple versions of an encoding
> nls: Implement NLS_STRICT_MODE flag
> nls: Let charsets define the behavior of tolower/toupper
> nls: Add new interface for string comparisons
> nls: Add optional normalization and casefold hooks
> nls: ascii: Support validation and normalization operations
> nls: utf8: Move nls-utf8{,-core}.c
> nls: utf8: Integrate utf8 normalization code with utf8 charset
> nls: utf8: Introduce test module for normalized utf8 implementation
> ext4: Reserve superblock fields for encoding information
> ext4: Include encoding information in the superblock
> ext4: Support encoding-aware file name lookups
> ext4: Implement EXT4_CASEFOLD_FL flag
> docs: ext4.rst: Document encoding and case-insensitive
>
> Olaf Weber (4):
> nls: utf8: Add unicode character database files
> scripts: add trie generator for UTF-8
> nls: utf8: Introduce code for UTF-8 normalization
> nls: utf8n: reduce the size of utf8data[]
>
> Documentation/admin-guide/ext4.rst | 29 +
> fs/befs/linuxvfs.c | 8 +-
> fs/cifs/cifs_unicode.c | 15 +-
> fs/cifs/cifsfs.c | 2 +-
> fs/cifs/connect.c | 2 +-
> fs/cifs/dir.c | 7 +-
> fs/ext4/dir.c | 59 +
> fs/ext4/ext4.h | 33 +-
> fs/ext4/hash.c | 38 +-
> fs/ext4/ialloc.c | 2 +-
> fs/ext4/inline.c | 2 +-
> fs/ext4/inode.c | 4 +-
> fs/ext4/ioctl.c | 18 +
> fs/ext4/namei.c | 85 +-
> fs/ext4/super.c | 83 +
> fs/fat/dir.c | 13 +-
> fs/fat/inode.c | 6 +-
> fs/fat/namei_vfat.c | 6 +-
> fs/hfs/super.c | 6 +-
> fs/hfs/trans.c | 9 +-
> fs/hfsplus/options.c | 2 +-
> fs/hfsplus/unicode.c | 6 +-
> fs/isofs/inode.c | 5 +-
> fs/isofs/joliet.c | 3 +-
> fs/jfs/jfs_unicode.c | 9 +-
> fs/jfs/super.c | 3 +-
> fs/nls/Kconfig | 15 +
> fs/nls/Makefile | 20 +
> fs/nls/mac-celtic.c | 34 +-
> fs/nls/mac-centeuro.c | 34 +-
> fs/nls/mac-croatian.c | 34 +-
> fs/nls/mac-cyrillic.c | 34 +-
> fs/nls/mac-gaelic.c | 34 +-
> fs/nls/mac-greek.c | 34 +-
> fs/nls/mac-iceland.c | 34 +-
> fs/nls/mac-inuit.c | 34 +-
> fs/nls/mac-roman.c | 34 +-
> fs/nls/mac-romanian.c | 34 +-
> fs/nls/mac-turkish.c | 34 +-
> fs/nls/nls_ascii.c | 84 +-
> fs/nls/nls_core.c | 163 ++
> fs/nls/nls_cp1250.c | 34 +-
> fs/nls/nls_cp1251.c | 34 +-
> fs/nls/nls_cp1255.c | 36 +-
> fs/nls/nls_cp437.c | 34 +-
> fs/nls/nls_cp737.c | 34 +-
> fs/nls/nls_cp775.c | 34 +-
> fs/nls/nls_cp850.c | 34 +-
> fs/nls/nls_cp852.c | 34 +-
> fs/nls/nls_cp855.c | 34 +-
> fs/nls/nls_cp857.c | 34 +-
> fs/nls/nls_cp860.c | 34 +-
> fs/nls/nls_cp861.c | 34 +-
> fs/nls/nls_cp862.c | 34 +-
> fs/nls/nls_cp863.c | 34 +-
> fs/nls/nls_cp864.c | 34 +-
> fs/nls/nls_cp865.c | 34 +-
> fs/nls/nls_cp866.c | 34 +-
> fs/nls/nls_cp869.c | 34 +-
> fs/nls/nls_cp874.c | 36 +-
> fs/nls/nls_cp932.c | 36 +-
> fs/nls/nls_cp936.c | 36 +-
> fs/nls/nls_cp949.c | 36 +-
> fs/nls/nls_cp950.c | 36 +-
> fs/nls/{nls_base.c => nls_default.c} | 124 +-
> fs/nls/nls_euc-jp.c | 29 +-
> fs/nls/nls_iso8859-1.c | 34 +-
> fs/nls/nls_iso8859-13.c | 34 +-
> fs/nls/nls_iso8859-14.c | 34 +-
> fs/nls/nls_iso8859-15.c | 34 +-
> fs/nls/nls_iso8859-2.c | 34 +-
> fs/nls/nls_iso8859-3.c | 34 +-
> fs/nls/nls_iso8859-4.c | 34 +-
> fs/nls/nls_iso8859-5.c | 34 +-
> fs/nls/nls_iso8859-6.c | 34 +-
> fs/nls/nls_iso8859-7.c | 34 +-
> fs/nls/nls_iso8859-9.c | 34 +-
> fs/nls/nls_koi8-r.c | 34 +-
> fs/nls/nls_koi8-ru.c | 30 +-
> fs/nls/nls_koi8-u.c | 34 +-
> fs/nls/nls_utf8-core.c | 328 +++
> fs/nls/nls_utf8-norm.c | 797 ++++++
> fs/nls/nls_utf8-selftest.c | 316 +++
> fs/nls/nls_utf8.c | 67 -
> fs/nls/ucd/README | 34 +
> fs/nls/utf8n.h | 117 +
> fs/ntfs/inode.c | 2 +-
> fs/ntfs/super.c | 6 +-
> fs/ntfs/unistr.c | 13 +-
> fs/udf/super.c | 3 +-
> fs/udf/unicode.c | 4 +-
> include/linux/fs.h | 2 +
> include/linux/nls.h | 293 ++-
> scripts/Makefile | 1 +
> scripts/mkutf8data.c | 3392 ++++++++++++++++++++++++++
> 95 files changed, 7287 insertions(+), 618 deletions(-)
> create mode 100644 fs/nls/nls_core.c
> rename fs/nls/{nls_base.c => nls_default.c} (89%)
> create mode 100644 fs/nls/nls_utf8-core.c
> create mode 100644 fs/nls/nls_utf8-norm.c
> create mode 100644 fs/nls/nls_utf8-selftest.c
> delete mode 100644 fs/nls/nls_utf8.c
> create mode 100644 fs/nls/ucd/README
> create mode 100644 fs/nls/utf8n.h
> create mode 100644 scripts/mkutf8data.c
>


--
~Randy

2018-12-06 22:05:06

by Gabriel Krisman Bertazi

Subject: [PATCH v4 07/23] nls: Implement NLS_STRICT_MODE flag

From: Gabriel Krisman Bertazi <[email protected]>

The flag NLS_STRICT_MODE indicates whether NLS should reject invalid
characters or ignore them. Support for this relies on the .validate()
hook, which is implemented by each charset and states whether a given
string is valid within that charset.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
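Illustration only, not part of the commit: a minimal userspace sketch of the strict-mode contract described above. The structs are trimmed stand-ins for the kernel ones; ascii_validate(), validate_flags() and check_name() are hypothetical names. validate_flags() mirrors the check this patch adds as nls_validate_flags(). How a real caller treats invalid input outside strict mode is an assumption here (the name is simply accepted).

```c
#include <assert.h>
#include <stddef.h>

#define NLS_STRICT_MODE 0x00000001

struct nls_table;

/* Trimmed stand-in for struct nls_ops: only the validate hook. */
struct nls_ops {
	int (*validate)(const struct nls_table *charset,
			const unsigned char *str, size_t len);
};

/* Trimmed stand-in for struct nls_table. */
struct nls_table {
	unsigned int flags;
	const struct nls_ops *ops;
};

/* Hypothetical hook: 0 if every byte is 7-bit ASCII, non-zero otherwise. */
static int ascii_validate(const struct nls_table *charset,
			  const unsigned char *str, size_t len)
{
	size_t i;

	(void)charset;
	for (i = 0; i < len; i++)
		if (str[i] & 0x80)
			return -1;
	return 0;
}

/* Strict mode can only be requested if the charset can validate. */
static int validate_flags(const struct nls_table *table, unsigned int flags)
{
	if ((flags & NLS_STRICT_MODE) && !table->ops->validate)
		return -1;
	return 0;
}

/* Hypothetical caller: reject invalid names only in strict mode. */
static int check_name(const struct nls_table *t,
		      const unsigned char *name, size_t len)
{
	assert(t && t->ops);
	if (!(t->flags & NLS_STRICT_MODE))
		return 0;
	return t->ops->validate(t, name, len) ? -1 : 0;
}
```

In this sketch a table without a validate hook fails validate_flags() when strict mode is requested, which corresponds to nls_load_table() returning ERR_PTR(-EINVAL) in the patch.
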
fs/nls/nls_core.c | 11 +++++++++++
include/linux/nls.h | 25 +++++++++++++++++++++++++
2 files changed, 36 insertions(+)

diff --git a/fs/nls/nls_core.c b/fs/nls/nls_core.c
index 20e00a8b968c..49a15bb2174f 100644
--- a/fs/nls/nls_core.c
+++ b/fs/nls/nls_core.c
@@ -20,6 +20,14 @@ extern struct nls_charset default_charset;
static struct nls_charset *charsets = &default_charset;
static DEFINE_SPINLOCK(nls_lock);

+static int nls_validate_flags(struct nls_table *table, unsigned int flags)
+{
+ if (flags & NLS_STRICT_MODE && !table->ops->validate)
+ return -1;
+
+ return 0;
+}
+
static struct nls_table *nls_load_table(struct nls_charset *charset,
const char *version,
unsigned int flags)
@@ -37,6 +45,9 @@ static struct nls_table *nls_load_table(struct nls_charset *charset,
if (IS_ERR(tbl))
return tbl;

+ if (nls_validate_flags(tbl, flags) < 0)
+ return ERR_PTR(-EINVAL);
+
tbl->flags = flags;
return tbl;
}
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 91524bb4477b..9f61015a54bf 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -22,10 +22,22 @@ typedef u16 wchar_t;
/* Arbitrary Unicode character */
typedef u32 unicode_t;

+struct nls_table;
+
struct nls_ops {
int (*uni2char) (wchar_t uni, unsigned char *out, int boundlen);
int (*char2uni) (const unsigned char *rawstring, int boundlen,
wchar_t *uni);
+ /**
+ * @validate:
+ *
+ * Returns 0 if the argument is a valid string in this charset.
+ * Otherwise, return non-zero.
+ *
+ * This is required iff the charset supports strict mode.
+ **/
+ int (*validate)(const struct nls_table *charset,
+ const unsigned char *str, size_t len);
};

struct nls_table {
@@ -59,6 +71,13 @@ enum utf16_endian {
UTF16_BIG_ENDIAN
};

+#define NLS_STRICT_MODE 0x00000001
+
+static inline int IS_STRICT_MODE(const struct nls_table *charset)
+{
+ return (charset->flags & NLS_STRICT_MODE);
+}
+
/* nls_base.c */
extern int __register_nls(struct nls_charset *, struct module *);
extern int unregister_nls(struct nls_charset *);
@@ -90,6 +109,12 @@ static inline int nls_char2uni(const struct nls_table *table,
return table->ops->char2uni(rawstring, boundlen, uni);
}

+static inline int nls_validate(const struct nls_table *t, const unsigned char *str,
+ const size_t len)
+{
+ return t->ops->validate(t, str, len);
+}
+
static inline const char *nls_charset_name(const struct nls_table *table)
{
return table->charset->charset;
--
2.20.0.rc2

2018-12-06 22:05:35

by Gabriel Krisman Bertazi

Subject: [PATCH v4 13/23] scripts: add trie generator for UTF-8

From: Olaf Weber <[email protected]>

mkutf8data.c is the source for a program that generates utf8data.h, which
contains the trie that utf8norm.c uses. The trie is generated from the
Unicode 11.0.0 data files. The format of the utf8data[] table is described
in utf8norm.c, which is added in the next patch.

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to mainline]
[Fix out-of-tree-build]
[Fix checkpatch warnings]
[Merge back robustness fixes from original patch. Requested by Dave Chinner]
[Update makefile to build 11.0.0 ucd files]
[drop references to xfs]
---
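Illustration only, not part of the commit: a small sketch of how the first byte of an internal trie node can be decoded, following the utf8trie_t comment in mkutf8data.c. The masks are copied from the patch; struct node_hdr and decode_node() are hypothetical helpers that exist only in this example.

```c
#include <assert.h>

typedef unsigned char utf8trie_t;

/* Masks as defined in scripts/mkutf8data.c. */
#define BITNUM		0x07
#define NEXTBYTE	0x08
#define OFFLEN		0x30
#define OFFLEN_SHIFT	4
#define RIGHTPATH	0x40
#define TRIENODE	0x80
#define RIGHTNODE	0x40
#define LEFTNODE	0x80

/* Hypothetical decoded view of a node's first byte. */
struct node_hdr {
	int bitnum;	/* bit of the key byte to test */
	int nextbyte;	/* advance to the next key byte first */
	int offlen;	/* number of offset bytes that follow */
	int rightpath;	/* meaningful only if offlen == 0 */
	int trienode;	/* meaningful only if offlen == 0 */
	int leftnode;	/* meaningful only if offlen != 0 */
	int rightnode;	/* meaningful only if offlen != 0 */
};

static struct node_hdr decode_node(utf8trie_t b)
{
	struct node_hdr h = { 0 };

	h.bitnum = b & BITNUM;
	h.nextbyte = !!(b & NEXTBYTE);
	h.offlen = (b & OFFLEN) >> OFFLEN_SHIFT;
	/* A 2 bit field: at most three offset bytes. */
	assert(h.offlen <= 3);

	if (h.offlen == 0) {
		/* Non-branching node: bits 6-7 describe the single child. */
		h.rightpath = !!(b & RIGHTPATH);
		h.trienode = !!(b & TRIENODE);
	} else {
		/* Branching node: bits 6-7 describe both children. */
		h.leftnode = !!(b & LEFTNODE);
		h.rightnode = !!(b & RIGHTNODE);
	}
	return h;
}
```

Bits 6 and 7 are shared between the two interpretations, which is why RIGHTPATH/RIGHTNODE and TRIENODE/LEFTNODE have the same values; OFFLEN selects which reading applies.
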
fs/nls/Kconfig | 10 +
fs/nls/Makefile | 13 +
scripts/Makefile | 1 +
scripts/mkutf8data.c | 3168 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 3192 insertions(+)
create mode 100644 scripts/mkutf8data.c
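
Illustration only, not part of the patch: a sketch of the version packing that mkutf8data.c uses for Unicode ages. The shifts and UNICODE_AGE() are copied from the patch; pack_age() and the age_*() accessors are hypothetical helpers. The range checks in pack_age() correspond to what age_valid() tests.

```c
#include <assert.h>

#define UNICODE_MAJ_SHIFT		(16)
#define UNICODE_MIN_SHIFT		(8)

#define UNICODE_AGE(MAJ,MIN,REV)			\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |	\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |	\
	 ((unsigned int)(REV)))

/* Hypothetical helper: pack after checking each field fits. */
static unsigned int pack_age(unsigned int major, unsigned int minor,
			     unsigned int revision)
{
	assert(major <= 0xffff);	/* 16 bits for the major number */
	assert(minor <= 0xff);		/* 8 bits for the minor number */
	assert(revision <= 0xff);	/* 8 bits for the revision */
	return UNICODE_AGE(major, minor, revision);
}

/* Hypothetical accessors that undo the packing. */
static unsigned int age_major(unsigned int age)
{
	return age >> UNICODE_MAJ_SHIFT;
}

static unsigned int age_minor(unsigned int age)
{
	return (age >> UNICODE_MIN_SHIFT) & 0xff;
}

static unsigned int age_revision(unsigned int age)
{
	return age & 0xff;
}
```

Because the major number sits in the high bits, packed ages compare in version order, so plain integer comparison is enough to tell whether one Unicode version is newer than another.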

diff --git a/fs/nls/Kconfig b/fs/nls/Kconfig
index e2ce79ef48c4..7cb2848da608 100644
--- a/fs/nls/Kconfig
+++ b/fs/nls/Kconfig
@@ -616,4 +616,14 @@ config NLS_UTF8
input/output character sets. Say Y here for the UTF-8 encoding of
the Unicode/ISO9646 universal character set.

+#
+# utf8 normalization
+#
+config NLS_UTF8_NORMALIZATION
+ bool "UTF-8 normalization and casefolding support"
+ depends on NLS_UTF8
+ help
+ Say Y here to enable utf8 NFKD normalization and casefolding
+ support.
+
endif # NLS
diff --git a/fs/nls/Makefile b/fs/nls/Makefile
index 5f42ceff9d15..840e06aefd47 100644
--- a/fs/nls/Makefile
+++ b/fs/nls/Makefile
@@ -55,3 +55,16 @@ obj-$(CONFIG_NLS_MAC_INUIT) += mac-inuit.o
obj-$(CONFIG_NLS_MAC_ROMANIAN) += mac-romanian.o
obj-$(CONFIG_NLS_MAC_ROMAN) += mac-roman.o
obj-$(CONFIG_NLS_MAC_TURKISH) += mac-turkish.o
+
+$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
+ $(call cmd,mkutf8data)
+quiet_cmd_mkutf8data = MKUTF8DATA $@
+ cmd_mkutf8data = $(objtree)/scripts/mkutf8data \
+ -a $(srctree)/$(src)/ucd/DerivedAge-11.0.0.txt \
+ -c $(srctree)/$(src)/ucd/DerivedCombiningClass-11.0.0.txt \
+ -p $(srctree)/$(src)/ucd/DerivedCoreProperties-11.0.0.txt \
+ -d $(srctree)/$(src)/ucd/UnicodeData-11.0.0.txt \
+ -f $(srctree)/$(src)/ucd/CaseFolding-11.0.0.txt \
+ -n $(srctree)/$(src)/ucd/NormalizationCorrections-11.0.0.txt \
+ -t $(srctree)/$(src)/ucd/NormalizationTest-11.0.0.txt \
+ -o $@
diff --git a/scripts/Makefile b/scripts/Makefile
index ece52ff20171..b36208c62c17 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -20,6 +20,7 @@ hostprogs-$(CONFIG_ASN1) += asn1_compiler
hostprogs-$(CONFIG_MODULE_SIG) += sign-file
hostprogs-$(CONFIG_SYSTEM_TRUSTED_KEYRING) += extract-cert
hostprogs-$(CONFIG_SYSTEM_EXTRA_CERTIFICATE) += insert-sys-cert
+hostprogs-$(CONFIG_NLS_UTF8_NORMALIZATION) += mkutf8data

HOSTCFLAGS_sortextable.o = -I$(srctree)/tools/include
HOSTCFLAGS_asn1_compiler.o = -I$(srctree)/include
diff --git a/scripts/mkutf8data.c b/scripts/mkutf8data.c
new file mode 100644
index 000000000000..26794053d0d4
--- /dev/null
+++ b/scripts/mkutf8data.c
@@ -0,0 +1,3168 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write the Free Software Foundation,
+ * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+/* Generator for a compact trie for unicode normalization */
+
+#include <sys/types.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <assert.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+
+/* Default names of the in- and output files. */
+
+#define AGE_NAME "DerivedAge.txt"
+#define CCC_NAME "DerivedCombiningClass.txt"
+#define PROP_NAME "DerivedCoreProperties.txt"
+#define DATA_NAME "UnicodeData.txt"
+#define FOLD_NAME "CaseFolding.txt"
+#define NORM_NAME "NormalizationCorrections.txt"
+#define TEST_NAME "NormalizationTest.txt"
+#define UTF8_NAME "utf8data.h"
+
+const char *age_name = AGE_NAME;
+const char *ccc_name = CCC_NAME;
+const char *prop_name = PROP_NAME;
+const char *data_name = DATA_NAME;
+const char *fold_name = FOLD_NAME;
+const char *norm_name = NORM_NAME;
+const char *test_name = TEST_NAME;
+const char *utf8_name = UTF8_NAME;
+
+int verbose = 0;
+
+/* An arbitrary line size limit on input lines. */
+
+#define LINESIZE 1024
+char line[LINESIZE];
+char buf0[LINESIZE];
+char buf1[LINESIZE];
+char buf2[LINESIZE];
+char buf3[LINESIZE];
+
+const char *argv0;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode version numbers consist of three parts: major, minor, and a
+ * revision. These numbers are packed into an unsigned int to obtain
+ * a single version number.
+ *
+ * To save space in the generated trie, the unicode version is not
+ * stored directly, instead we calculate a generation number from the
+ * unicode versions seen in the DerivedAge file, and use that as an
+ * index into a table of unicode versions.
+ */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_MAJ_MAX ((unsigned short)-1)
+#define UNICODE_MIN_MAX ((unsigned char)-1)
+#define UNICODE_REV_MAX ((unsigned char)-1)
+
+#define UNICODE_AGE(MAJ,MIN,REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+unsigned int *ages;
+int ages_count;
+
+unsigned int unicode_maxage;
+
+static int age_valid(unsigned int major, unsigned int minor,
+ unsigned int revision)
+{
+ if (major > UNICODE_MAJ_MAX)
+ return 0;
+ if (minor > UNICODE_MIN_MAX)
+ return 0;
+ if (revision > UNICODE_REV_MAX)
+ return 0;
+ return 1;
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype, unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ */
+typedef unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char*)((LEAF) + 2))
+
+#define MAXGEN (255)
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+struct tree;
+static utf8leaf_t *utf8nlookup(struct tree *, const char *, size_t);
+static utf8leaf_t *utf8lookup(struct tree *, const char *);
+
+unsigned char *utf8data;
+size_t utf8data_size;
+
+utf8trie_t *nfkdi;
+utf8trie_t *nfkdicf;
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * UTF8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7f: 0 0x7f
+ * 0x80 - 0x7ff: 0xc2 0x80 0xdf 0xbf
+ * 0x800 - 0xffff: 0xe0 0xa0 0x80 0xef 0xbf 0xbf
+ * 0x10000 - 0x10ffff: 0xf0 0x90 0x80 0x80 0xf4 0x8f 0xbf 0xbf
+ *
+ * Even within those ranges not all values are allowed: the surrogates
+ * 0xd800 - 0xdfff should never be seen.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+#define UTF8_2_BITS 0xC0
+#define UTF8_3_BITS 0xE0
+#define UTF8_4_BITS 0xF0
+#define UTF8_N_BITS 0x80
+#define UTF8_2_MASK 0xE0
+#define UTF8_3_MASK 0xF0
+#define UTF8_4_MASK 0xF8
+#define UTF8_N_MASK 0xC0
+#define UTF8_V_MASK 0x3F
+#define UTF8_V_SHIFT 6
+
+static int utf8encode(char *str, unsigned int val)
+{
+ int len;
+
+ if (val < 0x80) {
+ str[0] = val;
+ len = 1;
+ } else if (val < 0x800) {
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_2_BITS;
+ len = 2;
+ } else if (val < 0x10000) {
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_3_BITS;
+ len = 3;
+ } else if (val < 0x110000) {
+ str[3] = val & UTF8_V_MASK;
+ str[3] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[2] = val & UTF8_V_MASK;
+ str[2] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[1] = val & UTF8_V_MASK;
+ str[1] |= UTF8_N_BITS;
+ val >>= UTF8_V_SHIFT;
+ str[0] = val;
+ str[0] |= UTF8_4_BITS;
+ len = 4;
+ } else {
+ printf("%#x: illegal val\n", val);
+ len = 0;
+ }
+ return len;
+}
+
+static unsigned int utf8decode(const char *str)
+{
+ const unsigned char *s = (const unsigned char*)str;
+ unsigned int unichar = 0;
+
+ if (*s < 0x80) {
+ unichar = *s;
+ } else if (*s < UTF8_3_BITS) {
+ unichar = *s++ & 0x1F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else if (*s < UTF8_4_BITS) {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ } else {
+ unichar = *s++ & 0x0F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s++ & 0x3F;
+ unichar <<= UTF8_V_SHIFT;
+ unichar |= *s & 0x3F;
+ }
+ return unichar;
+}
+
+static int utf32valid(unsigned int unichar)
+{
+ return unichar < 0x110000;
+}
+
+#define NODE 1
+#define LEAF 0
+
+struct tree {
+ void *root;
+ int childnode;
+ const char *type;
+ unsigned int maxage;
+ struct tree *next;
+ int (*leaf_equal)(void *, void *);
+ void (*leaf_print)(void *, int);
+ int (*leaf_mark)(void *);
+ int (*leaf_size)(void *);
+ int *(*leaf_index)(struct tree *, void *);
+ unsigned char *(*leaf_emit)(void *, unsigned char *);
+ int leafindex[0x110000];
+ int index;
+};
+
+struct node {
+ int index;
+ int offset;
+ int mark;
+ int size;
+ struct node *parent;
+ void *left;
+ void *right;
+ unsigned char bitnum;
+ unsigned char nextbyte;
+ unsigned char leftnode;
+ unsigned char rightnode;
+ unsigned int keybits;
+ unsigned int keymask;
+};
+
+/*
+ * Example lookup function for a tree.
+ */
+static void *lookup(struct tree *tree, const char *key)
+{
+ struct node *node;
+ void *leaf = NULL;
+
+ node = tree->root;
+ while (!leaf && node) {
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7))) {
+ /* Right leg */
+ if (node->rightnode == NODE) {
+ node = node->right;
+ } else if (node->rightnode == LEAF) {
+ leaf = node->right;
+ } else {
+ node = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (node->leftnode == NODE) {
+ node = node->left;
+ } else if (node->leftnode == LEAF) {
+ leaf = node->left;
+ } else {
+ node = NULL;
+ }
+ }
+ }
+
+ return leaf;
+}
+
+/*
+ * A simple non-recursive tree walker: keep track of visits to the
+ * left and right branches in the leftmask and rightmask.
+ */
+static void tree_walk(struct tree *tree)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int indent = 1;
+ int nodes, singletons, leaves;
+
+ nodes = singletons = leaves = 0;
+
+ printf("%s_%x root %p\n", tree->type, tree->maxage, tree->root);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_print(tree->root, indent);
+ leaves = 1;
+ } else {
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ printf("%*snode @ %p bitnum %d nextbyte %d"
+ " left %p right %p mask %x bits %x\n",
+ indent, "", node,
+ node->bitnum, node->nextbyte,
+ node->left, node->right,
+ node->keymask, node->keybits);
+ nodes += 1;
+ if (!(node->left && node->right))
+ singletons += 1;
+
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ tree->leaf_print(node->left,
+ indent+1);
+ leaves += 1;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ tree->leaf_print(node->right,
+ indent+1);
+ leaves += 1;
+ } else if (node->right) {
+ assert(node->rightnode==NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+ }
+ printf("nodes %d leaves %d singletons %d\n",
+ nodes, leaves, singletons);
+}
+
+/*
+ * Allocate and initialize a new internal node.
+ */
+static struct node *alloc_node(struct node *parent)
+{
+ struct node *node;
+ int bitnum;
+
+ node = malloc(sizeof(*node));
+ node->left = node->right = NULL;
+ node->parent = parent;
+ node->leftnode = NODE;
+ node->rightnode = NODE;
+ node->keybits = 0;
+ node->keymask = 0;
+ node->mark = 0;
+ node->index = 0;
+ node->offset = -1;
+ node->size = 4;
+
+ if (node->parent) {
+ bitnum = parent->bitnum;
+ if ((bitnum & 7) == 0) {
+ node->bitnum = bitnum + 7 + 8;
+ node->nextbyte = 1;
+ } else {
+ node->bitnum = bitnum - 1;
+ node->nextbyte = 0;
+ }
+ } else {
+ node->bitnum = 7;
+ node->nextbyte = 0;
+ }
+
+ return node;
+}
+
+/*
+ * Insert a new leaf into the tree, and collapse any subtrees that are
+ * fully populated and end in identical leaves. A nextbyte tagged
+ * internal node will not be removed to preserve the tree's integrity.
+ * Note that due to the structure of utf8, no nextbyte tagged node
+ * will be a candidate for removal.
+ */
+static int insert(struct tree *tree, char *key, int keylen, void *leaf)
+{
+ struct node *node;
+ struct node *parent;
+ void **cursor;
+ int keybits;
+
+ assert(keylen >= 1 && keylen <= 4);
+
+ node = NULL;
+ cursor = &tree->root;
+ keybits = 8 * keylen;
+
+ /* Insert, creating path along the way. */
+ while (keybits) {
+ if (!*cursor)
+ *cursor = alloc_node(node);
+ node = *cursor;
+ if (node->nextbyte)
+ key++;
+ if (*key & (1 << (node->bitnum & 7)))
+ cursor = &node->right;
+ else
+ cursor = &node->left;
+ keybits--;
+ }
+ *cursor = leaf;
+
+ /* Merge subtrees if possible. */
+ while (node) {
+ if (*key & (1 << (node->bitnum & 7)))
+ node->rightnode = LEAF;
+ else
+ node->leftnode = LEAF;
+ if (node->nextbyte)
+ break;
+ if (node->leftnode == NODE || node->rightnode == NODE)
+ break;
+ assert(node->left);
+ assert(node->right);
+ /* Compare */
+ if (! tree->leaf_equal(node->left, node->right))
+ break;
+ /* Keep left, drop right leaf. */
+ leaf = node->left;
+ /* Check in parent */
+ parent = node->parent;
+ if (!parent) {
+ /* root of tree! */
+ tree->root = leaf;
+ tree->childnode = LEAF;
+ } else if (parent->left == node) {
+ parent->left = leaf;
+ parent->leftnode = LEAF;
+ if (parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ }
+ } else if (parent->right == node) {
+ parent->right = leaf;
+ parent->rightnode = LEAF;
+ if (parent->left) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ parent->keymask |= (1 << node->bitnum);
+ parent->keybits |= (1 << node->bitnum);
+ }
+ } else {
+ /* internal tree error */
+ assert(0);
+ }
+ free(node);
+ node = parent;
+ }
+
+ /* Propagate keymasks up along singleton chains. */
+ while (node) {
+ parent = node->parent;
+ if (!parent)
+ break;
+ /* Nix the mask for parents with two children. */
+ if (node->keymask == 0) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else if (parent->left && parent->right) {
+ parent->keymask = 0;
+ parent->keybits = 0;
+ } else {
+ assert((parent->keymask & node->keymask) == 0);
+ parent->keymask |= node->keymask;
+ parent->keymask |= (1 << parent->bitnum);
+ parent->keybits |= node->keybits;
+ if (parent->right)
+ parent->keybits |= (1 << parent->bitnum);
+ }
+ node = parent;
+ }
+
+ return 0;
+}
+
+/*
+ * Prune internal nodes.
+ *
+ * Fully populated subtrees that end at the same leaf have already
+ * been collapsed. There are still internal nodes that have for both
+ * their left and right branches a sequence of singletons that make
+ * identical choices and end in identical leaves. The keymask and
+ * keybits collected in the nodes describe the choices made in these
+ * singleton chains. When they are identical for the left and right
+ * branch of a node, and the two leaves compare identical, the node in
+ * question can be removed.
+ *
+ * Note that nodes with the nextbyte tag set will not be removed by
+ * this to ensure tree integrity. Note as well that the structure of
+ * utf8 ensures that these nodes would not have been candidates for
+ * removal in any case.
+ */
+static void prune(struct tree *tree)
+{
+ struct node *node;
+ struct node *left;
+ struct node *right;
+ struct node *parent;
+ void *leftleaf;
+ void *rightleaf;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+
+ if (verbose > 0)
+ printf("Pruning %s_%x\n", tree->type, tree->maxage);
+
+ count = 0;
+ if (tree->childnode == LEAF)
+ return;
+ if (!tree->root)
+ return;
+
+ leftmask = rightmask = 0;
+ node = tree->root;
+ while (node) {
+ if (node->nextbyte)
+ goto advance;
+ if (node->leftnode == LEAF)
+ goto advance;
+ if (node->rightnode == LEAF)
+ goto advance;
+ if (!node->left)
+ goto advance;
+ if (!node->right)
+ goto advance;
+ left = node->left;
+ right = node->right;
+ if (left->keymask == 0)
+ goto advance;
+ if (right->keymask == 0)
+ goto advance;
+ if (left->keymask != right->keymask)
+ goto advance;
+ if (left->keybits != right->keybits)
+ goto advance;
+ leftleaf = NULL;
+ while (!leftleaf) {
+ assert(left->left || left->right);
+ if (left->leftnode == LEAF)
+ leftleaf = left->left;
+ else if (left->rightnode == LEAF)
+ leftleaf = left->right;
+ else if (left->left)
+ left = left->left;
+ else if (left->right)
+ left = left->right;
+ else
+ assert(0);
+ }
+ rightleaf = NULL;
+ while (!rightleaf) {
+ assert(right->left || right->right);
+ if (right->leftnode == LEAF)
+ rightleaf = right->left;
+ else if (right->rightnode == LEAF)
+ rightleaf = right->right;
+ else if (right->left)
+ right = right->left;
+ else if (right->right)
+ right = right->right;
+ else
+ assert(0);
+ }
+ if (! tree->leaf_equal(leftleaf, rightleaf))
+ goto advance;
+ /*
+ * This node has identical singleton-only subtrees.
+ * Remove it.
+ */
+ parent = node->parent;
+ left = node->left;
+ right = node->right;
+ if (parent->left == node)
+ parent->left = left;
+ else if (parent->right == node)
+ parent->right = left;
+ else
+ assert(0);
+ left->parent = parent;
+ left->keymask |= (1 << node->bitnum);
+ node->left = NULL;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ if (node->leftnode == NODE && node->left) {
+ left = node->left;
+ free(node);
+ count++;
+ node = left;
+ } else if (node->rightnode == NODE && node->right) {
+ right = node->right;
+ free(node);
+ count++;
+ node = right;
+ } else {
+ node = NULL;
+ }
+ }
+ /* Propagate keymasks up along singleton chains. */
+ node = parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ for (;;) {
+ if (node->left && node->right)
+ break;
+ if (node->left) {
+ left = node->left;
+ node->keymask |= left->keymask;
+ node->keybits |= left->keybits;
+ }
+ if (node->right) {
+ right = node->right;
+ node->keymask |= right->keymask;
+ node->keybits |= right->keybits;
+ }
+ node->keymask |= (1 << node->bitnum);
+ node = node->parent;
+ /* Force re-check */
+ bitmask = 1 << node->bitnum;
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ }
+ advance:
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0 &&
+ node->leftnode == NODE &&
+ node->left) {
+ leftmask |= bitmask;
+ node = node->left;
+ } else if ((rightmask & bitmask) == 0 &&
+ node->rightnode == NODE &&
+ node->right) {
+ rightmask |= bitmask;
+ node = node->right;
+ } else {
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+ }
+ if (verbose > 0)
+ printf("Pruned %d nodes\n", count);
+}
+
+/*
+ * Mark the nodes in the tree that lead to leaves that must be
+ * emitted.
+ */
+static void mark_nodes(struct tree *tree)
+{
+ struct node *node;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int marked;
+
+ marked = 0;
+ if (verbose > 0)
+ printf("Marking %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode == NODE);
+ node = node->right;
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+
+ /* second pass: left siblings and singletons */
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if ((leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ if (tree->leaf_mark(node->left)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ node = node->left;
+ if (!node->mark && node->parent->mark) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ if ((rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ if (tree->leaf_mark(node->right)) {
+ n = node;
+ while (n && !n->mark) {
+ marked++;
+ n->mark = 1;
+ n = n->parent;
+ }
+ }
+ } else if (node->right) {
+ assert(node->rightnode == NODE);
+ node = node->right;
+ if (!node->mark && node->parent->mark &&
+ !node->parent->left) {
+ marked++;
+ node->mark = 1;
+ }
+ continue;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ }
+done:
+ if (verbose > 0)
+ printf("Marked %d nodes\n", marked);
+}
+
+/*
+ * Compute the index of each node and leaf, which is the offset in the
+ * emitted trie. These values must be pre-computed because relative
+ * offsets between nodes are used to navigate the tree.
+ */
+static int index_nodes(struct tree *tree, int index)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int count;
+ int indent;
+
+ /* Align the start of each tree to a 64-byte boundary. */
+ while (index % 64)
+ index++;
+ tree->index = index;
+ indent = 1;
+ count = 0;
+
+ if (verbose > 0)
+ printf("Indexing %s_%x: %d\n", tree->type, tree->maxage, index);
+ if (tree->childnode == LEAF) {
+ index += tree->leaf_size(tree->root);
+ goto done;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ count++;
+ if (node->index != index)
+ node->index = index;
+ index += node->size;
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ *tree->leaf_index(tree, node->left) =
+ index;
+ index += tree->leaf_size(node->left);
+ count++;
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ *tree->leaf_index(tree, node->right) = index;
+ index += tree->leaf_size(node->right);
+ count++;
+ } else if (node->right) {
+ assert(node->rightnode == NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ /* Round up to a multiple of 16 */
+ while (index % 16)
+ index++;
+ if (verbose > 0)
+ printf("Final index %d\n", index);
+ return index;
+}
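
The two rounding loops in index_nodes() step the index one byte at a time until it reaches the next multiple of 64 (start of a tree) or 16 (end of a tree). A minimal standalone sketch of the same rounding follows. The name round_up is hypothetical and not part of the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: round index up to the next multiple of align.
 * For non-negative index and positive align this gives the same
 * result as the "while (index % align) index++;" loops.
 */
static int round_up(int index, int align)
{
	assert(index >= 0 && align > 0);
	return (index + align - 1) / align * align;
}
```

The loop form in the patch is equally correct. The arithmetic form only avoids iterating.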
+
+/*
+ * Compute the size of nodes and leaves. We start by assuming that
+ * each node needs to store a three-byte offset. The indexes of the
+ * nodes are calculated based on that, and then this function is
+ * called to see if the sizes of some nodes can be reduced. This is
+ * repeated until no more changes are seen.
+ */
+static int size_nodes(struct tree *tree)
+{
+ struct tree *next;
+ struct node *node;
+ struct node *right;
+ struct node *n;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ unsigned int pathbits;
+ unsigned int pathmask;
+ int changed;
+ int offset;
+ int size;
+ int indent;
+
+ indent = 1;
+ changed = 0;
+ size = 0;
+
+ if (verbose > 0)
+ printf("Sizing %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF)
+ goto done;
+
+ assert(tree->childnode == NODE);
+ pathbits = 0;
+ pathmask = 0;
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ offset = 0;
+ if (!node->left || !node->right) {
+ size = 1;
+ } else {
+ if (node->rightnode == NODE) {
+ right = node->right;
+ next = tree->next;
+ while (!right->mark) {
+ assert(next);
+ n = next->root;
+ while (n->bitnum != node->bitnum) {
+ if (pathbits & (1<<n->bitnum))
+ n = n->right;
+ else
+ n = n->left;
+ }
+ n = n->right;
+ assert(right->bitnum == n->bitnum);
+ right = n;
+ next = next->next;
+ }
+ offset = right->index - node->index;
+ } else {
+ offset = *tree->leaf_index(tree, node->right);
+ offset -= node->index;
+ }
+ assert(offset >= 0);
+ assert(offset <= 0xffffff);
+ if (offset <= 0xff) {
+ size = 2;
+ } else if (offset <= 0xffff) {
+ size = 3;
+ } else { /* offset <= 0xffffff */
+ size = 4;
+ }
+ }
+ if (node->size != size || node->offset != offset) {
+ node->size = size;
+ node->offset = offset;
+ changed++;
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ pathmask |= bitmask;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ pathbits |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ } else if (node->right) {
+ assert(node->rightnode == NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ pathmask &= ~bitmask;
+ pathbits &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+done:
+ if (verbose > 0)
+ printf("Found %d changes\n", changed);
+ return changed;
+}
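
The size selection in size_nodes() can be read in isolation: a node with a single child occupies one header byte, and a node with two children adds 1, 2 or 3 bytes for the offset to its right child. The sketch below restates that rule. The helper name node_size and its has_both_children parameter are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper mirroring the size selection in size_nodes():
 * one header byte, plus 1 to 3 offset bytes when the node has both
 * a left and a right child.
 */
static int node_size(int has_both_children, int offset)
{
	if (!has_both_children)
		return 1;
	assert(offset >= 0 && offset <= 0xffffff);
	if (offset <= 0xff)
		return 2;
	if (offset <= 0xffff)
		return 3;
	return 4;
}
```

Because shrinking one node moves every node after it, the caller has to re-run indexing and sizing until no size changes, which is what the comment above size_nodes() describes.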
+
+/*
+ * Emit a trie for the given tree into the data array.
+ */
+static void emit(struct tree *tree, unsigned char *data)
+{
+ struct node *node;
+ unsigned int leftmask;
+ unsigned int rightmask;
+ unsigned int bitmask;
+ int offlen;
+ int offset;
+ int index;
+ int indent;
+ unsigned char byte;
+
+ index = tree->index;
+ data += index;
+ indent = 1;
+ if (verbose > 0)
+ printf("Emitting %s_%x\n", tree->type, tree->maxage);
+ if (tree->childnode == LEAF) {
+ assert(tree->root);
+ tree->leaf_emit(tree->root, data);
+ return;
+ }
+
+ assert(tree->childnode == NODE);
+ node = tree->root;
+ leftmask = rightmask = 0;
+ while (node) {
+ if (!node->mark)
+ goto skip;
+ assert(node->offset != -1);
+ assert(node->index == index);
+
+ byte = 0;
+ if (node->nextbyte)
+ byte |= NEXTBYTE;
+ byte |= (node->bitnum & BITNUM);
+ if (node->left && node->right) {
+ if (node->leftnode == NODE)
+ byte |= LEFTNODE;
+ if (node->rightnode == NODE)
+ byte |= RIGHTNODE;
+ if (node->offset <= 0xff)
+ offlen = 1;
+ else if (node->offset <= 0xffff)
+ offlen = 2;
+ else
+ offlen = 3;
+ offset = node->offset;
+ byte |= offlen << OFFLEN_SHIFT;
+ *data++ = byte;
+ index++;
+ while (offlen--) {
+ *data++ = offset & 0xff;
+ index++;
+ offset >>= 8;
+ }
+ } else if (node->left) {
+ if (node->leftnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else if (node->right) {
+ byte |= RIGHTNODE;
+ if (node->rightnode == NODE)
+ byte |= TRIENODE;
+ *data++ = byte;
+ index++;
+ } else {
+ assert(0);
+ }
+skip:
+ while (node) {
+ bitmask = 1 << node->bitnum;
+ if (node->mark && (leftmask & bitmask) == 0) {
+ leftmask |= bitmask;
+ if (node->leftnode == LEAF) {
+ assert(node->left);
+ data = tree->leaf_emit(node->left,
+ data);
+ index += tree->leaf_size(node->left);
+ } else if (node->left) {
+ assert(node->leftnode == NODE);
+ indent += 1;
+ node = node->left;
+ break;
+ }
+ }
+ if (node->mark && (rightmask & bitmask) == 0) {
+ rightmask |= bitmask;
+ if (node->rightnode == LEAF) {
+ assert(node->right);
+ data = tree->leaf_emit(node->right,
+ data);
+ index += tree->leaf_size(node->right);
+ } else if (node->right) {
+ assert(node->rightnode == NODE);
+ indent += 1;
+ node = node->right;
+ break;
+ }
+ }
+ leftmask &= ~bitmask;
+ rightmask &= ~bitmask;
+ node = node->parent;
+ indent -= 1;
+ }
+ }
+}
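
For a node with two children, emit() writes the offset to the right child least significant byte first, using as many bytes as offlen says. A small sketch of that byte order, with a matching reader, is given below. Both put_offset and get_offset are hypothetical names, and the reader is an illustration only, not the lookup code from the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: store offset in offlen bytes, least significant
 * byte first, the way the "while (offlen--)" loop in emit() does.
 * Returns the advanced output pointer.
 */
static unsigned char *put_offset(unsigned char *data, int offset, int offlen)
{
	assert(offlen >= 1 && offlen <= 3);
	while (offlen--) {
		*data++ = offset & 0xff;
		offset >>= 8;
	}
	return data;
}

/* Hypothetical reader for the same layout, for illustration only. */
static int get_offset(const unsigned char *data, int offlen)
{
	int offset = 0;
	int i;

	for (i = offlen - 1; i >= 0; i--)
		offset = (offset << 8) | data[i];
	return offset;
}
```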
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Unicode data.
+ *
+ * We need to keep track of the Canonical Combining Class, the Age,
+ * and decompositions for a code point.
+ *
+ * For the Age, we store the index into the ages table. Effectively
+ * this is a generation number that the table maps to a unicode
+ * version.
+ *
+ * The correction field is used to indicate that this entry is in the
+ * corrections array, which contains decompositions that were
+ * corrected in later revisions. The value of the correction field is
+ * the Unicode version in which the mapping was corrected.
+ */
+struct unicode_data {
+ unsigned int code;
+ int ccc;
+ int gen;
+ int correction;
+ unsigned int *utf32nfkdi;
+ unsigned int *utf32nfkdicf;
+ char *utf8nfkdi;
+ char *utf8nfkdicf;
+};
+
+struct unicode_data unicode_data[0x110000];
+struct unicode_data *corrections;
+int corrections_count;
+
+struct tree *nfkdi_tree;
+struct tree *nfkdicf_tree;
+
+struct tree *trees;
+int trees_count;
+
+/*
+ * Check the corrections array to see if this entry was corrected at
+ * some point.
+ */
+static struct unicode_data *corrections_lookup(struct unicode_data *u)
+{
+ int i;
+
+ for (i = 0; i != corrections_count; i++)
+ if (u->code == corrections[i].code)
+ return &corrections[i];
+ return u;
+}
+
+static int nfkdi_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static int nfkdicf_equal(void *l, void *r)
+{
+ struct unicode_data *left = l;
+ struct unicode_data *right = r;
+
+ if (left->gen != right->gen)
+ return 0;
+ if (left->ccc != right->ccc)
+ return 0;
+ if (left->utf8nfkdicf && right->utf8nfkdicf &&
+ strcmp(left->utf8nfkdicf, right->utf8nfkdicf) == 0)
+ return 1;
+ if (left->utf8nfkdicf && right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdicf || right->utf8nfkdicf)
+ return 0;
+ if (left->utf8nfkdi && right->utf8nfkdi &&
+ strcmp(left->utf8nfkdi, right->utf8nfkdi) == 0)
+ return 1;
+ if (left->utf8nfkdi || right->utf8nfkdi)
+ return 0;
+ return 1;
+}
+
+static void nfkdi_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static void nfkdicf_print(void *l, int indent)
+{
+ struct unicode_data *leaf = l;
+
+ printf("%*sleaf @ %p code %X ccc %d gen %d", indent, "", leaf,
+ leaf->code, leaf->ccc, leaf->gen);
+ if (leaf->utf8nfkdicf)
+ printf(" nfkdicf \"%s\"", (const char*)leaf->utf8nfkdicf);
+ else if (leaf->utf8nfkdi)
+ printf(" nfkdi \"%s\"", (const char*)leaf->utf8nfkdi);
+ printf("\n");
+}
+
+static int nfkdi_mark(void *l)
+{
+ return 1;
+}
+
+static int nfkdicf_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ if (leaf->utf8nfkdicf)
+ return 1;
+ return 0;
+}
+
+static int correction_mark(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return leaf->correction;
+}
+
+static int nfkdi_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int nfkdicf_size(void *l)
+{
+ struct unicode_data *leaf = l;
+
+ int size = 2;
+ if (leaf->utf8nfkdicf)
+ size += strlen(leaf->utf8nfkdicf) + 1;
+ else if (leaf->utf8nfkdi)
+ size += strlen(leaf->utf8nfkdi) + 1;
+ return size;
+}
+
+static int *nfkdi_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static int *nfkdicf_index(struct tree *tree, void *l)
+{
+ struct unicode_data *leaf = l;
+
+ return &tree->leafindex[leaf->code];
+}
+
+static unsigned char *nfkdi_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
+
+static unsigned char *nfkdicf_emit(void *l, unsigned char *data)
+{
+ struct unicode_data *leaf = l;
+ unsigned char *s;
+
+ *data++ = leaf->gen;
+ if (leaf->utf8nfkdicf) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdicf;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else if (leaf->utf8nfkdi) {
+ *data++ = DECOMPOSE;
+ s = (unsigned char*)leaf->utf8nfkdi;
+ while ((*data++ = *s++) != 0)
+ ;
+ } else {
+ *data++ = leaf->ccc;
+ }
+ return data;
+}
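
The size and emit callbacks have to agree on the leaf layout: a generation byte, then either the ccc byte or a DECOMPOSE marker followed by a NUL-terminated UTF-8 string. The sketch below shows that pairing on a simplified leaf. struct leaf, leaf_size and leaf_emit are hypothetical, and the value 255 for DECOMPOSE is an assumption, since the macro is defined outside this part of the patch.

```c
#include <assert.h>
#include <string.h>

#define DECOMPOSE 255	/* assumed marker value; defined elsewhere in the file */

/* Hypothetical, simplified leaf: generation, ccc, optional decomposition. */
struct leaf {
	int gen;
	int ccc;
	const char *decomp;	/* NULL when there is no decomposition */
};

/* Two fixed bytes, plus the string and its terminating NUL if present. */
static int leaf_size(const struct leaf *leaf)
{
	int size = 2;

	if (leaf->decomp)
		size += strlen(leaf->decomp) + 1;
	return size;
}

/* Writes exactly leaf_size() bytes and returns the advanced pointer. */
static unsigned char *leaf_emit(const struct leaf *leaf, unsigned char *data)
{
	const unsigned char *s;

	*data++ = leaf->gen;
	if (leaf->decomp) {
		*data++ = DECOMPOSE;
		s = (const unsigned char *)leaf->decomp;
		while ((*data++ = *s++) != 0)
			;
	} else {
		*data++ = leaf->ccc;
	}
	return data;
}
```

If the two callbacks disagreed, the indexes computed by index_nodes() would no longer match the bytes written by emit(), and the assert on node->index in emit() would be expected to fire.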
+
+static void utf8_create(struct unicode_data *data)
+{
+ char utf[18*4+1];
+ char *u;
+ unsigned int *um;
+ int i;
+
+ u = utf;
+ um = data->utf32nfkdi;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ data->utf8nfkdi = strdup(utf);
+ }
+ u = utf;
+ um = data->utf32nfkdicf;
+ if (um) {
+ for (i = 0; um[i]; i++)
+ u += utf8encode(u, um[i]);
+ *u = '\0';
+ if (!data->utf8nfkdi || strcmp(data->utf8nfkdi, utf))
+ data->utf8nfkdicf = strdup(utf);
+ }
+}
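
utf8_create() relies on utf8encode() returning the number of bytes it wrote, so that the output pointer can be advanced per code point. The buffer of 18*4+1 bytes leaves room for 18 code points of up to 4 bytes each plus the terminator. A minimal stand-in with that contract might look as follows. utf8encode_sketch is a hypothetical name, and the real utf8encode() is defined in a part of the file not shown here.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for utf8encode(): write the UTF-8 encoding of
 * unichar to str and return the number of bytes written (1 to 4).
 * No terminating NUL is written.
 */
static int utf8encode_sketch(char *str, unsigned int unichar)
{
	assert(unichar <= 0x10ffff);
	if (unichar < 0x80) {
		str[0] = (char)unichar;
		return 1;
	}
	if (unichar < 0x800) {
		str[0] = (char)(0xc0 | (unichar >> 6));
		str[1] = (char)(0x80 | (unichar & 0x3f));
		return 2;
	}
	if (unichar < 0x10000) {
		str[0] = (char)(0xe0 | (unichar >> 12));
		str[1] = (char)(0x80 | ((unichar >> 6) & 0x3f));
		str[2] = (char)(0x80 | (unichar & 0x3f));
		return 3;
	}
	str[0] = (char)(0xf0 | (unichar >> 18));
	str[1] = (char)(0x80 | ((unichar >> 12) & 0x3f));
	str[2] = (char)(0x80 | ((unichar >> 6) & 0x3f));
	str[3] = (char)(0x80 | (unichar & 0x3f));
	return 4;
}
```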
+
+static void utf8_init(void)
+{
+ unsigned int unichar;
+ int i;
+
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ utf8_create(&unicode_data[unichar]);
+
+ for (i = 0; i != corrections_count; i++)
+ utf8_create(&corrections[i]);
+}
+
+static void trees_init(void)
+{
+ struct unicode_data *data;
+ unsigned int maxage;
+ unsigned int nextage;
+ int count;
+ int i;
+ int j;
+
+ /* Count the number of different ages. */
+ count = 0;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ nextage = 0;
+ for (i = 0; i != corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ count++;
+ } while (nextage);
+
+ /* Two trees per age: nfkdi and nfkdicf */
+ trees_count = count * 2;
+ trees = calloc(trees_count, sizeof(struct tree));
+
+ /* Assign ages to the trees. */
+ count = trees_count;
+ nextage = (unsigned int)-1;
+ do {
+ maxage = nextage;
+ trees[--count].maxage = maxage;
+ trees[--count].maxage = maxage;
+ nextage = 0;
+ for (i = 0; i != corrections_count; i++) {
+ data = &corrections[i];
+ if (nextage < data->correction &&
+ data->correction < maxage)
+ nextage = data->correction;
+ }
+ } while (nextage);
+
+ /* The ages assigned above are off by one. */
+ for (i = 0; i != trees_count; i++) {
+ j = 0;
+ while (ages[j] < trees[i].maxage)
+ j++;
+ trees[i].maxage = ages[j-1];
+ }
+
+ /* Set up the forwarding between trees. */
+ trees[trees_count-2].next = &trees[trees_count-1];
+ trees[trees_count-1].leaf_mark = nfkdi_mark;
+ trees[trees_count-2].leaf_mark = nfkdicf_mark;
+ for (i = 0; i != trees_count-2; i += 2) {
+ trees[i].next = &trees[trees_count-2];
+ trees[i].leaf_mark = correction_mark;
+ trees[i+1].next = &trees[trees_count-1];
+ trees[i+1].leaf_mark = correction_mark;
+ }
+
+ /* Assign the callouts. */
+ for (i = 0; i != trees_count; i += 2) {
+ trees[i].type = "nfkdicf";
+ trees[i].leaf_equal = nfkdicf_equal;
+ trees[i].leaf_print = nfkdicf_print;
+ trees[i].leaf_size = nfkdicf_size;
+ trees[i].leaf_index = nfkdicf_index;
+ trees[i].leaf_emit = nfkdicf_emit;
+
+ trees[i+1].type = "nfkdi";
+ trees[i+1].leaf_equal = nfkdi_equal;
+ trees[i+1].leaf_print = nfkdi_print;
+ trees[i+1].leaf_size = nfkdi_size;
+ trees[i+1].leaf_index = nfkdi_index;
+ trees[i+1].leaf_emit = nfkdi_emit;
+ }
+
+ /* Finish init. */
+ for (i = 0; i != trees_count; i++)
+ trees[i].childnode = NODE;
+}
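
The do/while loops at the top of trees_init() walk the distinct correction ages from newest to oldest: each pass finds the largest value strictly below the previous maximum, and the loop ends when none is left. The result is the number of distinct nonzero ages plus one for the unbounded "latest" tree. A standalone sketch of the counting pass is shown below. count_ages and its array argument are hypothetical, and the loop stops at n rather than reading one element past the end.

```c
#include <assert.h>

/*
 * Hypothetical helper: count the distinct nonzero values in corr[],
 * plus one for the initial (unsigned int)-1 "no upper bound" pass.
 */
static int count_ages(const unsigned int *corr, int n)
{
	unsigned int maxage;
	unsigned int nextage = (unsigned int)-1;
	int count = 0;
	int i;

	assert(n >= 0);
	do {
		maxage = nextage;
		nextage = 0;
		/* Largest value strictly below the previous maximum. */
		for (i = 0; i != n; i++)
			if (nextage < corr[i] && corr[i] < maxage)
				nextage = corr[i];
		count++;
	} while (nextage);
	return count;
}
```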
+
+static void trees_populate(void)
+{
+ struct unicode_data *data;
+ unsigned int unichar;
+ char keyval[4];
+ int keylen;
+ int i;
+
+ for (i = 0; i != trees_count; i++) {
+ if (verbose > 0) {
+ printf("Populating %s_%x\n",
+ trees[i].type, trees[i].maxage);
+ }
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (unicode_data[unichar].gen < 0)
+ continue;
+ keylen = utf8encode(keyval, unichar);
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= trees[i].maxage)
+ data = &unicode_data[unichar];
+ insert(&trees[i], keyval, keylen, data);
+ }
+ }
+}
+
+static void trees_reduce(void)
+{
+ int i;
+ int size;
+ int changed;
+
+ for (i = 0; i != trees_count; i++)
+ prune(&trees[i]);
+ for (i = 0; i != trees_count; i++)
+ mark_nodes(&trees[i]);
+ do {
+ size = 0;
+ for (i = 0; i != trees_count; i++)
+ size = index_nodes(&trees[i], size);
+ changed = 0;
+ for (i = 0; i != trees_count; i++)
+ changed += size_nodes(&trees[i]);
+ } while (changed);
+
+ utf8data = calloc(size, 1);
+ utf8data_size = size;
+ for (i = 0; i != trees_count; i++)
+ emit(&trees[i], utf8data);
+
+ if (verbose > 0) {
+ for (i = 0; i != trees_count; i++) {
+ printf("%s_%x idx %d\n",
+ trees[i].type, trees[i].maxage, trees[i].index);
+ }
+ }
+
+ nfkdi = utf8data + trees[trees_count-1].index;
+ nfkdicf = utf8data + trees[trees_count-2].index;
+
+ nfkdi_tree = &trees[trees_count-1];
+ nfkdicf_tree = &trees[trees_count-2];
+}
+
+static void verify(struct tree *tree)
+{
+ struct unicode_data *data;
+ utf8leaf_t *leaf;
+ unsigned int unichar;
+ char key[4];
+ int report;
+ int nocf;
+
+ if (verbose > 0)
+ printf("Verifying %s_%x\n", tree->type, tree->maxage);
+ nocf = strcmp(tree->type, "nfkdicf");
+
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ report = 0;
+ data = corrections_lookup(&unicode_data[unichar]);
+ if (data->correction <= tree->maxage)
+ data = &unicode_data[unichar];
+ utf8encode(key, unichar);
+ leaf = utf8lookup(tree, key);
+ if (!leaf) {
+ if (data->gen != -1)
+ report++;
+ if (unichar < 0xd800 || unichar > 0xdfff)
+ report++;
+ } else {
+ if (unichar >= 0xd800 && unichar <= 0xdfff)
+ report++;
+ if (data->gen == -1)
+ report++;
+ if (data->gen != LEAF_GEN(leaf))
+ report++;
+ if (LEAF_CCC(leaf) == DECOMPOSE) {
+ if (nocf) {
+ if (!data->utf8nfkdi) {
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ } else {
+ if (!data->utf8nfkdicf &&
+ !data->utf8nfkdi) {
+ report++;
+ } else if (data->utf8nfkdicf) {
+ if (strcmp(data->utf8nfkdicf,
+ LEAF_STR(leaf)))
+ report++;
+ } else if (strcmp(data->utf8nfkdi,
+ LEAF_STR(leaf))) {
+ report++;
+ }
+ }
+ } else if (data->ccc != LEAF_CCC(leaf)) {
+ report++;
+ }
+ }
+ if (report) {
+ printf("%X code %X gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ unichar, data->code, data->gen,
+ data->ccc,
+ data->utf8nfkdi);
+ if (leaf) {
+ printf(" gen %d ccc %d"
+ " nfkdi -> \"%s\"",
+ LEAF_GEN(leaf),
+ LEAF_CCC(leaf),
+ LEAF_CCC(leaf) == DECOMPOSE ?
+ LEAF_STR(leaf) : "");
+ }
+ printf("\n");
+ }
+ }
+}
+
+static void trees_verify(void)
+{
+ int i;
+
+ for (i = 0; i != trees_count; i++)
+ verify(&trees[i]);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void help(void)
+{
+ printf("Usage: %s [options]\n", argv0);
+ printf("\n");
+ printf("This program creates a data trie used for parsing and\n");
+ printf("normalization of UTF-8 strings. The trie is derived from\n");
+ printf("a set of input files from the Unicode character database\n");
+ printf("found at: http://www.unicode.org/Public/UCD/latest/ucd/\n");
+ printf("\n");
+ printf("The generated tree supports two normalization forms:\n");
+ printf("\n");
+ printf("\tnfkdi:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\n");
+ printf("\tnfkdicf:\n");
+ printf("\t- Apply unicode normalization form NFKD.\n");
+ printf("\t- Remove any Default_Ignorable_Code_Point.\n");
+ printf("\t- Apply a full casefold (C + F).\n");
+ printf("\n");
+ printf("These forms were chosen as being most useful when dealing\n");
+ printf("with file names: NFKD catches most cases where characters\n");
+ printf("should be considered equivalent. The ignorables are mostly\n");
+ printf("invisible, making names hard to type.\n");
+ printf("\n");
+ printf("The options to specify the files to be used are listed\n");
+ printf("below with their default values, which are the names used\n");
+ printf("by version 11.0.0 of the Unicode Character Database.\n");
+ printf("\n");
+ printf("The input files:\n");
+ printf("\t-a %s\n", AGE_NAME);
+ printf("\t-c %s\n", CCC_NAME);
+ printf("\t-p %s\n", PROP_NAME);
+ printf("\t-d %s\n", DATA_NAME);
+ printf("\t-f %s\n", FOLD_NAME);
+ printf("\t-n %s\n", NORM_NAME);
+ printf("\n");
+ printf("Additionally, the generated tables are tested using:\n");
+ printf("\t-t %s\n", TEST_NAME);
+ printf("\n");
+ printf("Finally, the output file:\n");
+ printf("\t-o %s\n", UTF8_NAME);
+ printf("\n");
+}
+
+static void usage(void)
+{
+ help();
+ exit(1);
+}
+
+static void open_fail(const char *name, int error)
+{
+ printf("Error %d opening %s: %s\n", error, name, strerror(error));
+ exit(1);
+}
+
+static void file_fail(const char *filename)
+{
+ printf("Error parsing %s\n", filename);
+ exit(1);
+}
+
+static void line_fail(const char *filename, const char *line)
+{
+ printf("Error parsing %s:%s\n", filename, line);
+ exit(1);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void print_utf32(unsigned int *utf32str)
+{
+ int i;
+
+ for (i = 0; utf32str[i]; i++)
+ printf(" %X", utf32str[i]);
+}
+
+static void print_utf32nfkdi(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdi);
+ printf("\n");
+}
+
+static void print_utf32nfkdicf(unsigned int unichar)
+{
+ printf(" %X ->", unichar);
+ print_utf32(unicode_data[unichar].utf32nfkdicf);
+ printf("\n");
+}
+
+/* ------------------------------------------------------------------ */
+
+static void age_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ int gen;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", age_name);
+
+ file = fopen(age_name, "r");
+ if (!file)
+ open_fail(age_name, errno);
+ count = 0;
+
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d\n",
+ major, minor, revision);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages_count++;
+ if (verbose > 1)
+ printf(" Age V%d_%d\n", major, minor);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+
+ /* We must have found something above. */
+ if (verbose > 1)
+ printf("%d age entries\n", ages_count);
+ if (ages_count == 0 || ages_count > MAXGEN)
+ file_fail(age_name);
+
+ /* There is a 0 entry. */
+ ages_count++;
+ ages = calloc(ages_count + 1, sizeof(*ages));
+ /* And a guard entry. */
+ ages[ages_count] = (unsigned int)-1;
+
+ rewind(file);
+ count = 0;
+ gen = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "# Age=V%d_%d_%d",
+ &major, &minor, &revision);
+ if (ret == 3) {
+ ages[++gen] =
+ UNICODE_AGE(major, minor, revision);
+ if (verbose > 1)
+ printf(" Age V%d_%d_%d = gen %d\n",
+ major, minor, revision, gen);
+ if (!age_valid(major, minor, revision))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "# Age=V%d_%d", &major, &minor);
+ if (ret == 2) {
+ ages[++gen] = UNICODE_AGE(major, minor, 0);
+ if (verbose > 1)
+ printf(" Age V%d_%d = gen %d\n",
+ major, minor, gen);
+ if (!age_valid(major, minor, 0))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X..%X ; %d.%d #",
+ &first, &last, &major, &minor);
+ if (ret == 4) {
+ for (unichar = first; unichar <= last; unichar++)
+ unicode_data[unichar].gen = gen;
+ count += 1 + last - first;
+ if (verbose > 1)
+ printf(" %X..%X gen %d\n", first, last, gen);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(age_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d.%d #", &unichar, &major, &minor);
+ if (ret == 3) {
+ unicode_data[unichar].gen = gen;
+ count++;
+ if (verbose > 1)
+ printf(" %X gen %d\n", unichar, gen);
+ if (!utf32valid(unichar))
+ line_fail(age_name, line);
+ continue;
+ }
+ }
+ unicode_maxage = ages[gen];
+ fclose(file);
+
+ /* Nix surrogate block */
+ if (verbose > 1)
+ printf(" Removing surrogate block D800..DFFF\n");
+ for (unichar = 0xd800; unichar <= 0xdfff; unichar++)
+ unicode_data[unichar].gen = -1;
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(age_name);
+}
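
age_init() recognises two header forms, "# Age=V<major>_<minor>_<revision>" and "# Age=V<major>_<minor>", and tries the three-field form first so that the shorter pattern does not swallow it. The sketch below isolates that step. parse_age_header is a hypothetical helper, and AGE_SKETCH is an assumed packing standing in for UNICODE_AGE, whose definition is outside this part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed packing; the real UNICODE_AGE macro is defined elsewhere. */
#define AGE_SKETCH(maj, min, rev) \
	(((unsigned int)(maj) << 16) | ((unsigned int)(min) << 8) | \
	 (unsigned int)(rev))

/*
 * Hypothetical helper: parse a "# Age=V<maj>_<min>[_<rev>]" header.
 * Returns the packed age, or 0 if the line is not an age header.
 */
static unsigned int parse_age_header(const char *line)
{
	unsigned int major;
	unsigned int minor;
	unsigned int revision;

	assert(line);
	/* Try the longer form first, as age_init() does. */
	if (sscanf(line, "# Age=V%u_%u_%u", &major, &minor, &revision) == 3)
		return AGE_SKETCH(major, minor, revision);
	if (sscanf(line, "# Age=V%u_%u", &major, &minor) == 2)
		return AGE_SKETCH(major, minor, 0);
	return 0;
}
```

The sketch uses %u because the destination variables are unsigned. The patch itself scans them with %d.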
+
+static void ccc_init(void)
+{
+ FILE *file;
+ unsigned int first;
+ unsigned int last;
+ unsigned int unichar;
+ unsigned int value;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", ccc_name);
+
+ file = fopen(ccc_name, "r");
+ if (!file)
+ open_fail(ccc_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %d #", &first, &last, &value);
+ if (ret == 3) {
+ for (unichar = first; unichar <= last; unichar++) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X ccc %d\n", first, last, value);
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %d #", &unichar, &value);
+ if (ret == 2) {
+ unicode_data[unichar].ccc = value;
+ count++;
+ if (verbose > 1)
+ printf(" %X ccc %d\n", unichar, value);
+ if (!utf32valid(unichar))
+ line_fail(ccc_name, line);
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(ccc_name);
+}
+
+static void nfkdi_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ unsigned int *um;
+ int count;
+ int i;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", data_name);
+ file = fopen(data_name, "r");
+ if (!file)
+ open_fail(data_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%*[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ &unichar, buf0);
+ if (ret != 2)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(data_name, line);
+
+ s = buf0;
+ /* skip over <tag> */
+ if (*s == '<')
+ while (*s && *s++ != ' ')
+ ;
+ /* decode the decomposition into UTF-32 */
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(data_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(data_name);
+}
+
+static void nfkdicf_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char status;
+ char *s;
+ unsigned int *um;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", fold_name);
+ file = fopen(fold_name, "r");
+ if (!file)
+ open_fail(fold_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X; %c; %[^;];", &unichar, &status, buf0);
+ if (ret != 3)
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(fold_name, line);
+ /* Use the C+F casefold. */
+ if (status != 'C' && status != 'F')
+ continue;
+ s = buf0;
+ if (*s == '<')
+ while (*s++ != ' ')
+ ;
+ i = 0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(fold_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(fold_name);
+}
+
+static void ignore_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int first;
+ unsigned int last;
+ unsigned int *um;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", prop_name);
+ file = fopen(prop_name, "r");
+ if (!file)
+ open_fail(prop_name, errno);
+ assert(file);
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X..%X ; %s # ", &first, &last, buf0);
+ if (ret == 3) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(first) || !utf32valid(last))
+ line_fail(prop_name, line);
+ for (unichar = first; unichar <= last; unichar++) {
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ count++;
+ }
+ if (verbose > 1)
+ printf(" %X..%X Default_Ignorable_Code_Point\n",
+ first, last);
+ continue;
+ }
+ ret = sscanf(line, "%X ; %s # ", &unichar, buf0);
+ if (ret == 2) {
+ if (strcmp(buf0, "Default_Ignorable_Code_Point"))
+ continue;
+ if (!utf32valid(unichar))
+ line_fail(prop_name, line);
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdi = um;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(sizeof(unsigned int));
+ *um = 0;
+ unicode_data[unichar].utf32nfkdicf = um;
+ if (verbose > 1)
+ printf(" %X Default_Ignorable_Code_Point\n",
+ unichar);
+ count++;
+ continue;
+ }
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(prop_name);
+}
+
+static void corrections_init(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ unsigned int major;
+ unsigned int minor;
+ unsigned int revision;
+ unsigned int age;
+ unsigned int *um;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ char *s;
+ int i;
+ int count;
+ int ret;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", norm_name);
+ file = fopen(norm_name, "r");
+ if (!file)
+ open_fail(norm_name, errno);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ count++;
+ }
+ corrections = calloc(count, sizeof(struct unicode_data));
+ corrections_count = count;
+ rewind(file);
+
+ count = 0;
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%X;%[^;];%[^;];%d.%d.%d #",
+ &unichar, buf0, buf1,
+ &major, &minor, &revision);
+ if (ret != 6)
+ continue;
+ if (!utf32valid(unichar) || !age_valid(major, minor, revision))
+ line_fail(norm_name, line);
+ corrections[count] = unicode_data[unichar];
+ assert(corrections[count].code == unichar);
+ age = UNICODE_AGE(major, minor, revision);
+ corrections[count].correction = age;
+
+ i = 0;
+ s = buf0;
+ while (*s) {
+ mapping[i] = strtoul(s, &s, 16);
+ if (!utf32valid(mapping[i]))
+ line_fail(norm_name, line);
+ i++;
+ }
+ mapping[i++] = 0;
+
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ corrections[count].utf32nfkdi = um;
+
+ if (verbose > 1)
+ printf(" %X -> %s -> %s V%d_%d_%d\n",
+ unichar, buf0, buf1, major, minor, revision);
+ count++;
+ }
+ fclose(file);
+
+ if (verbose > 0)
+ printf("Found %d entries\n", count);
+ if (count == 0)
+ file_fail(norm_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+/*
+ * Hangul decomposition (algorithm from Section 3.12 of Unicode 6.3.0)
+ *
+ * AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
+ * D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
+ *
+ * SBase = 0xAC00
+ * LBase = 0x1100
+ * VBase = 0x1161
+ * TBase = 0x11A7
+ * LCount = 19
+ * VCount = 21
+ * TCount = 28
+ * NCount = 588 (VCount * TCount)
+ * SCount = 11172 (LCount * NCount)
+ *
+ * Decomposition:
+ * SIndex = s - SBase
+ *
+ * LV (Canonical/Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ *
+ * LVT (Canonical)
+ * LVIndex = (SIndex / TCount) * TCount
+ * TIndex = (SIndex % TCount)
+ * LVPart = SBase + LVIndex
+ * TPart = TBase + TIndex
+ *
+ * LVT (Full)
+ * LIndex = SIndex / NCount
+ * VIndex = (SIndex % NCount) / TCount
+ * TIndex = (SIndex % TCount)
+ * LPart = LBase + LIndex
+ * VPart = VBase + VIndex
+ * if (TIndex == 0) {
+ * d = <LPart, VPart>
+ * } else {
+ * TPart = TBase + TIndex
+ * d = <LPart, VPart, TPart>
+ * }
+ *
+ */
+
+static void
+hangul_decompose(void)
+{
+ unsigned int sb = 0xAC00;
+ unsigned int lb = 0x1100;
+ unsigned int vb = 0x1161;
+ unsigned int tb = 0x11a7;
+ /* unsigned int lc = 19; */
+ unsigned int vc = 21;
+ unsigned int tc = 28;
+ unsigned int nc = (vc * tc);
+ /* unsigned int sc = (lc * nc); */
+ unsigned int unichar;
+ unsigned int mapping[4];
+ unsigned int *um;
+ int count;
+ int i;
+
+ if (verbose > 0)
+ printf("Decomposing hangul\n");
+ /* Hangul */
+ count = 0;
+ for (unichar = 0xAC00; unichar <= 0xD7A3; unichar++) {
+ unsigned int si = unichar - sb;
+ unsigned int li = si / nc;
+ unsigned int vi = (si % nc) / tc;
+ unsigned int ti = si % tc;
+
+ i = 0;
+ mapping[i++] = lb + li;
+ mapping[i++] = vb + vi;
+ if (ti)
+ mapping[i++] = tb + ti;
+ mapping[i++] = 0;
+
+ assert(!unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+
+ assert(!unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+
+ count++;
+ }
+ if (verbose > 0)
+ printf("Created %d entries\n", count);
+}
+
+static void nfkdi_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdi\n");
+
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdi)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdi;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdi;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdi);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdi = um;
+ }
+ /* Add this decomposition to nfkdicf if there is no entry. */
+ if (!unicode_data[unichar].utf32nfkdicf) {
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdi(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+static void nfkdicf_decompose(void)
+{
+ unsigned int unichar;
+ unsigned int mapping[19]; /* Magic - guaranteed not to be exceeded. */
+ unsigned int *um;
+ unsigned int *dc;
+ int count;
+ int i;
+ int j;
+ int ret;
+
+ if (verbose > 0)
+ printf("Decomposing nfkdicf\n");
+ count = 0;
+ for (unichar = 0; unichar != 0x110000; unichar++) {
+ if (!unicode_data[unichar].utf32nfkdicf)
+ continue;
+ for (;;) {
+ ret = 1;
+ i = 0;
+ um = unicode_data[unichar].utf32nfkdicf;
+ while (*um) {
+ dc = unicode_data[*um].utf32nfkdicf;
+ if (dc) {
+ for (j = 0; dc[j]; j++)
+ mapping[i++] = dc[j];
+ ret = 0;
+ } else {
+ mapping[i++] = *um;
+ }
+ um++;
+ }
+ mapping[i++] = 0;
+ if (ret)
+ break;
+ free(unicode_data[unichar].utf32nfkdicf);
+ um = malloc(i * sizeof(unsigned int));
+ memcpy(um, mapping, i * sizeof(unsigned int));
+ unicode_data[unichar].utf32nfkdicf = um;
+ }
+ if (verbose > 1)
+ print_utf32nfkdicf(unichar);
+ count++;
+ }
+ if (verbose > 0)
+ printf("Processed %d entries\n", count);
+}
+
+/* ------------------------------------------------------------------ */
+
+int utf8agemax(struct tree *, const char *);
+int utf8nagemax(struct tree *, const char *, size_t);
+int utf8agemin(struct tree *, const char *);
+int utf8nagemin(struct tree *, const char *, size_t);
+ssize_t utf8len(struct tree *, const char *);
+ssize_t utf8nlen(struct tree *, const char *, size_t);
+struct utf8cursor;
+int utf8cursor(struct utf8cursor *, struct tree *, const char *);
+int utf8ncursor(struct utf8cursor *, struct tree *, const char *, size_t);
+int utf8byte(struct utf8cursor *);
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(struct tree *tree, const char *s, size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ /* Check tree before dereferencing it to locate the trie. */
+ if (!tree || len == 0)
+ return NULL;
+ trie = utf8data + tree->index;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ return NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ return NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *utf8lookup(struct tree *tree, const char *s)
+{
+ return utf8nlookup(tree, s, (size_t)-1);
+}
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!tree)
+ return -1;
+ age = tree->maxage;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ leaf_age = ages[LEAF_GEN(leaf)];
+ if (leaf_age <= tree->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(struct tree *tree, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (*s) {
+ if (!(leaf = utf8lookup(tree, s)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(struct tree *tree, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!tree)
+ return -1;
+ while (len && *s) {
+ if (!(leaf = utf8nlookup(tree, s, len)))
+ return -1;
+ if (ages[LEAF_GEN(leaf)] > tree->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ struct tree *tree;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+ unsigned int unichar;
+};
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : string.
+ * len : length of s.
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, struct tree *tree, const char *s,
+ size_t len)
+{
+ if (!tree)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->tree = tree;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->unichar = 0;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+
+/*
+ * Set up an utf8cursor for use by utf8byte().
+ *
+ * s : NUL-terminated string.
+ * u8c : pointer to cursor.
+ * tree : struct tree to use for normalization.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, struct tree *tree, const char *s)
+{
+ return utf8ncursor(u8c, tree, s, (unsigned int)-1);
+}
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->tree, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ /* Characters that are too new have CCC 0. */
+ if (ages[LEAF_GEN(leaf)] > u8c->tree->maxage) {
+ ccc = STOPPER;
+ } else if ((ccc = LEAF_CCC(leaf)) == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->tree, u8c->s);
+ ccc = LEAF_CCC(leaf);
+ }
+ u8c->unichar = utf8decode(u8c->s);
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ assert(u8c->ccc == STOPPER);
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+
+/* ------------------------------------------------------------------ */
+
+static int normalize_line(struct tree *tree)
+{
+ char *s;
+ char *t;
+ int c;
+ struct utf8cursor u8c;
+
+ /* First test: null-terminated string. */
+ s = buf2;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ /* Second test: length-limited string. */
+ s = buf2;
+ /* Replace NUL with a value that will cause an error if seen. */
+ s[strlen(s) + 1] = -1;
+ t = buf3;
+ if (utf8cursor(&u8c, tree, s))
+ return -1;
+ while ((c = utf8byte(&u8c)) > 0)
+ if (c != (unsigned char)*t++)
+ return -1;
+ if (c < 0)
+ return -1;
+ if (*t != 0)
+ return -1;
+
+ return 0;
+}
+
+static void normalization_test(void)
+{
+ FILE *file;
+ unsigned int unichar;
+ struct unicode_data *data;
+ char *s;
+ char *t;
+ int ret;
+ int ignorables;
+ int tests = 0;
+ int failures = 0;
+
+ if (verbose > 0)
+ printf("Parsing %s\n", test_name);
+ /* Step one, read data from file. */
+ file = fopen(test_name, "r");
+ if (!file)
+ open_fail(test_name, errno);
+
+ while (fgets(line, LINESIZE, file)) {
+ ret = sscanf(line, "%[^;];%*[^;];%*[^;];%*[^;];%[^;];",
+ buf0, buf1);
+ if (ret != 2 || *line == '#')
+ continue;
+ s = buf0;
+ t = buf2;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ ignorables = 0;
+ s = buf1;
+ t = buf3;
+ while (*s) {
+ unichar = strtoul(s, &s, 16);
+ data = &unicode_data[unichar];
+ if (data->utf8nfkdi && !*data->utf8nfkdi)
+ ignorables = 1;
+ else
+ t += utf8encode(t, unichar);
+ }
+ *t = '\0';
+
+ tests++;
+ if (normalize_line(nfkdi_tree) < 0) {
+ printf("Line %s -> %s", buf0, buf1);
+ if (ignorables)
+ printf(" (ignorables removed)");
+ printf(" failure\n");
+ failures++;
+ }
+ }
+ fclose(file);
+ if (verbose > 0)
+ printf("Ran %d tests with %d failures\n", tests, failures);
+ if (failures)
+ file_fail(test_name);
+}
+
+/* ------------------------------------------------------------------ */
+
+static void write_file(void)
+{
+ FILE *file;
+ int i;
+ int j;
+ int t;
+ int gen;
+
+ if (verbose > 0)
+ printf("Writing %s\n", utf8_name);
+ file = fopen(utf8_name, "w");
+ if (!file)
+ open_fail(utf8_name, errno);
+
+ fprintf(file, "/* This file is generated code, do not edit. */\n");
+ fprintf(file, "#ifndef __INCLUDED_FROM_UTF8NORM_C__\n");
+ fprintf(file, "#error Only nls_utf8-norm.c should include this file.\n");
+ fprintf(file, "#endif\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8vers = %#x;\n",
+ unicode_maxage);
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned int utf8agetab[] = {\n");
+ for (i = 0; i != ages_count; i++)
+ fprintf(file, "\t%#x%s\n", ages[i],
+ ages[i] == unicode_maxage ? "" : ",");
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdicfdata[] = {\n");
+ t = 0;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const struct utf8data utf8nfkdidata[] = {\n");
+ t = 1;
+ for (gen = 0; gen < ages_count; gen++) {
+ fprintf(file, "\t{ %#x, %d }%s\n",
+ ages[gen], trees[t].index,
+ ages[gen] == unicode_maxage ? "" : ",");
+ if (trees[t].maxage == ages[gen])
+ t += 2;
+ }
+ fprintf(file, "};\n");
+ fprintf(file, "\n");
+ fprintf(file, "static const unsigned char utf8data[%zd] = {\n",
+ utf8data_size);
+ t = 0;
+ for (i = 0; i != utf8data_size; i += 16) {
+ if (i == trees[t].index) {
+ fprintf(file, "\t/* %s_%x */\n",
+ trees[t].type, trees[t].maxage);
+ if (t < trees_count-1)
+ t++;
+ }
+ fprintf(file, "\t");
+ for (j = i; j != i + 16; j++)
+ fprintf(file, "0x%.2x%s", utf8data[j],
+ (j < utf8data_size -1 ? "," : ""));
+ fprintf(file, "\n");
+ }
+ fprintf(file, "};\n");
+ fclose(file);
+}
+
+/* ------------------------------------------------------------------ */
+
+int main(int argc, char *argv[])
+{
+ unsigned int unichar;
+ int opt;
+
+ argv0 = argv[0];
+
+ while ((opt = getopt(argc, argv, "a:c:d:f:hn:o:p:t:v")) != -1) {
+ switch (opt) {
+ case 'a':
+ age_name = optarg;
+ break;
+ case 'c':
+ ccc_name = optarg;
+ break;
+ case 'd':
+ data_name = optarg;
+ break;
+ case 'f':
+ fold_name = optarg;
+ break;
+ case 'n':
+ norm_name = optarg;
+ break;
+ case 'o':
+ utf8_name = optarg;
+ break;
+ case 'p':
+ prop_name = optarg;
+ break;
+ case 't':
+ test_name = optarg;
+ break;
+ case 'v':
+ verbose++;
+ break;
+ case 'h':
+ help();
+ exit(0);
+ default:
+ usage();
+ }
+ }
+
+ if (verbose > 1)
+ help();
+ for (unichar = 0; unichar != 0x110000; unichar++)
+ unicode_data[unichar].code = unichar;
+ age_init();
+ ccc_init();
+ nfkdi_init();
+ nfkdicf_init();
+ ignore_init();
+ corrections_init();
+ hangul_decompose();
+ nfkdi_decompose();
+ nfkdicf_decompose();
+ utf8_init();
+ trees_init();
+ trees_populate();
+ trees_reduce();
+ trees_verify();
+ /* Prevent "unused function" warning. */
+ (void)lookup(nfkdi_tree, " ");
+ if (verbose > 2)
+ tree_walk(nfkdi_tree);
+ if (verbose > 2)
+ tree_walk(nfkdicf_tree);
+ normalization_test();
+ write_file();
+
+ return 0;
+}
--
2.20.0.rc2
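
Note for readers of the patch above: the Hangul arithmetic described in the
comment block before hangul_decompose() can be checked in isolation. The
following is a minimal sketch, not code from the series; the helper name
hangul_full_decompose() and the enum constant names are assumptions made for
illustration, while the constants themselves are the ones listed in the
comment (SBase, LBase, VBase, TBase, VCount, TCount).

```c
#include <stddef.h>

/* Constants from the Hangul comment block in the patch. */
enum {
	SBASE = 0xAC00, LBASE = 0x1100, VBASE = 0x1161, TBASE = 0x11A7,
	VCOUNT = 21, TCOUNT = 28,
	NCOUNT = VCOUNT * TCOUNT,	/* 588 */
	SCOUNT = 19 * NCOUNT		/* 11172 = LCount * NCount */
};

/*
 * Hypothetical helper (not part of the patch): write the full
 * decomposition of syllable s into out[] and return the number of
 * jamo written (2 or 3), or 0 if s is not a precomposed syllable.
 */
static size_t hangul_full_decompose(unsigned int s, unsigned int out[3])
{
	unsigned int si, ti;
	size_t n = 0;

	if (s < SBASE || s >= SBASE + SCOUNT)
		return 0;
	si = s - SBASE;
	ti = si % TCOUNT;
	out[n++] = LBASE + si / NCOUNT;			/* leading consonant */
	out[n++] = VBASE + (si % NCOUNT) / TCOUNT;	/* vowel */
	if (ti)
		out[n++] = TBASE + ti;			/* optional trailing consonant */
	return n;
}
```

Calling it from a small main() with 0xAC00 should yield the two jamo 0x1100
and 0x1161, and 0xD7A3 should yield 0x1112, 0x1175 and 0x11C2, matching the
first and last syllables named in the comment.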

2018-12-06 22:05:48

by Gabriel Krisman Bertazi

Subject: [PATCH v4 18/23] nls: utf8: Introduce test module for normalized utf8 implementation

From: Gabriel Krisman Bertazi <[email protected]>

This implements an in-kernel sanity test module for the utf8
normalization core. At probe time, it runs basic sequences through
the utf8n core to identify problems with equivalent sequences and
with the normalization/casefold code. This is meant to be useful for
regression testing when adding support for a new version of Unicode to
Linux.

Changes since RFC v2:
- Merge with NLS

Changes since RFC v1:
- Include comparison tests for matching strings with different lengths.
- Include tests for characters included in unicode 8.0.0, 9.0.0 and 10.0.0.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/Kconfig | 5 +
fs/nls/Makefile | 1 +
fs/nls/nls_utf8-selftest.c | 316 +++++++++++++++++++++++++++++++++++++
3 files changed, 322 insertions(+)
create mode 100644 fs/nls/nls_utf8-selftest.c

diff --git a/fs/nls/Kconfig b/fs/nls/Kconfig
index 7cb2848da608..b9c9b663d0ab 100644
--- a/fs/nls/Kconfig
+++ b/fs/nls/Kconfig
@@ -626,4 +626,9 @@ config NLS_UTF8_NORMALIZATION
Say Y here to enable utf8 NFKD normalization and casefolding
support.

+config NLS_UTF8_NORMALIZATION_SELFTEST
+ tristate "Test UTF-8 normalization support"
+ depends on NLS_UTF8
+ default n
+
endif # NLS
diff --git a/fs/nls/Makefile b/fs/nls/Makefile
index bd13c1a90767..e88006d2f3ac 100644
--- a/fs/nls/Makefile
+++ b/fs/nls/Makefile
@@ -47,6 +47,7 @@ obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o
obj-$(CONFIG_NLS_UTF8) += nls_utf8.o
nls_utf8-y += nls_utf8-core.o
nls_utf8-$(CONFIG_NLS_UTF8_NORMALIZATION) += nls_utf8-norm.o
+obj-$(CONFIG_NLS_UTF8_NORMALIZATION_SELFTEST) += nls_utf8-selftest.o

obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o
obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o
diff --git a/fs/nls/nls_utf8-selftest.c b/fs/nls/nls_utf8-selftest.c
new file mode 100644
index 000000000000..47e4f24a3f44
--- /dev/null
+++ b/fs/nls/nls_utf8-selftest.c
@@ -0,0 +1,316 @@
+/*
+ * Kernel module for testing utf-8 support.
+ *
+ * Copyright 2017 Collabora Ltd.
+ *
+ * This software is licensed under the terms of the GNU General Public
+ * License version 2, as published by the Free Software Foundation, and
+ * may be copied, distributed, and modified under those terms.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/nls.h>
+
+#include "utf8n.h"
+
+unsigned int failed_tests;
+unsigned int total_tests;
+
+/* Tests will be based on this version. */
+#define latest_maj 11
+#define latest_min 0
+#define latest_rev 0
+
+#define _test(cond, func, line, fmt, ...) do { \
+ total_tests++; \
+ if (!(cond)) { \
+ failed_tests++; \
+ pr_err("test %s:%d Failed: %s%s", \
+ func, line, #cond, (fmt?":":".")); \
+ if (fmt) \
+ pr_err(fmt, ##__VA_ARGS__); \
+ } \
+ } while (0)
+#define test_f(cond, fmt, ...) _test(cond, __func__, __LINE__, fmt, ##__VA_ARGS__)
+#define test(cond) _test(cond, __func__, __LINE__, "")
+
+static const struct {
+ /* UTF-8 strings in this vector _must_ be NUL-terminated. */
+ unsigned char str[10];
+ unsigned char dec[10];
+} nfkdi_test_data[] = {
+ /* Trivial sequence */
+ {
+ /* "ABba" decomposes to itself */
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00},
+ .dec = {0x41, 0x42, 0x62, 0x61, 0x00}
+ },
+ /* Simple equivalent sequences */
+ {
+ /* 'VULGAR FRACTION ONE QUARTER' decomposes to
+ 'NUMBER 1' + 'FRACTION SLASH' + 'NUMBER 4' */
+ .str = {0xc2, 0xbc, 0x00},
+ .dec = {0x31, 0xe2, 0x81, 0x84, 0x34, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' decomposes to
+ 'LETTER A' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0x00},
+ .dec = {0x61, 0xcc, 0x88, 0x00},
+ },
+ {
+ /* 'LATIN SMALL LETTER LJ' decomposes to
+ 'LETTER L' + 'LETTER J' */
+ .str = {0xC7, 0x89, 0x00},
+ .dec = {0x6c, 0x6a, 0x00},
+ },
+ {
+ /* GREEK ANO TELEIA decomposes to MIDDLE DOT */
+ .str = {0xCE, 0x87, 0x00},
+ .dec = {0xC2, 0xB7, 0x00}
+ },
+ /* Canonical ordering */
+ {
+ /* A + 'COMBINING ACUTE ACCENT' + 'COMBINING OGONEK' decomposes
+ to A + 'COMBINING OGONEK' + 'COMBINING ACUTE ACCENT' */
+ .str = {0x41, 0xcc, 0x81, 0xcc, 0xa8, 0x0},
+ .dec = {0x41, 0xcc, 0xa8, 0xcc, 0x81, 0x0},
+ },
+ {
+ /* 'LATIN SMALL LETTER A WITH DIAERESIS' + 'COMBINING OGONEK'
+ decomposes to
+ 'LETTER A' + 'COMBINING OGONEK' + 'COMBINING DIAERESIS' */
+ .str = {0xc3, 0xa4, 0xCC, 0xA8, 0x00},
+
+ .dec = {0x61, 0xCC, 0xA8, 0xcc, 0x88, 0x00},
+ },
+
+};
+
+static const struct {
+ /* UTF-8 strings in this vector _must_ be NUL-terminated. */
+ unsigned char str[30];
+ unsigned char ncf[30];
+} nfkdicf_test_data[] = {
+ /* Trivial sequences */
+ {
+ /* "ABba" folds to lowercase */
+ .str = {0x41, 0x42, 0x62, 0x61, 0x00},
+ .ncf = {0x61, 0x62, 0x62, 0x61, 0x00},
+ },
+ {
+ /* All ASCII folds to lower-case */
+ .str = "ABCDEFGHIJKLMNOPRSTUVWXYZ0.1",
+ .ncf = "abcdefghijklmnoprstuvwxyz0.1",
+ },
+ {
+ /* LATIN SMALL LETTER SHARP S folds to
+ LATIN SMALL LETTER S + LATIN SMALL LETTER S */
+ .str = {0xc3, 0x9f, 0x00},
+ .ncf = {0x73, 0x73, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER A WITH RING ABOVE folds to
+ LATIN SMALL LETTER A + COMBINING RING ABOVE */
+ .str = {0xC3, 0x85, 0x00},
+ .ncf = {0x61, 0xcc, 0x8a, 0x00},
+ },
+ /* Introduced by Unicode 8.0.0. */
+ /* Cherokee letters are interesting test-cases because they fold
+ to upper-case. Before 8.0.0, Cherokee lowercase were
+ undefined, thus, the folding from LC is not stable between
+ 7.0.0 -> 8.0.0, but it is from UC. */
+ {
+ /* CHEROKEE SMALL LETTER A folds to CHEROKEE LETTER A */
+ .str = {0xea, 0xad, 0xb0, 0x00},
+ .ncf = {0xe1, 0x8e, 0xa0, 0x00},
+ },
+ {
+ /* CHEROKEE SMALL LETTER YE folds to CHEROKEE LETTER YE */
+ .str = {0xe1, 0x8f, 0xb8, 0x00},
+ .ncf = {0xe1, 0x8f, 0xb0, 0x00},
+ },
+ {
+ /* OLD HUNGARIAN CAPITAL LETTER AMB folds to
+ OLD HUNGARIAN SMALL LETTER AMB */
+ .str = {0xf0, 0x90, 0xb2, 0x83, 0x00},
+ .ncf = {0xf0, 0x90, 0xb3, 0x83, 0x00},
+ },
+ /* Introduced by Unicode 9.0.0. */
+ {
+ /* OSAGE CAPITAL LETTER CHA folds to
+ OSAGE SMALL LETTER CHA */
+ .str = {0xf0, 0x90, 0x92, 0xb5, 0x00},
+ .ncf = {0xf0, 0x90, 0x93, 0x9d, 0x00},
+ },
+ {
+ /* LATIN CAPITAL LETTER SMALL CAPITAL I folds to
+ LATIN LETTER SMALL CAPITAL I */
+ .str = {0xea, 0x9e, 0xae, 0x00},
+ .ncf = {0xc9, 0xaa, 0x00},
+ },
+ /* Introduced by Unicode 11.0.0. */
+ {
+ /* GEORGIAN MTAVRULI CAPITAL LETTER AN folds to
+ GEORGIAN LETTER AN */
+ .str = {0xe1, 0xb2, 0x90, 0x00},
+ .ncf = {0xe1, 0x83, 0x90, 0x00},
+ }
+};
+
+static void check_utf8_nfkdi(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfkdi(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) {
+ int len = strlen(nfkdi_test_data[i].str);
+ int nlen = strlen(nfkdi_test_data[i].dec);
+ int j = 0;
+ int c; /* utf8byte() returns -1 on error, so keep the sign. */
+
+ test((utf8len(data, nfkdi_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdi_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdi_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdi_test_data[i].dec[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfkdi_test_data[i].dec[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_nfkdicf(void)
+{
+ int i;
+ struct utf8cursor u8c;
+ const struct utf8data *data;
+
+ data = utf8nfkdicf(UNICODE_AGE(latest_maj, latest_min, latest_rev));
+ if (!data) {
+ pr_err("%s: Unable to load utf8-%d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) {
+ int len = strlen(nfkdicf_test_data[i].str);
+ int nlen = strlen(nfkdicf_test_data[i].ncf);
+ int j = 0;
+ int c; /* utf8byte() returns -1 on error, so keep the sign. */
+
+ test((utf8len(data, nfkdicf_test_data[i].str) == nlen));
+ test((utf8nlen(data, nfkdicf_test_data[i].str, len) == nlen));
+
+ if (utf8cursor(&u8c, data, nfkdicf_test_data[i].str) < 0)
+ pr_err("can't create cursor\n");
+
+ while ((c = utf8byte(&u8c)) > 0) {
+ test_f((c == nfkdicf_test_data[i].ncf[j]),
+ "Unexpected byte 0x%x should be 0x%x\n",
+ c, nfkdicf_test_data[i].ncf[j]);
+ j++;
+ }
+
+ test((j == nlen));
+ }
+}
+
+static void check_utf8_comparisons(void)
+{
+ int i;
+ struct nls_table *table = load_nls_version("utf8", "11.0.0",
+ NLS_UTF8_NORMALIZATION_TYPE_NFKD |
+ NLS_UTF8_CASEFOLD_TYPE_NFKDCF);
+
+ if (IS_ERR(table)) {
+ pr_err("%s: Unable to load utf8 %d.%d.%d. Skipping.\n",
+ __func__, latest_maj, latest_min, latest_rev);
+ return;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(nfkdi_test_data); i++) {
+ const char *s1 = nfkdi_test_data[i].str;
+ const char *s2 = nfkdi_test_data[i].dec;
+
+ test_f(!nls_strncmp(table, s1, strlen(s1), s2, strlen(s2)),
+ "%s %s comparison mismatch\n", s1, s2);
+ }
+ for (i = 0; i < ARRAY_SIZE(nfkdicf_test_data); i++) {
+ const char *s1 = nfkdicf_test_data[i].str;
+ const char *s2 = nfkdicf_test_data[i].ncf;
+
+ test_f(!nls_strncasecmp(table, s1, strlen(s1),
+ s2, strlen(s2)),
+ "%s %s comparison mismatch\n", s1, s2);
+ }
+
+ unload_nls(table);
+}
+
+static void check_supported_versions(void)
+{
+ /* Unicode 7.0.0 should be supported. */
+ test(utf8version_is_supported(7, 0, 0));
+
+ /* Unicode 9.0.0 should be supported. */
+ test(utf8version_is_supported(9, 0, 0));
+
+ /* Unicode 1x.0.0 (the latest version) should be supported. */
+ test(utf8version_is_supported(latest_maj, latest_min, latest_rev));
+
+ /* Next versions don't exist. */
+ test(!utf8version_is_supported(12, 0, 0));
+ test(!utf8version_is_supported(0, 0, 0));
+ test(!utf8version_is_supported(-1, -1, -1));
+}
+
+static int __init init_test_ucd(void)
+{
+ failed_tests = 0;
+ total_tests = 0;
+
+ check_supported_versions();
+ check_utf8_nfkdi();
+ check_utf8_nfkdicf();
+ check_utf8_comparisons();
+
+ if (!failed_tests)
+ pr_info("All %u tests passed\n", total_tests);
+ else
+ pr_err("%u out of %u tests failed\n", failed_tests,
+ total_tests);
+ return 0;
+}
+
+static void __exit exit_test_ucd(void)
+{
+}
+
+module_init(init_test_ucd);
+module_exit(exit_test_ucd);
+
+MODULE_AUTHOR("Gabriel Krisman Bertazi <[email protected]>");
+MODULE_LICENSE("GPL");
--
2.20.0.rc2
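
Editorial note on check_supported_versions() above: the test expects
known versions to be accepted and future, zero, or negative versions to
be rejected. The following is a minimal userspace sketch of one way such
a check could behave. It is not the series' implementation:
DEMO_AGE(), demo_supported[] and demo_version_is_supported() are
hypothetical names, and the table lists only the versions the test
mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical packing of a Unicode version triple into one integer. */
#define DEMO_AGE(maj, min, rev) \
	(((unsigned int)(maj) << 16) | ((unsigned int)(min) << 8) | \
	 (unsigned int)(rev))

/* Hypothetical table; only the versions named by the test are listed. */
static const unsigned int demo_supported[] = {
	DEMO_AGE(7, 0, 0),
	DEMO_AGE(9, 0, 0),
	DEMO_AGE(11, 0, 0),
};

bool demo_version_is_supported(int maj, int min, int rev)
{
	size_t i;

	/* Reject fields that would not fit in the packed representation. */
	if (maj < 0 || maj > 255 || min < 0 || min > 255 ||
	    rev < 0 || rev > 255)
		return false;

	/* Accept only versions that appear in the table. */
	for (i = 0; i < sizeof(demo_supported) / sizeof(demo_supported[0]); i++)
		if (demo_supported[i] == DEMO_AGE(maj, min, rev))
			return true;

	return false;
}

/* Mirrors the shape of the checks in check_supported_versions(). */
void demo_selfcheck(void)
{
	assert(demo_version_is_supported(7, 0, 0));
	assert(!demo_version_is_supported(12, 0, 0));
	assert(!demo_version_is_supported(-1, -1, -1));
}
```

Rejecting negative fields before packing keeps (-1, -1, -1) from turning
into a large unsigned value that could match an entry by accident.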

2018-12-06 22:06:06

by Gabriel Krisman Bertazi

Subject: [PATCH v4 23/23] docs: ext4.rst: Document encoding and case-insensitive

From: Gabriel Krisman Bertazi <[email protected]>

Introduce the encoding-awareness and case-insensitive features of ext4
to system administrators. Explain the minimum set of design decisions
that matter to sysadmins enabling these features.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
Documentation/admin-guide/ext4.rst | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index e506d3dae510..f42c682acecc 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -91,10 +91,39 @@ Currently Available
* large block (up to pagesize) support
* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
the ordering)
+* Encoding aware file names
+* Case insensitive file name lookups

[1] Filesystems with a block size of 1k may see a limit imposed by the
directory hash tree having a maximum depth of two.

+Encoding-aware file names and case-insensitive lookups
+======================================================
+
+Ext4 optionally supports filesystem-wide charset knowledge when handling
+file names, which allows the user to perform file system lookups using
+charset equivalent versions of the same file name, and optionally ensure
+that no invalid names are held by the filesystem. Charset encoding
+awareness is also essential for performing case-insensitive lookups,
+because it is what defines the casefold operation.
+
+The case-insensitive file name lookup feature is supported at a finer
+granularity, on a per-directory basis, allowing the user to mix
+case-insensitive and case-sensitive directories in the same filesystem.
+It is enabled by flipping a file attribute on an empty directory. For
+the reason stated above, the filesystem must have encoding enabled to
+use this feature.
+
+When we change from treating filenames as opaque byte sequences to
+treating them as encoded strings, we need to address what happens when a
+program tries to create a file with an invalid name. The Native Language
+Support (NLS) subsystem within the kernel leaves that decision to the
+filesystem, which selects its preferred behavior by enabling or disabling
+strict mode in NLS. When Ext4 encounters one of those strings, it falls
+back to treating the entire string as an opaque byte sequence, which
+still allows the user to operate on that file, but case-insensitive and
+equivalent-sequence lookups won't work for it.
+
Options
=======

--
2.20.0.rc2
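
Editorial note on the fallback described in the documentation hunk
above: names that are invalid in the filesystem encoding are compared as
opaque bytes, so they stay reachable by their exact name but never match
case-insensitively. The sketch below illustrates that rule in userspace
under a deliberately simplified assumption: the "encoding" is 7-bit
ASCII and casefolding is tolower(). It does not implement the UTF-8
normalization used by the series, and demo_is_valid() and
demo_names_match() are hypothetical helpers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* In this toy encoding a name is valid if every byte is 7-bit ASCII. */
static bool demo_is_valid(const unsigned char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] >= 0x80)
			return false;
	return true;
}

/* Returns true when both names would resolve to the same entry. */
bool demo_names_match(const char *a, size_t alen,
		      const char *b, size_t blen, bool casefold)
{
	const unsigned char *ua = (const unsigned char *)a;
	const unsigned char *ub = (const unsigned char *)b;
	size_t i;

	/* ASCII casefolding preserves length; real normalization may not. */
	if (alen != blen)
		return false;

	/*
	 * Case-sensitive directory, or at least one invalid name:
	 * fall back to an exact, opaque byte comparison.
	 */
	if (!casefold || !demo_is_valid(ua, alen) || !demo_is_valid(ub, blen))
		return memcmp(a, b, alen) == 0;

	for (i = 0; i < alen; i++)
		if (tolower(ua[i]) != tolower(ub[i]))
			return false;
	return true;
}

void demo_selfcheck(void)
{
	/* Valid names match regardless of case in a casefolded directory. */
	assert(demo_names_match("README", 6, "readme", 6, true));
	/* An invalid byte (0xe9) disables casefolding for that name... */
	assert(!demo_names_match("caf\xe9", 4, "CAF\xe9", 4, true));
	/* ...but the exact byte sequence still matches itself. */
	assert(demo_names_match("caf\xe9", 4, "caf\xe9", 4, true));
}
```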

2018-12-09 17:41:34

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> Whether or not case-folding is being done is per-directory (it's a
> flag on the directory set by chattr) . What encoding is supported
> (and we only will support two, ASCII and UTF-8) is per-file system. I
> personally believe it's insane to try to encode a large number of
> encodings, like big5, or iso-8859-1, etc. on a per-directory basis.
> Either don't do encodings at all, or use utf-8. Period. I believe
> you made a similar request for git metadata, no? :-)

Absolutely.

But if you only support ascii or utf-8, then why are you messing with
the nls part? That makes no sense.

You can't have it both ways.

Either you have a horrible fundamental design mistake that has
different per-filesystem locales, or you don't.

If you don't, you shouldn't be touching any of the nls code.

Whatever unicode tables you use for case folding shouldn't be in the nls code.

Linus

2018-12-06 22:04:55

by Gabriel Krisman Bertazi

Subject: [PATCH v4 03/23] nls: Wrap charset hooks in ops structure

From: Gabriel Krisman Bertazi <[email protected]>

This is done in preparation for splitting the nls_table structure, in
order to support multiple versions of the same encoding. By placing all
the operations together, we can allow different sets of operations for
the same encoding, depending on the version. For now, there is no
behavior change intended, but this simplifies the following patches.

With the exception of the declaration of the structure, this patch was
generated by the following Coccinelle script:

<smpl>

@nlstable@
identifier p;
expression uni2char_fn;
expression char2uni_fn;

@@
static struct nls_table p = {
- .char2uni = char2uni_fn,
- .uni2char = uni2char_fn,
+ .ops = &charset_ops,
};

@createops@
identifier nlstable.p;
expression nlstable.uni2char_fn;
expression nlstable.char2uni_fn;
@@

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char_fn,
+ .char2uni = char2uni_fn,
+};
+
static struct nls_table p = {};

@@
struct nls_table *c;
@@
(
- c->uni2char
+ c->ops->uni2char
|
- c->char2uni
+ c->ops->char2uni
)

</smpl>

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/mac-celtic.c | 8 ++++++--
fs/nls/mac-centeuro.c | 8 ++++++--
fs/nls/mac-croatian.c | 8 ++++++--
fs/nls/mac-cyrillic.c | 8 ++++++--
fs/nls/mac-gaelic.c | 8 ++++++--
fs/nls/mac-greek.c | 8 ++++++--
fs/nls/mac-iceland.c | 8 ++++++--
fs/nls/mac-inuit.c | 8 ++++++--
fs/nls/mac-roman.c | 8 ++++++--
fs/nls/mac-romanian.c | 8 ++++++--
fs/nls/mac-turkish.c | 8 ++++++--
fs/nls/nls_ascii.c | 8 ++++++--
fs/nls/nls_base.c | 8 ++++++--
fs/nls/nls_cp1250.c | 8 ++++++--
fs/nls/nls_cp1251.c | 8 ++++++--
fs/nls/nls_cp1255.c | 8 ++++++--
fs/nls/nls_cp437.c | 8 ++++++--
fs/nls/nls_cp737.c | 8 ++++++--
fs/nls/nls_cp775.c | 8 ++++++--
fs/nls/nls_cp850.c | 8 ++++++--
fs/nls/nls_cp852.c | 8 ++++++--
fs/nls/nls_cp855.c | 8 ++++++--
fs/nls/nls_cp857.c | 8 ++++++--
fs/nls/nls_cp860.c | 8 ++++++--
fs/nls/nls_cp861.c | 8 ++++++--
fs/nls/nls_cp862.c | 8 ++++++--
fs/nls/nls_cp863.c | 8 ++++++--
fs/nls/nls_cp864.c | 8 ++++++--
fs/nls/nls_cp865.c | 8 ++++++--
fs/nls/nls_cp866.c | 8 ++++++--
fs/nls/nls_cp869.c | 8 ++++++--
fs/nls/nls_cp874.c | 8 ++++++--
fs/nls/nls_cp932.c | 8 ++++++--
fs/nls/nls_cp936.c | 8 ++++++--
fs/nls/nls_cp949.c | 8 ++++++--
fs/nls/nls_cp950.c | 8 ++++++--
fs/nls/nls_euc-jp.c | 8 ++++++--
fs/nls/nls_iso8859-1.c | 8 ++++++--
fs/nls/nls_iso8859-13.c | 8 ++++++--
fs/nls/nls_iso8859-14.c | 8 ++++++--
fs/nls/nls_iso8859-15.c | 8 ++++++--
fs/nls/nls_iso8859-2.c | 8 ++++++--
fs/nls/nls_iso8859-3.c | 8 ++++++--
fs/nls/nls_iso8859-4.c | 8 ++++++--
fs/nls/nls_iso8859-5.c | 8 ++++++--
fs/nls/nls_iso8859-6.c | 8 ++++++--
fs/nls/nls_iso8859-7.c | 8 ++++++--
fs/nls/nls_iso8859-9.c | 8 ++++++--
fs/nls/nls_koi8-r.c | 8 ++++++--
fs/nls/nls_koi8-ru.c | 8 ++++++--
fs/nls/nls_koi8-u.c | 8 ++++++--
fs/nls/nls_utf8.c | 8 ++++++--
fs/udf/unicode.c | 4 ++--
include/linux/nls.h | 16 ++++++++++------
54 files changed, 324 insertions(+), 112 deletions(-)
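
Editorial note: the pattern that the Coccinelle script applies to every
charset can be sketched in userspace as below. The conversion hooks move
from the table into a shared const ops structure, and callers reach them
through one extra ->ops indirection. All demo_* names are hypothetical
stand-ins; the actual structure declaration is in the include/linux/nls.h
hunk of this patch, which is not reproduced here, and the toy charset
below handles 7-bit ASCII only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Stand-in for the ops structure introduced by this patch. */
struct demo_nls_ops {
	int (*uni2char)(wchar_t uni, unsigned char *out, int boundlen);
	int (*char2uni)(const unsigned char *rawstring, int boundlen,
			wchar_t *uni);
};

/* Stand-in for the table: the hooks are now reached through ->ops. */
struct demo_nls_table {
	const char *charset;
	const struct demo_nls_ops *ops;
};

static int demo_uni2char(wchar_t uni, unsigned char *out, int boundlen)
{
	if (boundlen <= 0)
		return -ENAMETOOLONG;
	if ((unsigned long)uni > 0x7f)
		return -EINVAL;
	out[0] = (unsigned char)uni;
	return 1;
}

static int demo_char2uni(const unsigned char *rawstring, int boundlen,
			 wchar_t *uni)
{
	if (boundlen <= 0 || rawstring[0] > 0x7f)
		return -EINVAL;
	*uni = rawstring[0];
	return 1;
}

static const struct demo_nls_ops demo_charset_ops = {
	.uni2char = demo_uni2char,
	.char2uni = demo_char2uni,
};

struct demo_nls_table demo_table = {
	.charset = "demo-ascii",
	.ops = &demo_charset_ops,
};

/* A converted call site: t->ops->char2uni instead of t->char2uni. */
int demo_roundtrip(const struct demo_nls_table *t, unsigned char in,
		   unsigned char *out)
{
	wchar_t uni;
	int ret = t->ops->char2uni(&in, 1, &uni);

	if (ret < 0)
		return ret;
	return t->ops->uni2char(uni, out, 1);
}

void demo_selfcheck(void)
{
	unsigned char out = 0;

	assert(demo_roundtrip(&demo_table, 'A', &out) == 1);
	assert(out == 'A');
}
```

Because the ops structure is const and separate from the table, several
tables (for example, different versions of one encoding) could point at
different ops without duplicating the rest of the table.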

diff --git a/fs/nls/mac-celtic.c b/fs/nls/mac-celtic.c
index 266c2d7d50bd..1b59b04f26f2 100644
--- a/fs/nls/mac-celtic.c
+++ b/fs/nls/mac-celtic.c
@@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macceltic",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-centeuro.c b/fs/nls/mac-centeuro.c
index 9789c6057551..d5b8f38f97b6 100644
--- a/fs/nls/mac-centeuro.c
+++ b/fs/nls/mac-centeuro.c
@@ -507,10 +507,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "maccenteuro",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-croatian.c b/fs/nls/mac-croatian.c
index bb19e7a07d43..32de6accd526 100644
--- a/fs/nls/mac-croatian.c
+++ b/fs/nls/mac-croatian.c
@@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "maccroatian",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-cyrillic.c b/fs/nls/mac-cyrillic.c
index 2a7dea36acba..34d5c1c05ff1 100644
--- a/fs/nls/mac-cyrillic.c
+++ b/fs/nls/mac-cyrillic.c
@@ -472,10 +472,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "maccyrillic",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-gaelic.c b/fs/nls/mac-gaelic.c
index 77b001653588..2aabf5213176 100644
--- a/fs/nls/mac-gaelic.c
+++ b/fs/nls/mac-gaelic.c
@@ -542,10 +542,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macgaelic",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-greek.c b/fs/nls/mac-greek.c
index 1eccf499e2eb..df62909ef57e 100644
--- a/fs/nls/mac-greek.c
+++ b/fs/nls/mac-greek.c
@@ -472,10 +472,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macgreek",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-iceland.c b/fs/nls/mac-iceland.c
index cbd0875c6d69..8daa68b995bc 100644
--- a/fs/nls/mac-iceland.c
+++ b/fs/nls/mac-iceland.c
@@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "maciceland",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-inuit.c b/fs/nls/mac-inuit.c
index fba8357aaf03..b0799693502a 100644
--- a/fs/nls/mac-inuit.c
+++ b/fs/nls/mac-inuit.c
@@ -507,10 +507,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macinuit",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-roman.c b/fs/nls/mac-roman.c
index b6a98a5208cd..ba358b864b05 100644
--- a/fs/nls/mac-roman.c
+++ b/fs/nls/mac-roman.c
@@ -612,10 +612,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macroman",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-romanian.c b/fs/nls/mac-romanian.c
index 25547f023638..7a8a7f9a0bbc 100644
--- a/fs/nls/mac-romanian.c
+++ b/fs/nls/mac-romanian.c
@@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macromanian",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/mac-turkish.c b/fs/nls/mac-turkish.c
index b5454bc7b7fa..eb3c5e53ec88 100644
--- a/fs/nls/mac-turkish.c
+++ b/fs/nls/mac-turkish.c
@@ -577,10 +577,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "macturkish",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_ascii.c b/fs/nls/nls_ascii.c
index a2620650d5e4..6bad3e779284 100644
--- a/fs/nls/nls_ascii.c
+++ b/fs/nls/nls_ascii.c
@@ -142,10 +142,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "ascii",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_base.c b/fs/nls/nls_base.c
index e5d083b6e2b2..0bb0acf6893f 100644
--- a/fs/nls/nls_base.c
+++ b/fs/nls/nls_base.c
@@ -520,10 +520,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table default_table = {
.charset = "default",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp1250.c b/fs/nls/nls_cp1250.c
index ace3e19d3407..08902e86fc8e 100644
--- a/fs/nls/nls_cp1250.c
+++ b/fs/nls/nls_cp1250.c
@@ -323,10 +323,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp1250",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp1251.c b/fs/nls/nls_cp1251.c
index 9273ddfd08a1..2bb88c8cc5bf 100644
--- a/fs/nls/nls_cp1251.c
+++ b/fs/nls/nls_cp1251.c
@@ -277,10 +277,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp1251",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp1255.c b/fs/nls/nls_cp1255.c
index 1caf5dfed85b..c6bf8d575c5b 100644
--- a/fs/nls/nls_cp1255.c
+++ b/fs/nls/nls_cp1255.c
@@ -358,11 +358,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp1255",
.alias = "iso8859-8",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp437.c b/fs/nls/nls_cp437.c
index 7ddb830da3fd..0f3f8bdbb62b 100644
--- a/fs/nls/nls_cp437.c
+++ b/fs/nls/nls_cp437.c
@@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp437",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp737.c b/fs/nls/nls_cp737.c
index c593f683a0cd..9383359ca25f 100644
--- a/fs/nls/nls_cp737.c
+++ b/fs/nls/nls_cp737.c
@@ -326,10 +326,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp737",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp775.c b/fs/nls/nls_cp775.c
index 554c863745f2..6c787b9079ed 100644
--- a/fs/nls/nls_cp775.c
+++ b/fs/nls/nls_cp775.c
@@ -295,10 +295,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp775",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp850.c b/fs/nls/nls_cp850.c
index 56cccd14b40b..50a57138a571 100644
--- a/fs/nls/nls_cp850.c
+++ b/fs/nls/nls_cp850.c
@@ -291,10 +291,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp850",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp852.c b/fs/nls/nls_cp852.c
index 7cdc05ac1d40..0cbb199f1cd5 100644
--- a/fs/nls/nls_cp852.c
+++ b/fs/nls/nls_cp852.c
@@ -313,10 +313,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp852",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp855.c b/fs/nls/nls_cp855.c
index 7426eea05663..530b77c86363 100644
--- a/fs/nls/nls_cp855.c
+++ b/fs/nls/nls_cp855.c
@@ -275,10 +275,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp855",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp857.c b/fs/nls/nls_cp857.c
index 098309733ebd..0db642ec6f45 100644
--- a/fs/nls/nls_cp857.c
+++ b/fs/nls/nls_cp857.c
@@ -277,10 +277,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp857",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp860.c b/fs/nls/nls_cp860.c
index 84224478e731..44a40dac26bd 100644
--- a/fs/nls/nls_cp860.c
+++ b/fs/nls/nls_cp860.c
@@ -340,10 +340,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp860",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp861.c b/fs/nls/nls_cp861.c
index dc873e4be092..50e08174fc48 100644
--- a/fs/nls/nls_cp861.c
+++ b/fs/nls/nls_cp861.c
@@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp861",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp862.c b/fs/nls/nls_cp862.c
index d5263e3c5566..3505f3437972 100644
--- a/fs/nls/nls_cp862.c
+++ b/fs/nls/nls_cp862.c
@@ -397,10 +397,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp862",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp863.c b/fs/nls/nls_cp863.c
index 051c9832e36a..e3489cdc0c04 100644
--- a/fs/nls/nls_cp863.c
+++ b/fs/nls/nls_cp863.c
@@ -357,10 +357,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp863",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp864.c b/fs/nls/nls_cp864.c
index 97eb1273b2f7..d4185bc7f1bf 100644
--- a/fs/nls/nls_cp864.c
+++ b/fs/nls/nls_cp864.c
@@ -383,10 +383,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp864",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp865.c b/fs/nls/nls_cp865.c
index 111214228525..9f468944e577 100644
--- a/fs/nls/nls_cp865.c
+++ b/fs/nls/nls_cp865.c
@@ -363,10 +363,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp865",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp866.c b/fs/nls/nls_cp866.c
index ffdcbc3fc38d..ee46fd5a76b1 100644
--- a/fs/nls/nls_cp866.c
+++ b/fs/nls/nls_cp866.c
@@ -281,10 +281,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp866",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp869.c b/fs/nls/nls_cp869.c
index 3b5a34589354..da29a4a53e1d 100644
--- a/fs/nls/nls_cp869.c
+++ b/fs/nls/nls_cp869.c
@@ -291,10 +291,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp869",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp874.c b/fs/nls/nls_cp874.c
index 8dfaa10710fa..642659b9ed89 100644
--- a/fs/nls/nls_cp874.c
+++ b/fs/nls/nls_cp874.c
@@ -249,11 +249,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp874",
.alias = "tis-620",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp932.c b/fs/nls/nls_cp932.c
index 67b7398e8483..3e7bdefdca90 100644
--- a/fs/nls/nls_cp932.c
+++ b/fs/nls/nls_cp932.c
@@ -7907,11 +7907,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return -EINVAL;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp932",
.alias = "sjis",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp936.c b/fs/nls/nls_cp936.c
index c96546cfec9f..b1fa2918992b 100644
--- a/fs/nls/nls_cp936.c
+++ b/fs/nls/nls_cp936.c
@@ -11085,11 +11085,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp936",
.alias = "gb2312",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp949.c b/fs/nls/nls_cp949.c
index 199171e97aa4..1d334095d86c 100644
--- a/fs/nls/nls_cp949.c
+++ b/fs/nls/nls_cp949.c
@@ -13920,11 +13920,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp949",
.alias = "euc-kr",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_cp950.c b/fs/nls/nls_cp950.c
index 8e1418708209..d936160a48f9 100644
--- a/fs/nls/nls_cp950.c
+++ b/fs/nls/nls_cp950.c
@@ -9456,11 +9456,15 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "cp950",
.alias = "big5",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c
index eec257545f04..0af73982738b 100644
--- a/fs/nls/nls_euc-jp.c
+++ b/fs/nls/nls_euc-jp.c
@@ -549,10 +549,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return euc_offset;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "euc-jp",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
};

static int __init init_nls_euc_jp(void)
diff --git a/fs/nls/nls_iso8859-1.c b/fs/nls/nls_iso8859-1.c
index 69ac020d43b1..6212b2925fa0 100644
--- a/fs/nls/nls_iso8859-1.c
+++ b/fs/nls/nls_iso8859-1.c
@@ -233,10 +233,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-1",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-13.c b/fs/nls/nls_iso8859-13.c
index afb3f8f275f0..8f0a23109207 100644
--- a/fs/nls/nls_iso8859-13.c
+++ b/fs/nls/nls_iso8859-13.c
@@ -261,10 +261,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-13",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-14.c b/fs/nls/nls_iso8859-14.c
index 046370f0b6f0..80ab77f37480 100644
--- a/fs/nls/nls_iso8859-14.c
+++ b/fs/nls/nls_iso8859-14.c
@@ -317,10 +317,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-14",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-15.c b/fs/nls/nls_iso8859-15.c
index 7e34a841a056..5c02f93e7b20 100644
--- a/fs/nls/nls_iso8859-15.c
+++ b/fs/nls/nls_iso8859-15.c
@@ -283,10 +283,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-15",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-2.c b/fs/nls/nls_iso8859-2.c
index 7dd571181741..97afc1233da1 100644
--- a/fs/nls/nls_iso8859-2.c
+++ b/fs/nls/nls_iso8859-2.c
@@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-2",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-3.c b/fs/nls/nls_iso8859-3.c
index 740b75ec4493..f835fcec3aae 100644
--- a/fs/nls/nls_iso8859-3.c
+++ b/fs/nls/nls_iso8859-3.c
@@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-3",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-4.c b/fs/nls/nls_iso8859-4.c
index 8826021e32f5..14acb68fb013 100644
--- a/fs/nls/nls_iso8859-4.c
+++ b/fs/nls/nls_iso8859-4.c
@@ -284,10 +284,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-4",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-5.c b/fs/nls/nls_iso8859-5.c
index 7c04057a1ad8..f559bbb25045 100644
--- a/fs/nls/nls_iso8859-5.c
+++ b/fs/nls/nls_iso8859-5.c
@@ -248,10 +248,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-5",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-6.c b/fs/nls/nls_iso8859-6.c
index d4a881400d74..e3d7e28363b8 100644
--- a/fs/nls/nls_iso8859-6.c
+++ b/fs/nls/nls_iso8859-6.c
@@ -239,10 +239,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-6",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-7.c b/fs/nls/nls_iso8859-7.c
index 37b75d825a75..49fd2b24e492 100644
--- a/fs/nls/nls_iso8859-7.c
+++ b/fs/nls/nls_iso8859-7.c
@@ -293,10 +293,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-7",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_iso8859-9.c b/fs/nls/nls_iso8859-9.c
index 557b98250d37..876696f89626 100644
--- a/fs/nls/nls_iso8859-9.c
+++ b/fs/nls/nls_iso8859-9.c
@@ -248,10 +248,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "iso8859-9",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_koi8-r.c b/fs/nls/nls_koi8-r.c
index 811f232fccfb..6a85211402a8 100644
--- a/fs/nls/nls_koi8-r.c
+++ b/fs/nls/nls_koi8-r.c
@@ -299,10 +299,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "koi8-r",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c
index 32781252110d..c4e382fd0f13 100644
--- a/fs/nls/nls_koi8-ru.c
+++ b/fs/nls/nls_koi8-ru.c
@@ -51,10 +51,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return n;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "koi8-ru",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
};

static int __init init_nls_koi8_ru(void)
diff --git a/fs/nls/nls_koi8-u.c b/fs/nls/nls_koi8-u.c
index 7e029e4c188a..5f91e9cdb165 100644
--- a/fs/nls/nls_koi8-u.c
+++ b/fs/nls/nls_koi8-u.c
@@ -306,10 +306,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return 1;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "koi8-u",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = charset2lower,
.charset2upper = charset2upper,
};
diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8.c
index afcfbc4a14db..6988fffd5cf6 100644
--- a/fs/nls/nls_utf8.c
+++ b/fs/nls/nls_utf8.c
@@ -40,10 +40,14 @@ static int char2uni(const unsigned char *rawstring, int boundlen, wchar_t *uni)
return n;
}

+static const struct nls_ops charset_ops = {
+ .uni2char = uni2char,
+ .char2uni = char2uni,
+};
+
static struct nls_table table = {
.charset = "utf8",
- .uni2char = uni2char,
- .char2uni = char2uni,
+ .ops = &charset_ops,
.charset2lower = identity, /* no conversion */
.charset2upper = identity,
};
diff --git a/fs/udf/unicode.c b/fs/udf/unicode.c
index 45234791fec2..f1a9625ade43 100644
--- a/fs/udf/unicode.c
+++ b/fs/udf/unicode.c
@@ -178,7 +178,7 @@ static int udf_name_from_CS0(struct super_block *sb,
}

if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP))
- conv_f = UDF_SB(sb)->s_nls_map->uni2char;
+ conv_f = UDF_SB(sb)->s_nls_map->ops->uni2char;
else
conv_f = NULL;

@@ -286,7 +286,7 @@ static int udf_name_to_CS0(struct super_block *sb,
return 0;

if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP))
- conv_f = UDF_SB(sb)->s_nls_map->char2uni;
+ conv_f = UDF_SB(sb)->s_nls_map->ops->char2uni;
else
conv_f = NULL;

diff --git a/include/linux/nls.h b/include/linux/nls.h
index cacbcd7d63e6..5d63fe6aa55e 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -22,12 +22,16 @@ typedef u16 wchar_t;
/* Arbitrary Unicode character */
typedef u32 unicode_t;

-struct nls_table {
- const char *charset;
- const char *alias;
+struct nls_ops {
int (*uni2char) (wchar_t uni, unsigned char *out, int boundlen);
int (*char2uni) (const unsigned char *rawstring, int boundlen,
wchar_t *uni);
+};
+
+struct nls_table {
+ const char *charset;
+ const char *alias;
+ const struct nls_ops *ops;
const unsigned char *charset2lower;
const unsigned char *charset2upper;
struct module *owner;
@@ -62,14 +66,14 @@ extern int utf16s_to_utf8s(const wchar_t *pwcs, int len,
static inline int nls_uni2char(const struct nls_table *table, wchar_t uni,
unsigned char *out, int boundlen)
{
- return table->uni2char(uni, out, boundlen);
+ return table->ops->uni2char(uni, out, boundlen);
}

static inline int nls_char2uni(const struct nls_table *table,
const unsigned char *rawstring,
int boundlen, wchar_t *uni)
{
- return table->char2uni(rawstring, boundlen, uni);
+ return table->ops->char2uni(rawstring, boundlen, uni);
}

static inline const char *nls_charset_name(const struct nls_table *table)
@@ -116,7 +120,7 @@ nls_nullsize(const struct nls_table *codepage)
int charlen;
char tmp[NLS_MAX_CHARSET_SIZE];

- charlen = codepage->uni2char(0, tmp, NLS_MAX_CHARSET_SIZE);
+ charlen = codepage->ops->uni2char(0, tmp, NLS_MAX_CHARSET_SIZE);

return charlen > 0 ? charlen : 1;
}
--
2.20.0.rc2
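
To illustrate the indirection this patch introduces, here is a minimal
userspace sketch (not kernel code). The struct layout and the wrapper
names mirror the diff above; `toy_wchar_t`, the `toy_*` conversion
functions and `toy_table` are hypothetical stand-ins, and the toy charset
simply maps code points 0x00-0xff to single bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef unsigned short toy_wchar_t;	/* stand-in for the kernel's u16 wchar_t */

/* Conversion hooks, grouped so that tables only carry one pointer. */
struct nls_ops {
	int (*uni2char)(toy_wchar_t uni, unsigned char *out, int boundlen);
	int (*char2uni)(const unsigned char *rawstring, int boundlen,
			toy_wchar_t *uni);
};

struct nls_table {
	const char *charset;
	const struct nls_ops *ops;
};

/* Toy conversion: code points 0x00-0xff become one byte each. */
static int toy_uni2char(toy_wchar_t uni, unsigned char *out, int boundlen)
{
	if (boundlen <= 0)
		return -ENAMETOOLONG;
	if (uni > 0xff)
		return -EINVAL;
	out[0] = (unsigned char)uni;
	return 1;
}

static int toy_char2uni(const unsigned char *rawstring, int boundlen,
			toy_wchar_t *uni)
{
	if (boundlen <= 0)
		return -EINVAL;
	*uni = rawstring[0];
	return 1;
}

static const struct nls_ops toy_ops = {
	.uni2char = toy_uni2char,
	.char2uni = toy_char2uni,
};

static const struct nls_table toy_table = {
	.charset = "toy-latin1",
	.ops = &toy_ops,
};

/* Callers go through the wrappers and never touch table->ops directly. */
static inline int nls_uni2char(const struct nls_table *table, toy_wchar_t uni,
			       unsigned char *out, int boundlen)
{
	return table->ops->uni2char(uni, out, boundlen);
}

static inline int nls_char2uni(const struct nls_table *table,
			       const unsigned char *rawstring, int boundlen,
			       toy_wchar_t *uni)
{
	return table->ops->char2uni(rawstring, boundlen, uni);
}
```

Because every caller uses the wrappers, moving the hooks from the table
into an ops structure only changes the wrapper bodies, which is the
reason the earlier "Wrap uni2char/char2uni callers" patch comes first in
the series.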

2018-12-09 20:53:33

by Gabriel Krisman Bertazi

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

Linus Torvalds <[email protected]> writes:

> On Sat, Dec 8, 2018 at 9:03 PM Theodore Y. Ts'o <[email protected]> wrote:

> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
>
> If you don't, you shouldn't be touching any of the nls code.
>
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Hi Linus,

As Ted mentioned with the SMB case, in my understanding, we might have
more users for in-kernel utf8 normalization/casefold comparison
functions than just ext4 in the future. Steve French (in Cc), for
instance, mentioned his interest in using this higher-level NLS API when
I first submitted these patches.

My first RFC actually included this code as a separate module inside
lib/ instead of touching NLS, but I found myself rewriting many of the
same APIs that already existed in NLS. That is why I merged my work
with that subsystem. I am open to rethinking it, if there is a better
alternative.

Thanks,

--
Gabriel Krisman Bertazi

2018-12-06 22:05:33

by Gabriel Krisman Bertazi

Subject: [PATCH v4 14/23] nls: utf8: Move nls-utf8{,-core}.c

From: Gabriel Krisman Bertazi <[email protected]>

nls_utf8 will be generated from multiple files, so let's move the
existing code to a file with a -core suffix.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/nls/Makefile | 3 +++
fs/nls/{nls_utf8.c => nls_utf8-core.c} | 0
2 files changed, 3 insertions(+)
rename fs/nls/{nls_utf8.c => nls_utf8-core.c} (100%)

diff --git a/fs/nls/Makefile b/fs/nls/Makefile
index 840e06aefd47..c94221b6108d 100644
--- a/fs/nls/Makefile
+++ b/fs/nls/Makefile
@@ -43,7 +43,10 @@ obj-$(CONFIG_NLS_ISO8859_14) += nls_iso8859-14.o
obj-$(CONFIG_NLS_ISO8859_15) += nls_iso8859-15.o
obj-$(CONFIG_NLS_KOI8_R) += nls_koi8-r.o
obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o
+
obj-$(CONFIG_NLS_UTF8) += nls_utf8.o
+nls_utf8-y += nls_utf8-core.o
+
obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o
obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o
obj-$(CONFIG_NLS_MAC_CROATIAN) += mac-croatian.o
diff --git a/fs/nls/nls_utf8.c b/fs/nls/nls_utf8-core.c
similarity index 100%
rename from fs/nls/nls_utf8.c
rename to fs/nls/nls_utf8-core.c
--
2.20.0.rc2

2018-12-06 22:06:00

by Gabriel Krisman Bertazi

Subject: [PATCH v4 21/23] ext4: Support encoding-aware file name lookups

From: Gabriel Krisman Bertazi <[email protected]>

This patch implements the actual support for encoding-aware file name
lookups in ext4, based on the feature bit and the encoding stored in the
superblock.

A filesystem that has the encoding feature set is able to find files
even if the name used by userspace is not byte-for-byte identical to the
name on disk, as long as it is an equivalent string. This operation
will be called an inexact-match name search.

Ext4 only stores the first equivalent name dentry used in the
dcache. This is done to prevent unintentional duplication of dentries in
the dcache, while also allowing the VFS code to quickly find the right
entry in the cache regardless of which equivalent string was used,
without resorting to ->lookup().

d_hash() is implemented as the hash of the normalized string, such that
we always have a well-known bucket for all the equivalencies of the same
string. d_compare uses the nls_strncmp() infrastructure, which should
handle the comparison of equivalent names as well. If the filesystem's
normalization type is PLAIN, though, we can just reuse the VFS hash.
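
As a rough userspace illustration of why hashing the normalized string
gives every equivalent spelling the same bucket, consider the sketch
below. It is not the patch code: `toy_normalize()` is a hypothetical
normalizer that only knows the spellings of A-with-ring, and FNV-1a
stands in for the VFS full_name_hash().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy normalizer (hypothetical, NOT nls_normalize()): rewrites the two
 * precomposed spellings of A-with-ring to the decomposed form
 * "A" + U+030A and copies every other byte unchanged.
 * Returns the output length, or -1 if the output buffer is too small.
 */
static int toy_normalize(const unsigned char *in, size_t len,
			 unsigned char *out, size_t cap)
{
	static const unsigned char nfd[] = { 0x41, 0xcc, 0x8a };
	size_t i = 0, o = 0;

	while (i < len) {
		size_t skip = 0;

		if (i + 3 <= len && !memcmp(in + i, "\xe2\x84\xab", 3))
			skip = 3;	/* U+212B ANGSTROM SIGN */
		else if (i + 2 <= len && !memcmp(in + i, "\xc3\x85", 2))
			skip = 2;	/* U+00C5 A WITH RING ABOVE */

		if (skip) {
			if (o + sizeof(nfd) > cap)
				return -1;
			memcpy(out + o, nfd, sizeof(nfd));
			o += sizeof(nfd);
			i += skip;
		} else {
			if (o + 1 > cap)
				return -1;
			out[o++] = in[i++];
		}
	}
	return (int)o;
}

/* FNV-1a, standing in for the VFS name hash. */
static uint32_t toy_hash(const unsigned char *s, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= s[i];
		h *= 16777619u;
	}
	return h;
}

/* Hash the normalized name, so equivalent spellings share one bucket. */
static int toy_d_hash(const char *name, uint32_t *hash)
{
	unsigned char norm[256];
	int len = toy_normalize((const unsigned char *)name, strlen(name),
				norm, sizeof(norm));

	if (len < 0)
		return -1;
	*hash = toy_hash(norm, (size_t)len);
	return 0;
}
```

With this sketch, the names "\xe2\x84\xab.txt", "\xc3\x85.txt" and
"A\xcc\x8a.txt" all produce the same hash value, so a comparison
function only has to be run against the entries of one bucket.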

For now, negative lookups are not inserted in the dcache, since they
would need to be invalidated anyway, because we can't trust missing file
dentries. This is bad for performance but requires some leveraging of
the vfs layer to fix. We can live without that for now, and so does
everyone else.

DX is supported by modifying the hashes to make them encoding-aware.
The new disk hashes are also calculated as the hash of the normalized
string, instead of the string directly. This allows us to efficiently
search for file names in the htree without requiring the user to provide
the exact name.
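
The control flow of the encoding-aware disk hash can be sketched in
userspace as follows. This is a simplification under stated assumptions:
`toy_normalize()` and `raw_hash()` are hypothetical stand-ins for
nls_normalize() and the real htree hash, and the function returns the
hash directly instead of filling a dx_hash_info. The part worth noticing
is the fallback: a name that cannot be normalized is hashed as an opaque
byte sequence.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BUF_MAX 512

/* FNV-1a, standing in for the on-disk htree hash. */
static uint32_t raw_hash(const unsigned char *s, int len)
{
	uint32_t h = 2166136261u;

	for (int i = 0; i < len; i++) {
		h ^= s[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Hypothetical normalizer: decomposes U+00C5 (bytes c3 85) into
 * "A" + U+030A (bytes 41 cc 8a) and rejects the byte 0xff, which can
 * never appear in UTF-8. Returns the output length or -1 on error.
 */
static int toy_normalize(const unsigned char *in, int len,
			 unsigned char *out, int cap)
{
	int i = 0, o = 0;

	while (i < len) {
		if (in[i] == 0xff)
			return -1;
		if (i + 1 < len && in[i] == 0xc3 && in[i + 1] == 0x85) {
			if (o + 3 > cap)
				return -1;
			out[o++] = 0x41;
			out[o++] = 0xcc;
			out[o++] = 0x8a;
			i += 2;
		} else {
			if (o + 1 > cap)
				return -1;
			out[o++] = in[i++];
		}
	}
	return o;
}

static uint32_t toy_dirhash(int has_encoding, const char *name, int len)
{
	const unsigned char *raw = (const unsigned char *)name;
	unsigned char buf[TOY_BUF_MAX];

	if (len > 0 && has_encoding) {
		int dlen = toy_normalize(raw, len, buf, (int)sizeof(buf));

		if (dlen >= 0)
			return raw_hash(buf, dlen);
		/* Invalid name: fall through and hash the raw bytes. */
	}
	return raw_hash(raw, len);
}
```

Without an encoding, the hash is exactly the raw hash, so filesystems
that do not enable the feature keep their existing on-disk hashes.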

Changes since v2:
- Don't use d_add_ci.
- Squash the dcache hooks into this patch.
- Rename sbi->encoding -> sbi->s_encoding.

Changes since v1:
- Support normalized htree hashes.
- Guard code with CONFIG_NLS.
- Use qstr->len instead of strlen in dcache hookups.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 45 ++++++++++++++++++++++++++++
fs/ext4/ext4.h | 12 ++++++--
fs/ext4/hash.c | 34 ++++++++++++++++++++-
fs/ext4/ialloc.c | 2 +-
fs/ext4/inline.c | 2 +-
fs/ext4/namei.c | 78 +++++++++++++++++++++++++++++++++++++++++-------
fs/ext4/super.c | 6 ++++
7 files changed, 163 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index f93f9881ec18..efb75c204551 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -26,6 +26,7 @@
#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/iversion.h>
+#include <linux/nls.h>
#include "ext4.h"
#include "xattr.h"

@@ -662,3 +663,47 @@ const struct file_operations ext4_dir_operations = {
.open = ext4_dir_open,
.release = ext4_release_dir,
};
+
+#ifdef CONFIG_NLS
+static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
+ const char *str, const struct qstr *name)
+{
+ struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding;
+
+ return nls_strncmp(charset, str, len, name->name, name->len);
+}
+
+static int ext4_d_hash(const struct dentry *dentry, struct qstr *q)
+{
+ const struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding;
+ unsigned char *norm;
+ int len, ret = 0;
+
+ /* If normalization is TYPE_PLAIN, we can just reuse the vfs
+ * hash. */
+ if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset))
+ return 0;
+
+ norm = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!norm)
+ return -ENOMEM;
+
+ len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX);
+
+ if (len < 0) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ q->hash = full_name_hash(dentry, norm, len);
+
+out:
+ kfree (norm);
+ return ret;
+}
+
+const struct dentry_operations ext4_dentry_ops = {
+ .d_hash = ext4_d_hash,
+ .d_compare = ext4_d_compare,
+};
+#endif
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c21717a19106..e84a6605a19a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1341,6 +1341,11 @@ struct ext4_super_block {
#define EXT4_ENC_ASCII 0
#define EXT4_ENC_UTF8_11_0 1

+/*
+ * Flags for ext4_sb_info.s_encoding_flags. Be careful when modifying
+ * these, as they must match their NLS counterpart. */
+#define EXT4_ENC_STRICT_MODE_FL (1 << 0)
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -2387,8 +2392,8 @@ extern int ext4_check_all_de(struct inode *dir, struct buffer_head *bh,
extern int ext4_sync_file(struct file *, loff_t, loff_t, int);

/* hash.c */
-extern int ext4fs_dirhash(const char *name, int len, struct
- dx_hash_info *hinfo);
+extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
+ struct dx_hash_info *hinfo);

/* ialloc.c */
extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
@@ -2971,6 +2976,9 @@ static inline void ext4_unlock_group(struct super_block *sb,

/* dir.c */
extern const struct file_operations ext4_dir_operations;
+#ifdef CONFIG_NLS
+extern const struct dentry_operations ext4_dentry_ops;
+#endif

/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index e22dcfab308b..8ec9c7145987 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -6,6 +6,7 @@
*/

#include <linux/fs.h>
+#include <linux/nls.h>
#include <linux/compiler.h>
#include <linux/bitops.h>
#include "ext4.h"
@@ -196,7 +197,8 @@ static void str2hashbuf_unsigned(const char *msg, int len, __u32 *buf, int num)
* represented, and whether or not the returned hash is 32 bits or 64
* bits. 32 bit hashes will return 0 for the minor hash.
*/
-int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
+static int __ext4fs_dirhash(const char *name, int len,
+ struct dx_hash_info *hinfo)
{
__u32 hash;
__u32 minor_hash = 0;
@@ -266,3 +268,33 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
hinfo->minor_hash = minor_hash;
return 0;
}
+
+int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
+ struct dx_hash_info *hinfo)
+{
+#ifdef CONFIG_NLS
+ const struct nls_table *charset = EXT4_SB(dir->i_sb)->s_encoding;
+ int r, dlen;
+ unsigned char *buff;
+
+ if (len && charset) {
+ buff = kzalloc(sizeof (char) * PATH_MAX, GFP_KERNEL);
+ if (!buff)
+ return -1;
+
+ dlen = nls_normalize(charset, name, len, buff, PATH_MAX);
+
+ if (dlen < 0) {
+ kfree(buff);
+ goto opaque_seq;
+ }
+
+ r = __ext4fs_dirhash(buff, dlen, hinfo);
+
+ kfree(buff);
+ return r;
+ }
+opaque_seq:
+#endif
+ return __ext4fs_dirhash(name, len, hinfo);
+}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 014f6a698cb7..1ef355549abf 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -455,7 +455,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
if (qstr) {
hinfo.hash_version = DX_HASH_HALF_MD4;
hinfo.seed = sbi->s_hash_seed;
- ext4fs_dirhash(qstr->name, qstr->len, &hinfo);
+ ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo);
grp = hinfo.hash;
} else
grp = prandom_u32();
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 9c4bac18cc6c..10b9d3dcec4e 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1404,7 +1404,7 @@ int htree_inlinedir_to_tree(struct file *dir_file,
}
}

- ext4fs_dirhash(de->name, de->name_len, hinfo);
+ ext4fs_dirhash(dir, de->name, de->name_len, hinfo);
if ((hinfo->hash < start_hash) ||
((hinfo->hash == start_hash) &&
(hinfo->minor_hash < start_minor_hash)))
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 437f71fe83ae..23e0e911b3fe 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/bio.h>
#include <linux/iversion.h>
+#include <linux/nls.h>
#include "ext4.h"
#include "ext4_jbd2.h"

@@ -629,7 +630,7 @@ static struct stats dx_show_leaf(struct inode *dir,
}
if (!fscrypt_has_encryption_key(dir)) {
/* Directory is not encrypted */
- ext4fs_dirhash(de->name,
+ ext4fs_dirhash(dir, de->name,
de->name_len, &h);
printk("%*.s:(U)%x.%u ", len,
name, h.hash,
@@ -662,8 +663,8 @@ static struct stats dx_show_leaf(struct inode *dir,
name = fname_crypto_str.name;
len = fname_crypto_str.len;
}
- ext4fs_dirhash(de->name, de->name_len,
- &h);
+ ext4fs_dirhash(dir, de->name,
+ de->name_len, &h);
printk("%*.s:(E)%x.%u ", len, name,
h.hash, (unsigned) ((char *) de
- base));
@@ -673,7 +674,7 @@ static struct stats dx_show_leaf(struct inode *dir,
#else
int len = de->name_len;
char *name = de->name;
- ext4fs_dirhash(de->name, de->name_len, &h);
+ ext4fs_dirhash(dir, de->name, de->name_len, &h);
printk("%*.s:%x.%u ", len, name, h.hash,
(unsigned) ((char *) de - base));
#endif
@@ -762,7 +763,7 @@ dx_probe(struct ext4_filename *fname, struct inode *dir,
hinfo->hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
hinfo->seed = EXT4_SB(dir->i_sb)->s_hash_seed;
if (fname && fname_name(fname))
- ext4fs_dirhash(fname_name(fname), fname_len(fname), hinfo);
+ ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), hinfo);
hash = hinfo->hash;

if (root->info.unused_flags & 1) {
@@ -1008,7 +1009,7 @@ static int htree_dirblock_to_tree(struct file *dir_file,
/* silently ignore the rest of the block */
break;
}
- ext4fs_dirhash(de->name, de->name_len, hinfo);
+ ext4fs_dirhash(dir, de->name, de->name_len, hinfo);
if ((hinfo->hash < start_hash) ||
((hinfo->hash == start_hash) &&
(hinfo->minor_hash < start_minor_hash)))
@@ -1197,7 +1198,7 @@ static int dx_make_map(struct inode *dir, struct ext4_dir_entry_2 *de,

while ((char *) de < base + blocksize) {
if (de->name_len && de->inode) {
- ext4fs_dirhash(de->name, de->name_len, &h);
+ ext4fs_dirhash(dir, de->name, de->name_len, &h);
map_tail--;
map_tail->hash = h.hash;
map_tail->offs = ((char *) de - base)>>2;
@@ -1257,10 +1258,14 @@ static void dx_insert_block(struct dx_frame *frame, u32 hash, ext4_lblk_t block)
*
* Return: %true if the directory entry matches, otherwise %false.
*/
-static inline bool ext4_match(const struct ext4_filename *fname,
+static inline bool ext4_match(const struct inode *parent,
+ const struct ext4_filename *fname,
const struct ext4_dir_entry_2 *de)
{
struct fscrypt_name f;
+#ifdef CONFIG_NLS
+ const struct ext4_sb_info *sbi = EXT4_SB(parent->i_sb);
+#endif

if (!de->inode)
return false;
@@ -1270,6 +1275,15 @@ static inline bool ext4_match(const struct ext4_filename *fname,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
f.crypto_buf = fname->crypto_buf;
#endif
+
+#ifdef CONFIG_NLS
+ if (sbi->s_encoding) {
+ return !nls_strncmp(sbi->s_encoding,
+ de->name, de->name_len,
+ f.disk_name.name, f.disk_name.len);
+ }
+#endif
+
return fscrypt_match_name(&f, de->name, de->name_len);
}

@@ -1290,7 +1304,7 @@ int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size,
/* this code is executed quadratically often */
/* do minimal checking `by hand' */
if ((char *) de + de->name_len <= dlimit &&
- ext4_match(fname, de)) {
+ ext4_match(dir, fname, de)) {
/* found a match - just to be sure, do
* a full check */
if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
@@ -1588,6 +1602,17 @@ static struct dentry *ext4_lookup(struct inode *dir, struct dentry *dentry, unsi
return ERR_PTR(-EPERM);
}
}
+
+#ifdef CONFIG_NLS
+ if (EXT4_SB(dir->i_sb)->s_encoding && !inode) {
+ /* Eventually we want to call d_add_ci(dentry, NULL)
+ * for negative dentries in the encoding case as
+ * well. For now, prevent the negative dentry
+ * from being cached.
+ */
+ return NULL;
+ }
+#endif
return d_splice_alias(inode, dentry);
}

@@ -1798,7 +1823,7 @@ int ext4_find_dest_de(struct inode *dir, struct inode *inode,
if (ext4_check_dir_entry(dir, NULL, de, bh,
buf, buf_size, offset))
return -EFSCORRUPTED;
- if (ext4_match(fname, de))
+ if (ext4_match(dir, fname, de))
return -EEXIST;
nlen = EXT4_DIR_REC_LEN(de->name_len);
rlen = ext4_rec_len_from_disk(de->rec_len, buf_size);
@@ -1983,7 +2008,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
if (fname->hinfo.hash_version <= DX_HASH_TEA)
fname->hinfo.hash_version += EXT4_SB(dir->i_sb)->s_hash_unsigned;
fname->hinfo.seed = EXT4_SB(dir->i_sb)->s_hash_seed;
- ext4fs_dirhash(fname_name(fname), fname_len(fname), &fname->hinfo);
+ ext4fs_dirhash(dir, fname_name(fname), fname_len(fname), &fname->hinfo);

memset(frames, 0, sizeof(frames));
frame = frames;
@@ -2036,6 +2061,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
struct ext4_dir_entry_2 *de;
struct ext4_dir_entry_tail *t;
struct super_block *sb;
+ struct ext4_sb_info *sbi;
struct ext4_filename fname;
int retval;
int dx_fallback=0;
@@ -2047,10 +2073,18 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
csum_size = sizeof(struct ext4_dir_entry_tail);

sb = dir->i_sb;
+ sbi = EXT4_SB(sb);
blocksize = sb->s_blocksize;
if (!dentry->d_name.len)
return -EINVAL;

+#ifdef CONFIG_NLS
+ if (sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL &&
+ nls_validate(sbi->s_encoding, dentry->d_name.name,
+ dentry->d_name.len))
+ return -EINVAL;
+#endif
+
retval = ext4_fname_setup_filename(dir, &dentry->d_name, 0, &fname);
if (retval)
return retval;
@@ -2975,6 +3009,17 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
ext4_update_dx_flag(dir);
ext4_mark_inode_dirty(handle, dir);

+#ifdef CONFIG_NLS
+ /* VFS negative dentries are incompatible with Encoding and
+ * Case-insensitiveness. Eventually we'll want avoid
+ * invalidating the dentries here, alongside with returning the
+ * negative dentries at ext4_lookup(), when it is better
+ * supported by the VFS for the CI case.
+ */
+ if (EXT4_SB(dir->i_sb)->s_encoding)
+ d_invalidate(dentry);
+#endif
+
end_rmdir:
brelse(bh);
if (handle)
@@ -3044,6 +3089,17 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
inode->i_ctime = current_time(inode);
ext4_mark_inode_dirty(handle, inode);

+#ifdef CONFIG_NLS
+ /* VFS negative dentries are incompatible with Encoding and
+ * Case-insensitiveness. Eventually we'll want avoid
+ * invalidating the dentries here, alongside with returning the
+ * negative dentries at ext4_lookup(), when it is better
+ * supported by the VFS for the CI case.
+ */
+ if (EXT4_SB(dir->i_sb)->s_encoding)
+ d_invalidate(dentry);
+#endif
+
end_unlink:
brelse(bh);
if (handle)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e64a9ed2ca12..5ac1ed77f36b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4412,6 +4412,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
iput(root);
goto failed_mount4;
}
+
+#ifdef CONFIG_NLS
+ if (sbi->s_encoding)
+ sb->s_d_op = &ext4_dentry_ops;
+#endif
+
sb->s_root = d_make_root(root);
if (!sb->s_root) {
ext4_msg(sb, KERN_ERR, "get root dentry failed");
--
2.20.0.rc2
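
The strict-mode check added to ext4_add_entry() above can be pictured
with the following userspace sketch. The names `TOY_ENC_STRICT_MODE_FL`,
`utf8_is_valid()` and `toy_check_new_name()` are hypothetical, and the
validator only checks that the bytes are structurally well-formed UTF-8;
whatever additional checks the series' nls_validate() performs are not
modeled here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENC_STRICT_MODE_FL (1 << 0)

/* Structural UTF-8 check: lead/continuation bytes, overlongs, range. */
static bool utf8_is_valid(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		unsigned int cp, min;
		size_t n;	/* number of continuation bytes expected */

		if (c < 0x80) {
			i++;
			continue;
		} else if ((c & 0xe0) == 0xc0) {
			n = 1; cp = c & 0x1f; min = 0x80;
		} else if ((c & 0xf0) == 0xe0) {
			n = 2; cp = c & 0x0f; min = 0x800;
		} else if ((c & 0xf8) == 0xf0) {
			n = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return false;	/* stray continuation or invalid byte */
		}

		if (len - i <= n)
			return false;	/* truncated sequence */

		for (size_t k = 1; k <= n; k++) {
			if ((s[i + k] & 0xc0) != 0x80)
				return false;
			cp = (cp << 6) | (s[i + k] & 0x3f);
		}

		/* Reject overlong forms, surrogates and out-of-range values. */
		if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
			return false;

		i += n + 1;
	}
	return true;
}

/* Returns 0 if the name may be created, -EINVAL otherwise. */
static int toy_check_new_name(unsigned int encoding_flags,
			      const char *name, size_t len)
{
	if ((encoding_flags & TOY_ENC_STRICT_MODE_FL) &&
	    !utf8_is_valid((const unsigned char *)name, len))
		return -EINVAL;
	return 0;
}
```

Note that without the strict flag an invalid sequence is still accepted
at creation time; it is then treated as an opaque byte sequence, as in
the hash fallback of ext4fs_dirhash().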

2018-12-09 20:10:46

by Theodore Ts'o

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sun, Dec 09, 2018 at 09:41:13AM -0800, Linus Torvalds wrote:
> But if you only support ascii or utf-8, then why are you messing with
> the nls part? That makes no sense.
>
> You can't have it both ways.
>
> Either you have a horrible fundamental design mistake that has
> different per-filesystem locales, or you don't.
>
> If you don't, you shouldn't be touching any of the nls code.
>
> Whatever unicode tables you use for case folding shouldn't be in the nls code.

Gabriel added the Unicode tables for case folding to the fs/nls
directory. If you'd prefer that we put them somewhere else, we
can; do you have a preference?

- Ted

2018-12-08 21:59:11

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sat, Dec 8, 2018 at 1:48 PM Linus Torvalds
<[email protected]> wrote:
>
> Yes, allowing concurrent use then generates whole new "interesting"
> questions, like "what happens if a case _sensitive_ user creates two
> files with names that are identical to a in-sensitive user", but they
> aren't necessarily any worse than the issues you face *not* allowing
> that.

I'm hoping you are at least doing it per-directory. That makes at
least the "oh, the whole filesystem needs to do this wrong" issue a
bit less bad.

Just looking at the shortlog you posted, my guess is that the ext4
patches didn't even get *that* right, though. That shortlog "encoding
information in superblock" implies this is the same kind of just
horribly bad mess that we've seen before.

I really despise every single case-insensitive filesystem I have ever
seen, exactly because nobody apparently spends even a minimal amount
of effort on getting any of the basics remotely right. Every single
case I've seen has been a huge nasty hack, with seriously bad
system-wide consequences.

Linus

2018-12-06 22:05:51

by Gabriel Krisman Bertazi

Subject: [PATCH v4 19/23] ext4: Reserve superblock fields for encoding information

From: Gabriel Krisman Bertazi <[email protected]>

The s_encoding field stores a magic number indicating the encoding
format and version used globally by file and directory names in the
filesystem.

The s_encoding_flags defines policies for using the charset encoding,
like how to handle invalid sequences and what kind of normalization to
use.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/ext4.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3f89d0ab08fc..52c9e8b948a0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1311,7 +1311,9 @@ struct ext4_super_block {
__u8 s_first_error_time_hi;
__u8 s_last_error_time_hi;
__u8 s_pad[2];
- __le32 s_reserved[96]; /* Padding to the end of the block */
+ __le16 s_encoding; /* Filename charset encoding */
+ __le16 s_encoding_flags; /* Filename charset encoding flags */
+ __le32 s_reserved[95]; /* Padding to the end of the block */
__le32 s_checksum; /* crc32c(superblock) */
};

@@ -1661,6 +1663,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_INCOMPAT_LARGEDIR 0x4000 /* >2GB or 3-lvl htree */
#define EXT4_FEATURE_INCOMPAT_INLINE_DATA 0x8000 /* data in inode */
#define EXT4_FEATURE_INCOMPAT_ENCRYPT 0x10000
+#define EXT4_FEATURE_INCOMPAT_FNAME_ENCODING 0x20000

#define EXT4_FEATURE_COMPAT_FUNCS(name, flagname) \
static inline bool ext4_has_feature_##name(struct super_block *sb) \
@@ -1749,6 +1752,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(csum_seed, CSUM_SEED)
EXT4_FEATURE_INCOMPAT_FUNCS(largedir, LARGEDIR)
EXT4_FEATURE_INCOMPAT_FUNCS(inline_data, INLINE_DATA)
EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT)
+EXT4_FEATURE_INCOMPAT_FUNCS(fname_encoding, FNAME_ENCODING)

#define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
@@ -1776,6 +1780,7 @@ EXT4_FEATURE_INCOMPAT_FUNCS(encrypt, ENCRYPT)
EXT4_FEATURE_INCOMPAT_MMP | \
EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
EXT4_FEATURE_INCOMPAT_ENCRYPT | \
+ EXT4_FEATURE_INCOMPAT_FNAME_ENCODING | \
EXT4_FEATURE_INCOMPAT_CSUM_SEED | \
EXT4_FEATURE_INCOMPAT_LARGEDIR)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
--
2.20.0.rc2
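
The reason s_reserved shrinks from 96 to 95 entries is that the two new
__le16 fields together occupy exactly one __le32 slot, so the superblock
size and the offset of s_checksum stay the same. A small userspace sketch
can check that arithmetic. Only the tail of the structure is modeled,
the struct names `sb_tail_old` / `sb_tail_new` and the helper
`toy_le16_to_cpu()` are hypothetical, and fixed-width host types stand in
for the kernel's __u8/__le16/__le32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tail of struct ext4_super_block before the patch (earlier fields omitted). */
struct sb_tail_old {
	uint8_t  s_first_error_time_hi;
	uint8_t  s_last_error_time_hi;
	uint8_t  s_pad[2];
	uint32_t s_reserved[96];	/* Padding to the end of the block */
	uint32_t s_checksum;
};

/* The same tail after the patch. */
struct sb_tail_new {
	uint8_t  s_first_error_time_hi;
	uint8_t  s_last_error_time_hi;
	uint8_t  s_pad[2];
	uint16_t s_encoding;		/* Filename charset encoding */
	uint16_t s_encoding_flags;	/* Filename charset encoding flags */
	uint32_t s_reserved[95];	/* Padding to the end of the block */
	uint32_t s_checksum;
};

/* The on-disk layout must not move: same size, same checksum offset. */
_Static_assert(sizeof(struct sb_tail_old) == sizeof(struct sb_tail_new),
	       "superblock tail changed size");
_Static_assert(offsetof(struct sb_tail_old, s_checksum) ==
	       offsetof(struct sb_tail_new, s_checksum),
	       "s_checksum moved");

/* Decode a little-endian 16-bit on-disk value regardless of host order. */
static uint16_t toy_le16_to_cpu(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}
```

Reading the field through a little-endian decoder matters on big-endian
hosts; for example, the on-disk bytes 01 00 decode to 1, the value the
series assigns to EXT4_ENC_UTF8_11_0.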

2018-12-10 00:08:26

by Theodore Ts'o

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sun, Dec 09, 2018 at 12:54:38PM -0800, Linus Torvalds wrote:
> First off, there is no such thing as "one" unicode table for case
> folding. There are lots and lots of tables, and I'm not clear what
> table it is all about.
>
> For example, both OS X and Windows do some form of case folding on
> unicode. They don't do the *same* folding, though.

So things have gotten much better in recent years. In the past it was
kind of a disaster, but the world is converging enough that the latest
versions of Mac OS X's APFS and Windows NTFS behave pretty much the same
way. Both are case-insensitive and case-preserving, and
normalization-insensitive and normalization-preserving, with respect to
filenames.

In the bad old days, MacOS X's HFS+ was not normalization-preserving.
So it would force filenames to NFD form --- so if the user tried to
create a file named Å, and passed in the Unicode string U+212B to
creat(2), HFS+ would store it as U+0041,U+030A and that is what
readdir(2) would return. Apple has effectively admitted this was a
mistake, and their new APFS doesn't do this any more.

Now, both file systems basically say, "we don't care whether you pass
in U+212B or U+0041,U+030A; on the screen it looks identical, Å, so we
will treat it as the same filename; but readdir(2) will return what
you gave us."
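
To make the "insensitive but preserving" behaviour concrete, here is a
toy userspace directory. It is only a sketch: all `toy_*` names are
hypothetical, and the comparison folds ASCII letters only, whereas a
real implementation would use the Unicode casefold and normalization
tables.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_DIR_MAX  8
#define TOY_NAME_MAX 32

struct toy_dir {
	char names[TOY_DIR_MAX][TOY_NAME_MAX];	/* stored exactly as given */
	size_t count;
};

/* ASCII-only case-insensitive equality. */
static int toy_name_equal(const char *a, const char *b)
{
	for (; *a && *b; a++, b++)
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return 0;
	return *a == *b;
}

/* Returns the stored spelling of a matching entry, or NULL. */
static const char *toy_lookup(const struct toy_dir *d, const char *name)
{
	for (size_t i = 0; i < d->count; i++)
		if (toy_name_equal(d->names[i], name))
			return d->names[i];
	return NULL;
}

/* Creation fails if any equivalent name already exists. */
static int toy_create(struct toy_dir *d, const char *name)
{
	if (toy_lookup(d, name))
		return -EEXIST;
	if (d->count == TOY_DIR_MAX || strlen(name) >= TOY_NAME_MAX)
		return -ENOSPC;
	strcpy(d->names[d->count++], name);
	return 0;
}
```

After creating "ReadMe.TXT", a lookup of "README.txt" finds the entry
and reports it under the spelling that was originally supplied, while an
attempt to create "readme.txt" fails because an equivalent name exists.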

It's been a *long* time since Unicode has changed case folding rules
for pre-existing characters. The tables have only changed as new
character sets have been added. If you have a set of filenames which
were all legal under Unicode 5.0, how they case fold didn't change in
Unicode 6.0, 7.0, 8.0, 9.0, 10.0 or 11.0.

Unicode 11.0 added some character sets like Ancient Sanskrit, a bunch
of new emoji's, and the copyleft symbol, and to the extent that
Ancient Sanskrit had case, the tables might have been *extended*. But
that doesn't break backwards compatibility.

And, of course, MacOS and Windows have been aggressively tracking
Unicode updates because everybody wants the latest emoji's. :-)

And it's not just SAMBA/CIFS. The NFSv4 protocol also provides for
case/normalization preserving filenames, and you can specify a NFSv4
mount option whether or not file name lookups should be
case/normalization insensitive. And the NFSv4 protocol specs also
specify the use of the Unicode tables, of which the latest versions
can be downloaded here:

http://www.unicode.org/Public/11.0.0/ucd/

So how about this? We'll put the unicode handling functions in a new
directory, fs/unicode, just to make it really clear that this will not
be changing any of the legacy fs/nls functions which other file
systems use. By putting it in a separate directory, it will be
easier for other file systems to use it, whether it's for better Samba
or NFSv4 support.

- Ted

2018-12-06 22:06:03

by Gabriel Krisman Bertazi

Subject: [PATCH v4 22/23] ext4: Implement EXT4_CASEFOLD_FL flag

From: Gabriel Krisman Bertazi <[email protected]>

Casefold is a flag applied to directories and inherited by their
children, which states that the directory requires case-insensitive
searches. This flag can only be enabled on empty directories of
filesystems that support the encoding feature, thus preventing
collisions between file names that differ only by case.

Encoding-awareness is also required because we consider the casefold
operation not to be defined for opaque byte sequences.
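
The rules described above (encoding required, empty directories only,
inheritance by children) can be sketched in userspace as follows. This
is an illustration, not the ioctl code from this patch: the `toy_*`
names and the specific error codes are assumptions, and only the flag
value is taken from the EXT4_CASEFOLD_FL definition in the diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_CASEFOLD_FL 0x40000000u	/* same value as EXT4_CASEFOLD_FL */

struct toy_inode {
	bool is_dir;
	unsigned int nr_entries;	/* entries other than "." and ".." */
	uint32_t flags;
};

/* Enable casefolding on a directory; error codes here are assumptions. */
static int toy_set_casefold(bool fs_has_encoding, struct toy_inode *dir)
{
	if (!fs_has_encoding)
		return -EOPNOTSUPP;	/* casefold is undefined for opaque bytes */
	if (!dir->is_dir)
		return -ENOTDIR;
	if (dir->nr_entries)
		return -ENOTEMPTY;	/* existing names might collide */
	dir->flags |= TOY_CASEFOLD_FL;
	return 0;
}

/* A new child starts out with the parent's casefold setting. */
static void toy_init_child(const struct toy_inode *parent,
			   struct toy_inode *child, bool is_dir)
{
	child->is_dir = is_dir;
	child->nr_entries = 0;
	child->flags = parent->flags & TOY_CASEFOLD_FL;
}
```

Requiring an empty directory means no pair of existing entries can
become equivalent at the moment the flag is set, so the lookup code never
has to resolve a collision after the fact.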

Changes since v2:
- Rename sbi->encoding -> sbi->s_encoding.

Changes since v1:
- Moved the CASEFOLD_FL to prevent collision with reserved verity flag.

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/ext4/dir.c | 30 ++++++++++++++++++++++--------
fs/ext4/ext4.h | 7 ++++---
fs/ext4/hash.c | 6 +++++-
fs/ext4/inode.c | 4 +++-
fs/ext4/ioctl.c | 18 ++++++++++++++++++
fs/ext4/namei.c | 13 ++++++++++---
include/linux/fs.h | 2 ++
7 files changed, 64 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index efb75c204551..43b91747f7e7 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -670,6 +670,10 @@ static int ext4_d_compare(const struct dentry *dentry, unsigned int len,
{
struct nls_table *charset = EXT4_SB(dentry->d_sb)->s_encoding;

+ if (IS_CASEFOLDED(dentry->d_parent->d_inode))
+ return nls_strncasecmp(charset, str, len, name->name,
+ name->len);
+
return nls_strncmp(charset, str, len, name->name, name->len);
}

@@ -679,16 +683,26 @@ static int ext4_d_hash(const struct dentry *dentry, struct qstr *q)
unsigned char *norm;
int len, ret = 0;

- /* If normalization is TYPE_PLAIN, we can just reuse the vfs
- * hash. */
- if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset))
- return 0;
+ if (!IS_CASEFOLDED(dentry->d_inode)) {

- norm = kmalloc(PATH_MAX, GFP_ATOMIC);
- if (!norm)
- return -ENOMEM;
+ /* If normalization is TYPE_PLAIN, we can just reuse the
+ * VFS hash.
+ */
+ if (IS_NORMALIZATION_TYPE_ALL_PLAIN(charset))
+ return 0;

- len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX);
+ norm = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!norm)
+ return -ENOMEM;
+
+ len = nls_normalize(charset, q->name, q->len, norm, PATH_MAX);
+ } else {
+ norm = kmalloc(PATH_MAX, GFP_ATOMIC);
+ if (!norm)
+ return -ENOMEM;
+
+ len = nls_casefold(charset, q->name, q->len, norm, PATH_MAX);
+ }

if (len < 0) {
ret = -EINVAL;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e84a6605a19a..d21ed5e88302 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -400,10 +400,11 @@ struct flex_groups {
#define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
#define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */
#define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
+#define EXT4_CASEFOLD_FL 0x40000000 /* Casefolded file */
#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */

-#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE 0x704BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE 0x604BC0FF /* User modifiable flags */

/* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
#define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \
@@ -418,7 +419,7 @@ struct flex_groups {
EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
- EXT4_PROJINHERIT_FL)
+ EXT4_PROJINHERIT_FL | EXT4_CASEFOLD_FL)

/* Flags that are appropriate for regular files (all but dir-specific ones). */
#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index 8ec9c7145987..78cb97664a33 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -282,7 +282,11 @@ int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
if (!buff)
return -1;

- dlen = nls_normalize(charset, name, len, buff, PATH_MAX);
+ if (!IS_CASEFOLDED(dir))
+ dlen = nls_normalize(charset, name, len, buff,
+ PATH_MAX);
+ else
+ dlen = nls_casefold(charset, name, len, buff, PATH_MAX);

if (dlen < 0) {
kfree(buff);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 22a9d8159720..9908d7d98b6e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4745,9 +4745,11 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_DAX;
if (flags & EXT4_ENCRYPT_FL)
new_fl |= S_ENCRYPTED;
+ if (flags & EXT4_CASEFOLD_FL)
+ new_fl |= S_CASEFOLD;
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX|
- S_ENCRYPTED);
+ S_ENCRYPTED|S_CASEFOLD);
}

static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0edee31913d1..ef4ffe681836 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -231,6 +231,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
struct ext4_iloc iloc;
unsigned int oldflags, mask, i;
unsigned int jflag;
+ struct super_block *sb = inode->i_sb;

/* Is it quota file? Do not allow user to mess with it */
if (ext4_is_quota_file(inode))
@@ -275,6 +276,23 @@ static int ext4_ioctl_setflags(struct inode *inode,
goto flags_out;
}

+ if ((flags ^ oldflags) & EXT4_CASEFOLD_FL) {
+ if (!ext4_has_feature_fname_encoding(sb)) {
+ err = -EOPNOTSUPP;
+ goto flags_out;
+ }
+
+ if (!S_ISDIR(inode->i_mode)) {
+ err = -ENOTDIR;
+ goto flags_out;
+ }
+
+ if (!ext4_empty_dir(inode)) {
+ err = -ENOTEMPTY;
+ goto flags_out;
+ }
+ }
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 23e0e911b3fe..a21f0d7227db 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1278,9 +1278,16 @@ static inline bool ext4_match(const struct inode *parent,

#ifdef CONFIG_NLS
if (sbi->s_encoding) {
- return !nls_strncmp(sbi->s_encoding,
- de->name, de->name_len,
- f.disk_name.name, f.disk_name.len);
+ if (!IS_CASEFOLDED(parent))
+ return !nls_strncmp(sbi->s_encoding,
+ de->name, de->name_len,
+ fname->disk_name.name,
+ fname->disk_name.len);
+ else
+ return !nls_strncasecmp(sbi->s_encoding,
+ de->name, de->name_len,
+ fname->disk_name.name,
+ fname->disk_name.len);
}
#endif

diff --git a/include/linux/fs.h b/include/linux/fs.h
index c95c0807471f..69abaca207c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1947,6 +1947,7 @@ struct super_operations {
#define S_DAX 0 /* Make all the DAX code disappear */
#endif
#define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */
+#define S_CASEFOLD 32768 /* Casefolded file */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1987,6 +1988,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
+#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)

#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
--
2.20.0.rc2

2018-12-09 20:54:58

by Linus Torvalds

Subject: Re: [PATCH v4 00/23] Ext4 Encoding and Case-insensitive support

On Sun, Dec 9, 2018 at 12:10 PM Theodore Y. Ts'o <[email protected]> wrote:
>
> Gabriel added the Unicode tables for case folding to the fs/nls
> directory. If you'd prefer that we put them somewhere else, we
> can; do you have a preference?

I have a really hard time judging, since I haven't seen the code, just
a random diffstat and shortlog.

First off, there is no such thing as "one" unicode table for case
folding. There are lots and lots of tables, and I'm not clear what
table it is all about.

For example, both OS X and Windows do some form of case folding on
unicode. They don't do the *same* folding, though.

There are also various locale variations to case folding. This is
where I thought your nls choice came from, but then you tried to imply
that there are no locale issues and that directories can just have a
single flag to enable/disable the folding.

In some locales, "SS" and "ß" (perhaps "SZ" too) will compare the same
in case-insensitivity. Crazy in general, and afaik modern unicode even
has a real upper-case "ß" so it's arguably legacy, but...

And that's all entirely independent of the issues with all the
combining characters, modifier letters, white-space, overlong utf8
questions, etc etc.

It's also easy to generate overlong utf-8 that decodes to '/', for
example. Some broken systems might consider that identical to a real
'/' and it matters for path lookup.
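
To make the overlong case concrete, here is a small standalone sketch
(not code from the series; both function names are made up for
illustration). The byte pair C0 AF is an overlong two-byte form of
U+002F: a decoder that only masks and shifts turns it into '/', while a
strict check rejects it because the lead bytes C0 and C1 can never
start a well-formed sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Naive decoder for a 2-byte UTF-8 sequence. It only masks and shifts,
 * so it accepts overlong forms without complaint.
 */
static uint32_t naive_decode2(const unsigned char *s)
{
	assert(s != NULL);
	return ((uint32_t)(s[0] & 0x1F) << 6) | (uint32_t)(s[1] & 0x3F);
}

/*
 * Strict check for a 2-byte sequence: the lead byte must be in
 * 0xC2..0xDF (0xC0 and 0xC1 could only encode code points below 0x80,
 * i.e. overlong forms) and the second byte must be a continuation
 * byte in 0x80..0xBF.
 */
static int strict_valid2(const unsigned char *s)
{
	assert(s != NULL);
	return s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80;
}
```

With the input { 0xC0, 0xAF }, naive_decode2() yields 0x2F while
strict_valid2() returns 0, which is the mismatch that matters for path
lookup.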

So what's the actual code? What rules did you happen to pick? Did you
take the windows rules as-is (I _think_ they may be documented) since
the primary target apparently is just samba performance?

And even if the answer is "we follow NTFS rules", which *version* of
NTFS folding rules are you using if you're trying to speed up samba,
for example? Because afaik they have changed over time.

Is the *only* target samba? You are never interested for local loads
like "oh, people want to run Wine and might need it" or the
application testing parts?

All of these matter.

For example, if it's some "ext4 special case just for samba", then
perhaps the logical place to put all this is just in fs/ext4/ and not
bother anybody else about it.

But if it might be useful as some generic "NTFS hashing" library, then
make it that.

Linus

2018-12-06 22:04:46

by Gabriel Krisman Bertazi

Subject: [PATCH v4 01/23] nls: Wrap uni2char/char2uni callers

From: Gabriel Krisman Bertazi <[email protected]>

This is just a cosmetic change at this point. It will simplify the
following Coccinelle patches, which will move the hooks into a dedicated
structure, with the goal of splitting the nls_table structure to support
versioning.

This was generated with the following Coccinelle script:

<smpl>

@@
expression A, B, C, D;
@@
(
- A->uni2char(B, C, D)
+ nls_uni2char(A, B, C, D)
|
- A->char2uni(B, C, D)
+ nls_char2uni(A, B, C, D)
)

</smpl>
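
For readers unfamiliar with the pattern, the effect of the script can
be sketched in plain userspace C. This is only an illustration under
simplified assumptions: struct sketch_table, sketch_uni2char() and
ascii_uni2char() are hypothetical stand-ins, not the kernel's
definitions. The point is that once callers go through the wrapper, the
hook can later move into another structure by changing the wrapper
alone.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, cut-down table with the hook stored directly in it. */
struct sketch_table {
	int (*uni2char)(unsigned int uni, unsigned char *out, int boundlen);
};

/* Callers use this wrapper instead of dereferencing the hook. */
static inline int sketch_uni2char(const struct sketch_table *t,
				  unsigned int uni,
				  unsigned char *out, int boundlen)
{
	assert(t != NULL && t->uni2char != NULL);
	return t->uni2char(uni, out, boundlen);
}

/* Toy hook: encodes code points below 0x80 as a single byte. */
static int ascii_uni2char(unsigned int uni, unsigned char *out, int boundlen)
{
	if (boundlen <= 0)
		return -ENAMETOOLONG;
	if (uni >= 0x80)
		return -EINVAL;
	out[0] = (unsigned char)uni;
	return 1;
}
```

A caller that previously wrote t->uni2char(uni, out, len) now writes
sketch_uni2char(t, uni, out, len), which is the same rewrite the
semantic patch performs on the real call sites.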

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/befs/linuxvfs.c | 4 ++--
fs/cifs/cifs_unicode.c | 9 +++++----
fs/cifs/dir.c | 7 ++++---
fs/fat/dir.c | 8 ++++----
fs/fat/namei_vfat.c | 6 +++---
fs/hfs/trans.c | 9 +++++----
fs/hfsplus/unicode.c | 6 +++---
fs/isofs/joliet.c | 3 ++-
fs/jfs/jfs_unicode.c | 7 +++----
fs/nls/nls_euc-jp.c | 4 ++--
fs/nls/nls_koi8-ru.c | 6 +++---
fs/ntfs/unistr.c | 8 ++++----
include/linux/nls.h | 13 +++++++++++++
13 files changed, 53 insertions(+), 37 deletions(-)

diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 4700b4534439..0ba368fbfad4 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -542,7 +542,7 @@ befs_utf2nls(struct super_block *sb, const char *in,
/* convert from Unicode to nls */
if (uni > MAX_WCHAR_T)
goto conv_err;
- unilen = nls->uni2char(uni, &result[o], in_len - o);
+ unilen = nls_uni2char(nls, uni, &result[o], in_len - o);
if (unilen < 0)
goto conv_err;
}
@@ -616,7 +616,7 @@ befs_nls2utf(struct super_block *sb, const char *in,
for (i = o = 0; i < in_len; i += unilen, o += utflen) {

/* convert from nls to unicode */
- unilen = nls->char2uni(&in[i], in_len - i, &uni);
+ unilen = nls_char2uni(nls, &in[i], in_len - i, &uni);
if (unilen < 0)
goto conv_err;

diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c
index a2b2355e7f01..ffad8b4f90d1 100644
--- a/fs/cifs/cifs_unicode.c
+++ b/fs/cifs/cifs_unicode.c
@@ -145,7 +145,7 @@ cifs_mapchar(char *target, const __u16 *from, const struct nls_table *cp,
return len;

/* if character not one of seven in special remap set */
- len = cp->uni2char(src_char, target, NLS_MAX_CHARSET_SIZE);
+ len = nls_uni2char(cp, src_char, target, NLS_MAX_CHARSET_SIZE);
if (len <= 0)
goto surrogate_pair;

@@ -289,7 +289,7 @@ cifs_strtoUTF16(__le16 *to, const char *from, int len,
}

for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
- charlen = codepage->char2uni(from, len, &wchar_to);
+ charlen = nls_char2uni(codepage, from, len, &wchar_to);
if (charlen < 1) {
cifs_dbg(VFS, "strtoUTF16: char2uni of 0x%x returned %d\n",
*from, charlen);
@@ -515,7 +515,8 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen,
* as they use backslash as separator.
*/
if (dst_char == 0) {
- charlen = cp->char2uni(source + i, srclen - i, &tmp);
+ charlen = nls_char2uni(cp, source + i, srclen - i,
+ &tmp);
dst_char = cpu_to_le16(tmp);

/*
@@ -605,7 +606,7 @@ cifs_local_to_utf16_bytes(const char *from, int len,
wchar_t wchar_to;

for (i = 0; len && *from; i++, from += charlen, len -= charlen) {
- charlen = codepage->char2uni(from, len, &wchar_to);
+ charlen = nls_char2uni(codepage, from, len, &wchar_to);
/* Failed conversion defaults to a question mark */
if (charlen < 1)
charlen = 1;
diff --git a/fs/cifs/dir.c b/fs/cifs/dir.c
index 3713d22b95a7..f8bb9285f630 100644
--- a/fs/cifs/dir.c
+++ b/fs/cifs/dir.c
@@ -910,7 +910,7 @@ static int cifs_ci_hash(const struct dentry *dentry, struct qstr *q)

hash = init_name_hash(dentry);
for (i = 0; i < q->len; i += charlen) {
- charlen = codepage->char2uni(&q->name[i], q->len - i, &c);
+ charlen = nls_char2uni(codepage, &q->name[i], q->len - i, &c);
/* error out if we can't convert the character */
if (unlikely(charlen < 0))
return charlen;
@@ -939,8 +939,9 @@ static int cifs_ci_compare(const struct dentry *dentry,

for (i = 0; i < len; i += l1) {
/* Convert characters in both strings to UTF-16. */
- l1 = codepage->char2uni(&str[i], len - i, &c1);
- l2 = codepage->char2uni(&name->name[i], name->len - i, &c2);
+ l1 = nls_char2uni(codepage, &str[i], len - i, &c1);
+ l2 = nls_char2uni(codepage, &name->name[i], name->len - i,
+ &c2);

/*
* If we can't convert either character, just declare it to
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index c8366cb8eccd..d5f856651a08 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -153,7 +153,7 @@ static int uni16_to_x8(struct super_block *sb, unsigned char *ascii,

while (*ip && ((len - NLS_MAX_CHARSET_SIZE) > 0)) {
ec = *ip++;
- charlen = nls->uni2char(ec, op, NLS_MAX_CHARSET_SIZE);
+ charlen = nls_uni2char(nls, ec, op, NLS_MAX_CHARSET_SIZE);
if (charlen > 0) {
op += charlen;
len -= charlen;
@@ -195,7 +195,7 @@ fat_short2uni(struct nls_table *t, unsigned char *c, int clen, wchar_t *uni)
{
int charlen;

- charlen = t->char2uni(c, clen, uni);
+ charlen = nls_char2uni(t, c, clen, uni);
if (charlen < 0) {
*uni = 0x003f; /* a question mark */
charlen = 1;
@@ -210,7 +210,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c,
int charlen;
wchar_t wc;

- charlen = t->char2uni(c, clen, &wc);
+ charlen = nls_char2uni(t, c, clen, &wc);
if (charlen < 0) {
*uni = 0x003f; /* a question mark */
charlen = 1;
@@ -220,7 +220,7 @@ fat_short2lower_uni(struct nls_table *t, unsigned char *c,
if (!nc)
nc = *c;

- charlen = t->char2uni(&nc, 1, uni);
+ charlen = nls_char2uni(t, &nc, 1, uni);
if (charlen < 0) {
*uni = 0x003f; /* a question mark */
charlen = 1;
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 996c8c25e9c6..ab6b450521ff 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -289,7 +289,7 @@ static inline int to_shortname_char(struct nls_table *nls,
return 1;
}

- len = nls->uni2char(*src, buf, buf_size);
+ len = nls_uni2char(nls, *src, buf, buf_size);
if (len <= 0) {
info->valid = 0;
buf[0] = '_';
@@ -544,8 +544,8 @@ xlate_to_uni(const unsigned char *name, int len, unsigned char *outname,
ip += 5;
i += 5;
} else {
- charlen = nls->char2uni(ip, len - i,
- (wchar_t *)op);
+ charlen = nls_char2uni(nls, ip, len - i,
+ (wchar_t *)op);
if (charlen < 0)
return -EINVAL;
ip += charlen;
diff --git a/fs/hfs/trans.c b/fs/hfs/trans.c
index 39f5e343bf4d..2fae312edb31 100644
--- a/fs/hfs/trans.c
+++ b/fs/hfs/trans.c
@@ -49,7 +49,8 @@ int hfs_mac2asc(struct super_block *sb, char *out, const struct hfs_name *in)

while (srclen > 0) {
if (nls_disk) {
- size = nls_disk->char2uni(src, srclen, &ch);
+ size = nls_char2uni(nls_disk, src, srclen,
+ &ch);
if (size <= 0) {
ch = '?';
size = 1;
@@ -62,7 +63,7 @@ int hfs_mac2asc(struct super_block *sb, char *out, const struct hfs_name *in)
}
if (ch == '/')
ch = ':';
- size = nls_io->uni2char(ch, dst, dstlen);
+ size = nls_uni2char(nls_io, ch, dst, dstlen);
if (size < 0) {
if (size == -ENAMETOOLONG)
goto out;
@@ -110,7 +111,7 @@ void hfs_asc2mac(struct super_block *sb, struct hfs_name *out, const struct qstr
wchar_t ch;

while (srclen > 0) {
- size = nls_io->char2uni(src, srclen, &ch);
+ size = nls_char2uni(nls_io, src, srclen, &ch);
if (size < 0) {
ch = '?';
size = 1;
@@ -120,7 +121,7 @@ void hfs_asc2mac(struct super_block *sb, struct hfs_name *out, const struct qstr
if (ch == ':')
ch = '/';
if (nls_disk) {
- size = nls_disk->uni2char(ch, dst, dstlen);
+ size = nls_uni2char(nls_disk, ch, dst, dstlen);
if (size < 0) {
if (size == -ENAMETOOLONG)
goto out;
diff --git a/fs/hfsplus/unicode.c b/fs/hfsplus/unicode.c
index c8d1b2be7854..057dc7e57cb1 100644
--- a/fs/hfsplus/unicode.c
+++ b/fs/hfsplus/unicode.c
@@ -190,7 +190,7 @@ int hfsplus_uni2asc(struct super_block *sb,
c0 = ':';
break;
}
- res = nls->uni2char(c0, op, len);
+ res = nls_uni2char(nls, c0, op, len);
if (res < 0) {
if (res == -ENAMETOOLONG)
goto out;
@@ -233,7 +233,7 @@ int hfsplus_uni2asc(struct super_block *sb,
cc = c0;
}
done:
- res = nls->uni2char(cc, op, len);
+ res = nls_uni2char(nls, cc, op, len);
if (res < 0) {
if (res == -ENAMETOOLONG)
goto out;
@@ -256,7 +256,7 @@ int hfsplus_uni2asc(struct super_block *sb,
static inline int asc2unichar(struct super_block *sb, const char *astr, int len,
wchar_t *uc)
{
- int size = HFSPLUS_SB(sb)->nls->char2uni(astr, len, uc);
+ int size = nls_char2uni(HFSPLUS_SB(sb)->nls, astr, len, uc);
if (size <= 0) {
*uc = '?';
size = 1;
diff --git a/fs/isofs/joliet.c b/fs/isofs/joliet.c
index be8b6a9d0b92..56fac73b27a5 100644
--- a/fs/isofs/joliet.c
+++ b/fs/isofs/joliet.c
@@ -25,7 +25,8 @@ uni16_to_x8(unsigned char *ascii, __be16 *uni, int len, struct nls_table *nls)

while ((ch = get_unaligned(ip)) && len) {
int llen;
- llen = nls->uni2char(be16_to_cpu(ch), op, NLS_MAX_CHARSET_SIZE);
+ llen = nls_uni2char(nls, be16_to_cpu(ch), op,
+ NLS_MAX_CHARSET_SIZE);
if (llen > 0)
op += llen;
else
diff --git a/fs/jfs/jfs_unicode.c b/fs/jfs/jfs_unicode.c
index 0148e2e4d97a..4ca88ef661e9 100644
--- a/fs/jfs/jfs_unicode.c
+++ b/fs/jfs/jfs_unicode.c
@@ -41,9 +41,8 @@ int jfs_strfromUCS_le(char *to, const __le16 * from,
for (i = 0; (i < len) && from[i]; i++) {
int charlen;
charlen =
- codepage->uni2char(le16_to_cpu(from[i]),
- &to[outlen],
- NLS_MAX_CHARSET_SIZE);
+ nls_uni2char(codepage, le16_to_cpu(from[i]),
+ &to[outlen], NLS_MAX_CHARSET_SIZE);
if (charlen > 0)
outlen += charlen;
else
@@ -88,7 +87,7 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len,
if (codepage) {
for (i = 0; len && *from; i++, from += charlen, len -= charlen)
{
- charlen = codepage->char2uni(from, len, &to[i]);
+ charlen = nls_char2uni(codepage, from, len, &to[i]);
if (charlen < 1) {
jfs_err("jfs_strtoUCS: char2uni returned %d.",
charlen);
diff --git a/fs/nls/nls_euc-jp.c b/fs/nls/nls_euc-jp.c
index 162b3f160353..eec257545f04 100644
--- a/fs/nls/nls_euc-jp.c
+++ b/fs/nls/nls_euc-jp.c
@@ -413,7 +413,7 @@ static int uni2char(const wchar_t uni,

if (!p_nls)
return -EINVAL;
- if ((n = p_nls->uni2char(uni, out, boundlen)) < 0)
+ if ((n = nls_uni2char(p_nls, uni, out, boundlen)) < 0)
return n;

/* translate SJIS into EUC-JP */
@@ -543,7 +543,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
sjis_temp[1] = 0x00;
}

- if ( (n = p_nls->char2uni(sjis_temp, sizeof(sjis_temp), uni)) < 0)
+ if ( (n = nls_char2uni(p_nls, sjis_temp, sizeof(sjis_temp), uni)) < 0)
return n;

return euc_offset;
diff --git a/fs/nls/nls_koi8-ru.c b/fs/nls/nls_koi8-ru.c
index a80a741a8676..32781252110d 100644
--- a/fs/nls/nls_koi8-ru.c
+++ b/fs/nls/nls_koi8-ru.c
@@ -28,12 +28,12 @@ static int uni2char(const wchar_t uni,
else if (uni == 0x255d || uni == 0x256c)
return 0;
else
- return p_nls->uni2char(uni, out, boundlen);
+ return nls_uni2char(p_nls, uni, out, boundlen);
return 1;
}
else
/* fast path */
- return p_nls->uni2char(uni, out, boundlen);
+ return nls_uni2char(p_nls, uni, out, boundlen);
}

static int char2uni(const unsigned char *rawstring, int boundlen,
@@ -47,7 +47,7 @@ static int char2uni(const unsigned char *rawstring, int boundlen,
return 1;
}

- n = p_nls->char2uni(rawstring, boundlen, uni);
+ n = nls_char2uni(p_nls, rawstring, boundlen, uni);
return n;
}

diff --git a/fs/ntfs/unistr.c b/fs/ntfs/unistr.c
index 005ca4b0f132..e0a5f33441df 100644
--- a/fs/ntfs/unistr.c
+++ b/fs/ntfs/unistr.c
@@ -269,8 +269,8 @@ int ntfs_nlstoucs(const ntfs_volume *vol, const char *ins,
ucs = kmem_cache_alloc(ntfs_name_cache, GFP_NOFS);
if (likely(ucs)) {
for (i = o = 0; i < ins_len; i += wc_len) {
- wc_len = nls->char2uni(ins + i, ins_len - i,
- &wc);
+ wc_len = nls_char2uni(nls, ins + i,
+ ins_len - i, &wc);
if (likely(wc_len >= 0 &&
o < NTFS_MAX_NAME_LEN)) {
if (likely(wc)) {
@@ -355,8 +355,8 @@ int ntfs_ucstonls(const ntfs_volume *vol, const ntfschar *ins,
goto mem_err_out;
}
for (i = o = 0; i < ins_len; i++) {
-retry: wc = nls->uni2char(le16_to_cpu(ins[i]), ns + o,
- ns_len - o);
+retry: wc = nls_uni2char(nls, le16_to_cpu(ins[i]),
+ ns + o, ns_len - o);
if (wc > 0) {
o += wc;
continue;
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 499e486b3722..5073ecd57279 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -59,6 +59,19 @@ extern int utf8s_to_utf16s(const u8 *s, int len,
extern int utf16s_to_utf8s(const wchar_t *pwcs, int len,
enum utf16_endian endian, u8 *s, int maxlen);

+static inline int nls_uni2char(const struct nls_table *table, wchar_t uni,
+ unsigned char *out, int boundlen)
+{
+ return table->uni2char(uni, out, boundlen);
+}
+
+static inline int nls_char2uni(const struct nls_table *table,
+ const unsigned char *rawstring,
+ int boundlen, wchar_t *uni)
+{
+ return table->char2uni(rawstring, boundlen, uni);
+}
+
static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c)
{
unsigned char nc = t->charset2lower[c];
--
2.20.0.rc2

2018-12-06 22:05:15

by Gabriel Krisman Bertazi

Subject: [PATCH v4 09/23] nls: Add new interface for string comparisons

From: Gabriel Krisman Bertazi <[email protected]>

The existing stricmp() interface is limited because it does not accept a
separate length parameter for each string being compared. This is a
problem for charsets doing normalization or full casefold comparison,
since strings of different sizes can still match. To resolve this
problem, this patch implements a new interface, allowing charsets to do
the comparison themselves, if needed.

The original stricmp is left in the code until we convert all callers to
the new interface. Nevertheless, it was reimplemented using the new
interface.
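
The need for two lengths can be seen with a toy example. The sketch
below is not the charset code from this series; toy_casefold() and
toy_strncasecmp() are hypothetical and know a single full-casefold
rule, U+FB01 (the "fi" ligature, bytes EF AC 81 in UTF-8) folding to
"fi", plus ASCII lowercasing. That is enough to make a 5-byte name
match a 4-byte one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy casefold: ASCII letters are lowercased and the UTF-8 sequence
 * EF AC 81 (U+FB01) expands to "fi". Returns the folded length, or -1
 * if the output buffer is too small.
 */
static long toy_casefold(const unsigned char *in, size_t len,
			 unsigned char *out, size_t outlen)
{
	size_t i = 0, o = 0;

	assert(in != NULL && out != NULL);
	while (i < len) {
		if (i + 2 < len && in[i] == 0xEF &&
		    in[i + 1] == 0xAC && in[i + 2] == 0x81) {
			if (outlen - o < 2)
				return -1;
			out[o++] = 'f';
			out[o++] = 'i';
			i += 3;
			continue;
		}
		if (o >= outlen)
			return -1;
		if (in[i] >= 'A' && in[i] <= 'Z')
			out[o++] = (unsigned char)(in[i] + 32);
		else
			out[o++] = in[i];
		i++;
	}
	return (long)o;
}

/* Returns 0 if the names match after folding, non-zero otherwise. */
static int toy_strncasecmp(const unsigned char *s1, size_t len1,
			   const unsigned char *s2, size_t len2)
{
	unsigned char f1[256], f2[256];
	long n1 = toy_casefold(s1, len1, f1, sizeof(f1));
	long n2 = toy_casefold(s2, len2, f2, sizeof(f2));

	/* Lengths are compared after folding, not before. */
	if (n1 < 0 || n2 < 0 || n1 != n2)
		return 1;
	return !!memcmp(f1, f2, (size_t)n1);
}
```

An interface taking a single length cannot express this comparison,
because neither input length describes both strings.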

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
include/linux/nls.h | 69 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 66 insertions(+), 3 deletions(-)

diff --git a/include/linux/nls.h b/include/linux/nls.h
index c43746bd390e..980103d4c363 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -3,6 +3,7 @@
#define _LINUX_NLS_H

#include <linux/init.h>
+#include <linux/string.h>

/* Unicode has changed over the years. Unicode code points no longer
* fit into 16 bits; as of Unicode 5 valid code points range from 0
@@ -38,6 +39,32 @@ struct nls_ops {
**/
int (*validate)(const struct nls_table *charset,
const unsigned char *str, size_t len);
+ /**
+ * @strncmp:
+ *
+ * strncmp is the function for case-sensitive string comparison.
+ * It only needs to be implemented by charsets that want to do
+ * some fancy comparisons, like normalization-insensitive.
+ *
+ * Returns 0 if str1 and str2 are equal, otherwise return
+ * non-zero.
+ **/
+ int (*strncmp)(const struct nls_table *charset,
+ const unsigned char *str1, size_t len1,
+ const unsigned char *str2, size_t len2);
+
+ /**
+ * @strncasecmp:
+ *
+ * strncasecmp is the function for case-insensitive string
+ * comparison.
+ *
+ * Returns 0 if str1 and str2 are equal, otherwise return
+ * non-zero.
+ **/
+ int (*strncasecmp)(const struct nls_table *charset,
+ const unsigned char *str1, size_t len1,
+ const unsigned char *str2, size_t len2);
unsigned char (*lowercase)(const struct nls_table *charset,
unsigned int c);
unsigned char (*uppercase)(const struct nls_table *charset,
@@ -139,10 +166,21 @@ static inline unsigned char nls_toupper(const struct nls_table *t,
return nc ? nc : c;
}

-static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1,
- const unsigned char *s2, int len)
+static inline int nls_strncasecmp(struct nls_table *t,
+ const unsigned char *s1, size_t len1,
+ const unsigned char *s2, size_t len2)
{
- while (len--) {
+ if (t->ops->strncasecmp)
+ return t->ops->strncasecmp(t, s1, len1, s2, len2);
+
+ if (IS_STRICT_MODE(t) &&
+ (nls_validate(t, s1, len1) || nls_validate(t, s2, len2)))
+ return -EINVAL;
+
+ if (len1 != len2)
+ return 1;
+
+ while (len1--) {
if (nls_tolower(t, *s1++) != nls_tolower(t, *s2++))
return 1;
}
@@ -150,6 +188,31 @@ static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1,
return 0;
}

+static inline int nls_strncmp(struct nls_table *t,
+ const unsigned char *s1, size_t len1,
+ const unsigned char *s2, size_t len2)
+{
+ if (t->ops->strncmp)
+ return t->ops->strncmp(t, s1, len1, s2, len2);
+
+ if (IS_STRICT_MODE(t) &&
+ (nls_validate(t, s1, len1) || nls_validate(t, s2, len2)))
+ return -EINVAL;
+
+ if (len1 != len2)
+ return 1;
+
+ /* strnicmp did not return negative values. So let's keep the
+ * abi for now */
+ return !!memcmp(s1, s2, len1);
+}
+
+static inline int nls_strnicmp(struct nls_table *t, const unsigned char *s1,
+ const unsigned char *s2, int len)
+{
+ return nls_strncasecmp(t, s1, len, s2, len);
+}
+
/*
* nls_nullsize - return length of null character for codepage
* @codepage - codepage for which to return length of NULL terminator
--
2.20.0.rc2

2018-12-06 22:04:49

by Gabriel Krisman Bertazi

Subject: [PATCH v4 02/23] nls: Wrap charset field access

From: Gabriel Krisman Bertazi <[email protected]>

The goal is to simplify the following patches that split nls_table. No
behavior changes intended.

<smpl>

@@
struct nls_table *c;
@@

- c->charset
+ nls_charset_name(c)

</smpl>

Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
---
fs/befs/linuxvfs.c | 4 ++--
fs/cifs/cifs_unicode.c | 6 +++---
fs/cifs/cifsfs.c | 2 +-
fs/cifs/connect.c | 2 +-
fs/fat/inode.c | 6 ++++--
fs/hfs/super.c | 6 ++++--
fs/hfsplus/options.c | 2 +-
fs/isofs/inode.c | 5 +++--
fs/jfs/jfs_unicode.c | 2 +-
fs/jfs/super.c | 3 ++-
fs/nls/nls_base.c | 2 +-
fs/ntfs/inode.c | 2 +-
fs/ntfs/super.c | 6 +++---
fs/ntfs/unistr.c | 5 +++--
fs/udf/super.c | 3 ++-
include/linux/nls.h | 5 +++++
16 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c
index 0ba368fbfad4..8b7af0a9011a 100644
--- a/fs/befs/linuxvfs.c
+++ b/fs/befs/linuxvfs.c
@@ -555,7 +555,7 @@ befs_utf2nls(struct super_block *sb, const char *in,

conv_err:
befs_error(sb, "Name using character set %s contains a character that "
- "cannot be converted to unicode.", nls->charset);
+ "cannot be converted to unicode.", nls_charset_name(nls));
befs_debug(sb, "<--- %s", __func__);
kfree(result);
return -EILSEQ;
@@ -635,7 +635,7 @@ befs_nls2utf(struct super_block *sb, const char *in,

conv_err:
befs_error(sb, "Name using character set %s contains a character that "
- "cannot be converted to unicode.", nls->charset);
+ "cannot be converted to unicode.", nls_charset_name(nls));
befs_debug(sb, "<--- %s", __func__);
kfree(result);
return -EILSEQ;
diff --git a/fs/cifs/cifs_unicode.c b/fs/cifs/cifs_unicode.c
index ffad8b4f90d1..2a9396d24f60 100644
--- a/fs/cifs/cifs_unicode.c
+++ b/fs/cifs/cifs_unicode.c
@@ -153,7 +153,7 @@ cifs_mapchar(char *target, const __u16 *from, const struct nls_table *cp,

surrogate_pair:
/* convert SURROGATE_PAIR and IVS */
- if (strcmp(cp->charset, "utf8"))
+ if (strcmp(nls_charset_name(cp), "utf8"))
goto unknown;
len = utf16s_to_utf8s(from, 3, UTF16_LITTLE_ENDIAN, target, 6);
if (len <= 0)
@@ -268,7 +268,7 @@ cifs_strtoUTF16(__le16 *to, const char *from, int len,
wchar_t wchar_to; /* needed to quiet sparse */

/* special case for utf8 to handle no plane0 chars */
- if (!strcmp(codepage->charset, "utf8")) {
+ if (!strcmp(nls_charset_name(codepage), "utf8")) {
/*
* convert utf8 -> utf16, we assume we have enough space
* as caller should have assumed conversion does not overflow
@@ -527,7 +527,7 @@ cifsConvertToUTF16(__le16 *target, const char *source, int srclen,
goto ctoUTF16;

/* convert SURROGATE_PAIR */
- if (strcmp(cp->charset, "utf8") || !wchar_to)
+ if (strcmp(nls_charset_name(cp), "utf8") || !wchar_to)
goto unknown;
if (*(source + i) & 0x80) {
charlen = utf8_to_utf32(source + i, 6, &u);
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 865706edb307..b0531986cfd7 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -414,7 +414,7 @@ cifs_show_nls(struct seq_file *s, struct nls_table *cur)
/* Display iocharset= option if it's not default charset */
def = load_nls_default();
if (def != cur)
- seq_printf(s, ",iocharset=%s", cur->charset);
+ seq_printf(s, ",iocharset=%s", nls_charset_name(cur));
unload_nls(def);
}

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 6f24f129a751..3011276a06f0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -3191,7 +3191,7 @@ compare_mount_options(struct super_block *sb, struct cifs_mnt_data *mnt_data)
old->mnt_dir_mode != new->mnt_dir_mode)
return 0;

- if (strcmp(old->local_nls->charset, new->local_nls->charset))
+ if (strcmp(nls_charset_name(old->local_nls), nls_charset_name(new->local_nls)))
return 0;

if (old->actimeo != new->actimeo)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index c0b5b5c3373b..2563dc306e7f 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -948,10 +948,12 @@ static int fat_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",allow_utime=%04o", opts->allow_utime);
if (sbi->nls_disk)
/* strip "cp" prefix from displayed option */
- seq_printf(m, ",codepage=%s", &sbi->nls_disk->charset[2]);
+ seq_printf(m, ",codepage=%s",
+ &nls_charset_name(sbi->nls_disk)[2]);
if (isvfat) {
if (sbi->nls_io)
- seq_printf(m, ",iocharset=%s", sbi->nls_io->charset);
+ seq_printf(m, ",iocharset=%s",
+ nls_charset_name(sbi->nls_io));

switch (opts->shortname) {
case VFAT_SFN_DISPLAY_WIN95 | VFAT_SFN_CREATE_WIN95:
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index 173876782f73..b16ca01180a5 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -151,9 +151,11 @@ static int hfs_show_options(struct seq_file *seq, struct dentry *root)
if (sbi->session >= 0)
seq_printf(seq, ",session=%u", sbi->session);
if (sbi->nls_disk)
- seq_printf(seq, ",codepage=%s", sbi->nls_disk->charset);
+ seq_printf(seq, ",codepage=%s",
+ nls_charset_name(sbi->nls_disk));
if (sbi->nls_io)
- seq_printf(seq, ",iocharset=%s", sbi->nls_io->charset);
+ seq_printf(seq, ",iocharset=%s",
+ nls_charset_name(sbi->nls_io));
if (sbi->s_quiet)
seq_printf(seq, ",quiet");
return 0;
diff --git a/fs/hfsplus/options.c b/fs/hfsplus/options.c
index 047e05c57560..2d6644465566 100644
--- a/fs/hfsplus/options.c
+++ b/fs/hfsplus/options.c
@@ -230,7 +230,7 @@ int hfsplus_show_options(struct seq_file *seq, struct dentry *root)
if (sbi->session >= 0)
seq_printf(seq, ",session=%u", sbi->session);
if (sbi->nls)
- seq_printf(seq, ",nls=%s", sbi->nls->charset);
+ seq_printf(seq, ",nls=%s", nls_charset_name(sbi->nls));
if (test_bit(HFSPLUS_SB_NODECOMPOSE, &sbi->flags))
seq_puts(seq, ",nodecompose");
if (test_bit(HFSPLUS_SB_NOBARRIER, &sbi->flags))
diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
index 488a9e7f8f66..b23a3955b8c6 100644
--- a/fs/isofs/inode.c
+++ b/fs/isofs/inode.c
@@ -520,8 +520,9 @@ static int isofs_show_options(struct seq_file *m, struct dentry *root)

#ifdef CONFIG_JOLIET
if (sbi->s_nls_iocharset &&
- strcmp(sbi->s_nls_iocharset->charset, CONFIG_NLS_DEFAULT) != 0)
- seq_printf(m, ",iocharset=%s", sbi->s_nls_iocharset->charset);
+ strcmp(nls_charset_name(sbi->s_nls_iocharset), CONFIG_NLS_DEFAULT) != 0)
+ seq_printf(m, ",iocharset=%s",
+ nls_charset_name(sbi->s_nls_iocharset));
#endif
return 0;
}
diff --git a/fs/jfs/jfs_unicode.c b/fs/jfs/jfs_unicode.c
index 4ca88ef661e9..1e89b3b8caa7 100644
--- a/fs/jfs/jfs_unicode.c
+++ b/fs/jfs/jfs_unicode.c
@@ -92,7 +92,7 @@ static int jfs_strtoUCS(wchar_t * to, const unsigned char *from, int len,
jfs_err("jfs_strtoUCS: char2uni returned %d.",
charlen);
jfs_err("charset = %s, char = 0x%x",
- codepage->charset, *from);
+ nls_charset_name(codepage), *from);
return charlen;
}
}
diff --git a/fs/jfs/super.c b/fs/jfs/super.c
index 65d8fc87ab11..a04ff4bc5afd 100644
--- a/fs/jfs/super.c
+++ b/fs/jfs/super.c
@@ -736,7 +736,8 @@ static int jfs_show_options(struct seq_file *seq, struct dentry *root)
if (sbi->flag & JFS_DISCARD)
seq_printf(seq, ",discard=%u", sbi->minblks_trim);
if (sbi->nls_tab)
- seq_printf(seq, ",iocharset=%s", sbi->nls_tab->charset);
+ seq_printf(seq, ",iocharset=%s",
+ nls_charset_name(sbi->nls_tab));
if (sbi->flag & JFS_ERR_CONTINUE)
seq_printf(seq, ",errors=continue");
if (sbi->flag & JFS_ERR_PANIC)
diff --git a/fs/nls/nls_base.c b/fs/nls/nls_base.c
index 52ccd34b1e79..e5d083b6e2b2 100644
--- a/fs/nls/nls_base.c
+++ b/fs/nls/nls_base.c
@@ -277,7 +277,7 @@ static struct nls_table *find_nls(char *charset)
struct nls_table *nls;
spin_lock(&nls_lock);
for (nls = tables; nls; nls = nls->next) {
- if (!strcmp(nls->charset, charset))
+ if (!strcmp(nls_charset_name(nls), charset))
break;
if (nls->alias && !strcmp(nls->alias, charset))
break;
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index bd3221cbdd95..872ef265b117 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2313,7 +2313,7 @@ int ntfs_show_options(struct seq_file *sf, struct dentry *root)
seq_printf(sf, ",fmask=0%o", vol->fmask);
seq_printf(sf, ",dmask=0%o", vol->dmask);
}
- seq_printf(sf, ",nls=%s", vol->nls_map->charset);
+ seq_printf(sf, ",nls=%s", nls_charset_name(vol->nls_map));
if (NVolCaseSensitive(vol))
seq_printf(sf, ",case_sensitive");
if (NVolShowSystemFiles(vol))
diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index bb7159f697f2..1c68c33e9816 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -224,7 +224,7 @@ static bool parse_options(ntfs_volume *vol, char *opt)
}
ntfs_error(vol->sb, "NLS character set %s not "
"found. Using previous one %s.",
- v, old_nls->charset);
+ v, nls_charset_name(old_nls));
nls_map = old_nls;
} else /* nls_map */ {
unload_nls(old_nls);
@@ -274,7 +274,7 @@ static bool parse_options(ntfs_volume *vol, char *opt)
"on remount.");
return false;
} /* else (!vol->nls_map) */
- ntfs_debug("Using NLS character set %s.", nls_map->charset);
+ ntfs_debug("Using NLS character set %s.", nls_charset_name(nls_map));
vol->nls_map = nls_map;
} else /* (!nls_map) */ {
if (!vol->nls_map) {
@@ -285,7 +285,7 @@ static bool parse_options(ntfs_volume *vol, char *opt)
return false;
}
ntfs_debug("Using default NLS character set (%s).",
- vol->nls_map->charset);
+ nls_charset_name(vol->nls_map));
}
}
if (mft_zone_multiplier != -1) {
diff --git a/fs/ntfs/unistr.c b/fs/ntfs/unistr.c
index e0a5f33441df..a30911979a55 100644
--- a/fs/ntfs/unistr.c
+++ b/fs/ntfs/unistr.c
@@ -297,7 +297,7 @@ int ntfs_nlstoucs(const ntfs_volume *vol, const char *ins,
if (wc_len < 0) {
ntfs_error(vol->sb, "Name using character set %s contains "
"characters that cannot be converted to "
- "Unicode.", nls->charset);
+ "Unicode.", nls_charset_name(nls));
i = -EILSEQ;
} else /* if (o >= NTFS_MAX_NAME_LEN) */ {
ntfs_error(vol->sb, "Name is too long (maximum length for a "
@@ -386,7 +386,8 @@ retry: wc = nls_uni2char(nls, le16_to_cpu(ins[i]),
conversion_err:
ntfs_error(vol->sb, "Unicode name contains characters that cannot be "
"converted to character set %s. You might want to "
- "try to use the mount option nls=utf8.", nls->charset);
+ "try to use the mount option nls=utf8.",
+ nls_charset_name(nls));
if (ns != *outs)
kfree(ns);
if (wc != -ENAMETOOLONG)
diff --git a/fs/udf/super.c b/fs/udf/super.c
index 8f2f56d9a1bb..284087eb64d0 100644
--- a/fs/udf/super.c
+++ b/fs/udf/super.c
@@ -361,7 +361,8 @@ static int udf_show_options(struct seq_file *seq, struct dentry *root)
if (UDF_QUERY_FLAG(sb, UDF_FLAG_UTF8))
seq_puts(seq, ",utf8");
if (UDF_QUERY_FLAG(sb, UDF_FLAG_NLS_MAP) && sbi->s_nls_map)
- seq_printf(seq, ",iocharset=%s", sbi->s_nls_map->charset);
+ seq_printf(seq, ",iocharset=%s",
+ nls_charset_name(sbi->s_nls_map));

return 0;
}
diff --git a/include/linux/nls.h b/include/linux/nls.h
index 5073ecd57279..cacbcd7d63e6 100644
--- a/include/linux/nls.h
+++ b/include/linux/nls.h
@@ -72,6 +72,11 @@ static inline int nls_char2uni(const struct nls_table *table,
return table->char2uni(rawstring, boundlen, uni);
}

+static inline const char *nls_charset_name(const struct nls_table *table)
+{
+ return table->charset;
+}
+
static inline unsigned char nls_tolower(struct nls_table *t, unsigned char c)
{
unsigned char nc = t->charset2lower[c];
--
2.20.0.rc2

2018-12-06 22:05:38

by Gabriel Krisman Bertazi

Subject: [PATCH v4 15/23] nls: utf8: Introduce code for UTF-8 normalization

From: Olaf Weber <[email protected]>

Supporting functions for UTF-8 normalization are in nls_utf8-norm.c with
the header utf8n.h. Two normalization forms are supported: nfkdi and
nfkdicf.

nfkdi:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.

nfkdicf:
- Apply unicode normalization form NFKD.
- Remove any Default_Ignorable_Code_Point.
- Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

- The values encoded are 0x1..0x10FFFF.
- The surrogate codepoints 0xD800..0xDFFF are not encoded.
- The shortest possible encoding is used for all values.
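
As a rough illustration of the three rules above, the stand-alone C sketch
below decodes one sequence by hand and rejects out-of-range values,
surrogates and overlong forms. This is not how the patch does it: the patch
validates through the trie lookup. demo_utf8_valid_seq() is a hypothetical
name used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: return the length of the
 * valid UTF-8 sequence at s (touching at most len bytes), or 0 if the
 * sequence is invalid or truncated.
 */
static size_t demo_utf8_valid_seq(const unsigned char *s, size_t len)
{
	/* Smallest value that needs an n-byte encoding, indexed by n. */
	static const uint32_t min_value[5] = { 0, 0x1, 0x80, 0x800, 0x10000 };
	uint32_t cp;
	size_t n, i;

	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		n = 1;
		cp = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		cp = s[0] & 0x07;
	} else {
		return 0;	/* stray continuation or invalid lead byte */
	}
	if (n > len)
		return 0;
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	/* Rule 1: the values encoded are 0x1..0x10FFFF. */
	if (cp < 0x1 || cp > 0x10FFFF)
		return 0;
	/* Rule 2: the surrogates 0xD800..0xDFFF are not encoded. */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;
	/* Rule 3: only the shortest possible encoding is accepted. */
	if (cp < min_value[n])
		return 0;
	return n;
}

/* Usage example: a two-byte U+00E9 is accepted, an overlong 'A' is not. */
void demo_utf8_valid_selfcheck(void)
{
	assert(demo_utf8_valid_seq((const unsigned char *)"\xC3\xA9", 2) == 2);
	assert(demo_utf8_valid_seq((const unsigned char *)"\xC1\x81", 2) == 0);
}
```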

The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).
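
The two prefixes pair up so that the NUL-terminated variant can forward to
the length-limited one with an effectively unlimited length, which is what
utf8lookup() does with utf8nlookup() in this patch. A minimal sketch of that
pattern follows. demo_utf8chars() and demo_utf8nchars() are hypothetical
helpers that only count sequences and assume the input is already valid
UTF-8; demo_utf8clen() copies the arithmetic of utf8clen().

```c
#include <assert.h>
#include <stddef.h>

/* Same lead-byte length computation as utf8clen() in the patch. */
static int demo_utf8clen(const char *s)
{
	unsigned char c = (unsigned char)*s;

	return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
}

/* Length-limited variant: count sequences, touching at most len bytes. */
static size_t demo_utf8nchars(const char *s, size_t len)
{
	size_t count = 0;

	while (len && *s) {
		size_t clen = (size_t)demo_utf8clen(s);

		if (clen > len)
			break;	/* final sequence is cut off by len */
		len -= clen;
		s += clen;
		count++;
	}
	return count;
}

/* NUL-terminated variant: forward with an effectively unlimited length. */
static size_t demo_utf8chars(const char *s)
{
	return demo_utf8nchars(s, (size_t)-1);
}

/* Usage example: 'a', U+00E9 and U+20AC take 1 + 2 + 3 bytes. */
void demo_utf8chars_selfcheck(void)
{
	assert(demo_utf8chars("a\xC3\xA9\xE2\x82\xAC") == 3);
	assert(demo_utf8nchars("a\xC3\xA9\xE2\x82\xAC", 3) == 2);
}
```

The length-limited form is the one a filesystem would normally want, since
directory entry names are not guaranteed to be NUL-terminated.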

Relative to the original SGI patch, and for conformity with coding
standards, the utf8data_t typedef was dropped, since it was just masking
the struct keyword. The utf8leaf_t and utf8trie_t typedefs were kept,
since they are simple pointers to memory buffers, and using uchars here
wouldn't provide any more meaningful information.

Changes since RFC v2:
- Merge to NLS system

Changes since RFC v1:
- utf8version_is_supported receives maj, min and rev as separate
arguments. (Olaf Weber)
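
For reference, the three components are packed into a single unsigned int
by the UNICODE_AGE() macro from utf8n.h, and the result is compared against
the entries of utf8agetab[]. The sketch below reproduces the macro and the
backwards scan. demo_agetab[] and its contents are made up for illustration,
since the real table is generated into utf8data.h and is not part of this
email.

```c
#include <assert.h>

#define UNICODE_MAJ_SHIFT	(16)
#define UNICODE_MIN_SHIFT	(8)

#define UNICODE_AGE(MAJ, MIN, REV)				\
	(((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) |		\
	 ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) |		\
	 ((unsigned int)(REV)))

/* Illustrative contents only; the real utf8agetab[] is generated. */
static const unsigned int demo_agetab[] = {
	0,
	UNICODE_AGE(10, 0, 0),
	UNICODE_AGE(11, 0, 0),
};

/* Mirrors utf8version_is_supported(): scan from the newest entry down. */
static int demo_version_is_supported(unsigned char maj, unsigned char min,
				     unsigned char rev)
{
	int i = (int)(sizeof(demo_agetab) / sizeof(demo_agetab[0])) - 1;
	unsigned int version = UNICODE_AGE(maj, min, rev);

	while (i >= 0 && demo_agetab[i] != 0) {
		if (version == demo_agetab[i])
			return 1;
		i--;
	}
	return 0;
}

/* Usage example: 11.0.0 is in the made-up table, 9.0.0 is not. */
void demo_version_selfcheck(void)
{
	assert(UNICODE_AGE(11, 0, 0) == 0x0B0000);
	assert(demo_version_is_supported(11, 0, 0) == 1);
	assert(demo_version_is_supported(9, 0, 0) == 0);
}
```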

Signed-off-by: Olaf Weber <[email protected]>
Signed-off-by: Gabriel Krisman Bertazi <[email protected]>
[Rebase to Mainline]
[Fix up checkpatch.pl warnings]
[Drop typedefs]
[Merge with NLS subsystem]
---
fs/nls/Makefile | 2 +
fs/nls/nls_utf8-norm.c | 640 +++++++++++++++++++++++++++++++++++++++++
fs/nls/utf8n.h | 112 ++++++++
3 files changed, 754 insertions(+)
create mode 100644 fs/nls/nls_utf8-norm.c
create mode 100644 fs/nls/utf8n.h

diff --git a/fs/nls/Makefile b/fs/nls/Makefile
index c94221b6108d..bd13c1a90767 100644
--- a/fs/nls/Makefile
+++ b/fs/nls/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NLS_KOI8_U) += nls_koi8-u.o nls_koi8-ru.o

obj-$(CONFIG_NLS_UTF8) += nls_utf8.o
nls_utf8-y += nls_utf8-core.o
+nls_utf8-$(CONFIG_NLS_UTF8_NORMALIZATION) += nls_utf8-norm.o

obj-$(CONFIG_NLS_MAC_CELTIC) += mac-celtic.o
obj-$(CONFIG_NLS_MAC_CENTEURO) += mac-centeuro.o
@@ -59,6 +60,7 @@ obj-$(CONFIG_NLS_MAC_ROMANIAN) += mac-romanian.o
obj-$(CONFIG_NLS_MAC_ROMAN) += mac-roman.o
obj-$(CONFIG_NLS_MAC_TURKISH) += mac-turkish.o

+$(obj)/nls_utf8-norm.o: $(obj)/utf8data.h
$(obj)/utf8data.h: $(srctree)/$(src)/ucd/*.txt $(objtree)/scripts/mkutf8data FORCE
$(call cmd,mkutf8data)
quiet_cmd_mkutf8data = MKUTF8DATA $@
diff --git a/fs/nls/nls_utf8-norm.c b/fs/nls/nls_utf8-norm.c
new file mode 100644
index 000000000000..ca0bbf644b49
--- /dev/null
+++ b/fs/nls/nls_utf8-norm.c
@@ -0,0 +1,640 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#include "utf8n.h"
+
+struct utf8data {
+ unsigned int maxage;
+ unsigned int offset;
+};
+
+#define __INCLUDED_FROM_UTF8NORM_C__
+#include "utf8data.h"
+#undef __INCLUDED_FROM_UTF8NORM_C__
+
+int utf8version_is_supported(u8 maj, u8 min, u8 rev)
+{
+ int i = ARRAY_SIZE(utf8agetab) - 1;
+ unsigned int sb_utf8version = UNICODE_AGE(maj, min, rev);
+
+ while (i >= 0 && utf8agetab[i] != 0) {
+ if (sb_utf8version == utf8agetab[i])
+ return 1;
+ i--;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(utf8version_is_supported);
+
+/*
+ * UTF-8 valid ranges.
+ *
+ * The UTF-8 encoding spreads the bits of a 32bit word over several
+ * bytes. This table gives the ranges that can be held and how they'd
+ * be represented.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000000 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000000 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * There is an additional requirement on UTF-8, in that only the
+ * shortest representation of a 32bit value is to be used. A decoder
+ * must not decode sequences that do not satisfy this requirement.
+ * Thus the allowed ranges have a lower bound.
+ *
+ * 0x00000000 0x0000007F: 0xxxxxxx
+ * 0x00000080 0x000007FF: 110xxxxx 10xxxxxx
+ * 0x00000800 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
+ * 0x00010000 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x00200000 0x03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ * 0x04000000 0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
+ *
+ * Actual unicode characters are limited to the range 0x0 - 0x10FFFF,
+ * 17 planes of 65536 values. This limits the sequences actually seen
+ * even more, to just the following.
+ *
+ * 0 - 0x7F: 0 - 0x7F
+ * 0x80 - 0x7FF: 0xC2 0x80 - 0xDF 0xBF
+ * 0x800 - 0xFFFF: 0xE0 0xA0 0x80 - 0xEF 0xBF 0xBF
+ * 0x10000 - 0x10FFFF: 0xF0 0x90 0x80 0x80 - 0xF4 0x8F 0xBF 0xBF
+ *
+ * Within those ranges the surrogates 0xD800 - 0xDFFF are not allowed.
+ *
+ * Note that the longest sequence seen with valid usage is 4 bytes,
+ * the same as a single UTF-32 character. This makes the UTF-8
+ * representation of Unicode never larger than UTF-32.
+ *
+ * The shortest sequence requirement was introduced by:
+ * Corrigendum #1: UTF-8 Shortest Form
+ * It can be found here:
+ * http://www.unicode.org/versions/corrigendum1.html
+ *
+ */
+
+/*
+ * Return the number of bytes used by the current UTF-8 sequence.
+ * Assumes the input points to the first byte of a valid UTF-8
+ * sequence.
+ */
+static inline int utf8clen(const char *s)
+{
+ unsigned char c = *s;
+
+ return 1 + (c >= 0xC0) + (c >= 0xE0) + (c >= 0xF0);
+}
+
+/*
+ * utf8trie_t
+ *
+ * A compact binary tree, used to decode UTF-8 characters.
+ *
+ * Internal nodes are one byte for the node itself, and up to three
+ * bytes for an offset into the tree. The first byte contains the
+ * following information:
+ * NEXTBYTE - flag - advance to next byte if set
+ * BITNUM - 3 bit field - the bit number to be tested
+ * OFFLEN - 2 bit field - number of bytes in the offset
+ * if offlen == 0 (non-branching node)
+ * RIGHTPATH - 1 bit field - set if the following node is for the
+ * right-hand path (tested bit is set)
+ * TRIENODE - 1 bit field - set if the following node is an internal
+ * node, otherwise it is a leaf node
+ * if offlen != 0 (branching node)
+ * LEFTNODE - 1 bit field - set if the left-hand node is internal
+ * RIGHTNODE - 1 bit field - set if the right-hand node is internal
+ *
+ * Due to the way utf8 works, there cannot be branching nodes with
+ * NEXTBYTE set, and moreover those nodes always have a righthand
+ * descendant.
+ */
+typedef const unsigned char utf8trie_t;
+#define BITNUM 0x07
+#define NEXTBYTE 0x08
+#define OFFLEN 0x30
+#define OFFLEN_SHIFT 4
+#define RIGHTPATH 0x40
+#define TRIENODE 0x80
+#define RIGHTNODE 0x40
+#define LEFTNODE 0x80
+
+/*
+ * utf8leaf_t
+ *
+ * The leaves of the trie are embedded in the trie, and so the same
+ * underlying datatype: unsigned char.
+ *
+ * leaf[0]: The unicode version, stored as a generation number that is
+ * an index into utf8agetab[]. With this we can filter code
+ * points based on the unicode version in which they were
+ * defined. The CCC of a non-defined code point is 0.
+ * leaf[1]: Canonical Combining Class. During normalization, we need
+ * to do a stable sort into ascending order of all characters
+ * with a non-zero CCC that occur between two characters with
+ * a CCC of 0, or at the beginning or end of a string.
+ * The unicode standard guarantees that all CCC values are
+ * between 0 and 254 inclusive, which leaves 255 available as
+ * a special value.
+ * Code points with CCC 0 are known as stoppers.
+ * leaf[2]: Decomposition. If leaf[1] == 255, then leaf[2] is the
+ * start of a NUL-terminated string that is the decomposition
+ * of the character.
+ * The CCC of a decomposable character is the same as the CCC
+ * of the first character of its decomposition.
+ * Some characters decompose as the empty string: these are
+ * characters with the Default_Ignorable_Code_Point property.
+ * These do affect normalization, as they all have CCC 0.
+ *
+ * The decompositions in the trie have been fully expanded.
+ *
+ * Casefolding, if applicable, is also done using decompositions.
+ *
+ * The trie is constructed in such a way that leaves exist for all
+ * UTF-8 sequences that match the criteria from the "UTF-8 valid
+ * ranges" comment above, and only for those sequences. Therefore a
+ * lookup in the trie can be used to validate the UTF-8 input.
+ */
+typedef const unsigned char utf8leaf_t;
+
+#define LEAF_GEN(LEAF) ((LEAF)[0])
+#define LEAF_CCC(LEAF) ((LEAF)[1])
+#define LEAF_STR(LEAF) ((const char *)((LEAF) + 2))
+
+#define MINCCC (0)
+#define MAXCCC (254)
+#define STOPPER (0)
+#define DECOMPOSE (255)
+
+/*
+ * Use trie to scan s, touching at most len bytes.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * A non-NULL return guarantees that the UTF-8 sequence starting at s
+ * is well-formed and corresponds to a known unicode code point. The
+ * shorthand for this will be "is valid UTF-8 unicode".
+ */
+static utf8leaf_t *utf8nlookup(const struct utf8data *data, const char *s,
+ size_t len)
+{
+ utf8trie_t *trie;
+ int offlen;
+ int offset;
+ int mask;
+ int node;
+
+ if (!data || len == 0)
+ return NULL;
+ /* Dereference data only after the NULL check above. */
+ trie = utf8data + data->offset;
+ node = 1;
+ while (node) {
+ offlen = (*trie & OFFLEN) >> OFFLEN_SHIFT;
+ if (*trie & NEXTBYTE) {
+ if (--len == 0)
+ return NULL;
+ s++;
+ }
+ mask = 1 << (*trie & BITNUM);
+ if (*s & mask) {
+ /* Right leg */
+ if (offlen) {
+ /* Right node at offset of trie */
+ node = (*trie & RIGHTNODE);
+ offset = trie[offlen];
+ while (--offlen) {
+ offset <<= 8;
+ offset |= trie[offlen];
+ }
+ trie += offset;
+ } else if (*trie & RIGHTPATH) {
+ /* Right node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ } else {
+ /* No right node. */
+ node = 0;
+ trie = NULL;
+ }
+ } else {
+ /* Left leg */
+ if (offlen) {
+ /* Left node after this node. */
+ node = (*trie & LEFTNODE);
+ trie += offlen + 1;
+ } else if (*trie & RIGHTPATH) {
+ /* No left node. */
+ node = 0;
+ trie = NULL;
+ } else {
+ /* Left node after this node */
+ node = (*trie & TRIENODE);
+ trie++;
+ }
+ }
+ }
+ return trie;
+}
+
+/*
+ * Use trie to scan s.
+ * Returns the leaf if one exists, NULL otherwise.
+ *
+ * Forwards to utf8nlookup().
+ */
+static utf8leaf_t *utf8lookup(const struct utf8data *data, const char *s)
+{
+ return utf8nlookup(data, s, (size_t)-1);
+}
+
+/*
+ * Maximum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if only non-assigned code points are used.
+ */
+int utf8agemax(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemax);
+
+/*
+ * Minimum age of any character in s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ * Return 0 if non-assigned code points are used.
+ */
+int utf8agemin(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ int age;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8agemin);
+
+/*
+ * Maximum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemax(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int age = 0;
+ int leaf_age;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age > age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemax);
+
+/*
+ * Minimum age of any character in s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+int utf8nagemin(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ int leaf_age;
+ int age;
+
+ if (!data)
+ return -1;
+ age = data->maxage;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ leaf_age = utf8agetab[LEAF_GEN(leaf)];
+ if (leaf_age <= data->maxage && leaf_age < age)
+ age = leaf_age;
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return age;
+}
+EXPORT_SYMBOL(utf8nagemin);
+
+/*
+ * Length of the normalization of s.
+ * Return -1 if s is not valid UTF-8 unicode.
+ *
+ * A string of Default_Ignorable_Code_Point has length 0.
+ */
+ssize_t utf8len(const struct utf8data *data, const char *s)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (*s) {
+ leaf = utf8lookup(data, s);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8len);
+
+/*
+ * Length of the normalization of s, touch at most len bytes.
+ * Return -1 if s is not valid UTF-8 unicode.
+ */
+ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len)
+{
+ utf8leaf_t *leaf;
+ size_t ret = 0;
+
+ if (!data)
+ return -1;
+ while (len && *s) {
+ leaf = utf8nlookup(data, s, len);
+ if (!leaf)
+ return -1;
+ if (utf8agetab[LEAF_GEN(leaf)] > data->maxage)
+ ret += utf8clen(s);
+ else if (LEAF_CCC(leaf) == DECOMPOSE)
+ ret += strlen(LEAF_STR(leaf));
+ else
+ ret += utf8clen(s);
+ len -= utf8clen(s);
+ s += utf8clen(s);
+ }
+ return ret;
+}
+EXPORT_SYMBOL(utf8nlen);
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : string.
+ * len : length of s.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len)
+{
+ if (!data)
+ return -1;
+ if (!s)
+ return -1;
+ u8c->data = data;
+ u8c->s = s;
+ u8c->p = NULL;
+ u8c->ss = NULL;
+ u8c->sp = NULL;
+ u8c->len = len;
+ u8c->slen = 0;
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ /* Check we didn't clobber the maximum length. */
+ if (u8c->len != len)
+ return -1;
+ /* The first byte of s may not be an utf8 continuation. */
+ if (len > 0 && (*s & 0xC0) == 0x80)
+ return -1;
+ return 0;
+}
+EXPORT_SYMBOL(utf8ncursor);
+
+/*
+ * Set up a utf8cursor for use by utf8byte().
+ *
+ * u8c : pointer to cursor.
+ * data : const struct utf8data to use for normalization.
+ * s : NUL-terminated string.
+ *
+ * Returns -1 on error, 0 on success.
+ */
+int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s)
+{
+ return utf8ncursor(u8c, data, s, (unsigned int)-1);
+}
+EXPORT_SYMBOL(utf8cursor);
+
+/*
+ * Get one byte from the normalized form of the string described by u8c.
+ *
+ * Returns the byte cast to an unsigned char on success, and -1 on failure.
+ *
+ * The cursor keeps track of the location in the string in u8c->s.
+ * When a character is decomposed, the current location is stored in
+ * u8c->p, and u8c->s is set to the start of the decomposition. Note
+ * that bytes from a decomposition do not count against u8c->len.
+ *
+ * Characters are emitted if they match the current CCC in u8c->ccc.
+ * Hitting end-of-string while u8c->ccc == STOPPER means we're done,
+ * and the function returns 0 in that case.
+ *
+ * Sorting by CCC is done by repeatedly scanning the string. The
+ * values of u8c->s and u8c->p are stored in u8c->ss and u8c->sp at
+ * the start of the scan. The first pass finds the lowest CCC to be
+ * emitted and stores it in u8c->nccc, the second pass emits the
+ * characters with this CCC and finds the next lowest CCC. This limits
+ * the number of passes to 1 + the number of different CCCs in the
+ * sequence being scanned.
+ *
+ * Therefore:
+ * u8c->p != NULL -> a decomposition is being scanned.
+ * u8c->ss != NULL -> this is a repeating scan.
+ * u8c->ccc == -1 -> this is the first scan of a repeating scan.
+ */
+int utf8byte(struct utf8cursor *u8c)
+{
+ utf8leaf_t *leaf;
+ int ccc;
+
+ for (;;) {
+ /* Check for the end of a decomposed character. */
+ if (u8c->p && *u8c->s == '\0') {
+ u8c->s = u8c->p;
+ u8c->p = NULL;
+ }
+
+ /* Check for end-of-string. */
+ if (!u8c->p && (u8c->len == 0 || *u8c->s == '\0')) {
+ /* There is no next byte. */
+ if (u8c->ccc == STOPPER)
+ return 0;
+ /* End-of-string during a scan counts as a stopper. */
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ } else if ((*u8c->s & 0xC0) == 0x80) {
+ /* This is a continuation of the current character. */
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Look up the data for the current character. */
+ if (u8c->p)
+ leaf = utf8lookup(u8c->data, u8c->s);
+ else
+ leaf = utf8nlookup(u8c->data, u8c->s, u8c->len);
+
+ /* No leaf found implies that the input is a binary blob. */
+ if (!leaf)
+ return -1;
+
+ ccc = LEAF_CCC(leaf);
+ /* Characters that are too new have CCC 0. */
+ if (utf8agetab[LEAF_GEN(leaf)] > u8c->data->maxage) {
+ ccc = STOPPER;
+ } else if (ccc == DECOMPOSE) {
+ u8c->len -= utf8clen(u8c->s);
+ u8c->p = u8c->s + utf8clen(u8c->s);
+ u8c->s = LEAF_STR(leaf);
+ /* Empty decomposition implies CCC 0. */
+ if (*u8c->s == '\0') {
+ if (u8c->ccc == STOPPER)
+ continue;
+ ccc = STOPPER;
+ goto ccc_mismatch;
+ }
+ leaf = utf8lookup(u8c->data, u8c->s);
+ }
+
+ /*
+ * If this is not a stopper, then see if it updates
+ * the next canonical class to be emitted.
+ */
+ if (ccc != STOPPER && u8c->ccc < ccc && ccc < u8c->nccc)
+ u8c->nccc = ccc;
+
+ /*
+ * Return the current byte if this is the current
+ * combining class.
+ */
+ if (ccc == u8c->ccc) {
+ if (!u8c->p)
+ u8c->len--;
+ return (unsigned char)*u8c->s++;
+ }
+
+ /* Current combining class mismatch. */
+ccc_mismatch:
+ if (u8c->nccc == STOPPER) {
+ /*
+ * Scan forward for the first canonical class
+ * to be emitted. Save the position from
+ * which to restart.
+ */
+ u8c->ccc = MINCCC - 1;
+ u8c->nccc = ccc;
+ u8c->sp = u8c->p;
+ u8c->ss = u8c->s;
+ u8c->slen = u8c->len;
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (ccc != STOPPER) {
+ /* Not a stopper, and not the ccc we're emitting. */
+ if (!u8c->p)
+ u8c->len -= utf8clen(u8c->s);
+ u8c->s += utf8clen(u8c->s);
+ } else if (u8c->nccc != MAXCCC + 1) {
+ /* At a stopper, restart for next ccc. */
+ u8c->ccc = u8c->nccc;
+ u8c->nccc = MAXCCC + 1;
+ u8c->s = u8c->ss;
+ u8c->p = u8c->sp;
+ u8c->len = u8c->slen;
+ } else {
+ /* All done, proceed from here. */
+ u8c->ccc = STOPPER;
+ u8c->nccc = STOPPER;
+ u8c->sp = NULL;
+ u8c->ss = NULL;
+ u8c->slen = 0;
+ }
+ }
+}
+EXPORT_SYMBOL(utf8byte);
+
+const struct utf8data *utf8nfkdi(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdidata) - 1;
+
+ while (maxage < utf8nfkdidata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdidata[i].maxage)
+ return NULL;
+ return &utf8nfkdidata[i];
+}
+EXPORT_SYMBOL(utf8nfkdi);
+
+const struct utf8data *utf8nfkdicf(unsigned int maxage)
+{
+ int i = ARRAY_SIZE(utf8nfkdicfdata) - 1;
+
+ while (maxage < utf8nfkdicfdata[i].maxage)
+ i--;
+ if (maxage > utf8nfkdicfdata[i].maxage)
+ return NULL;
+ return &utf8nfkdicfdata[i];
+}
+EXPORT_SYMBOL(utf8nfkdicf);
diff --git a/fs/nls/utf8n.h b/fs/nls/utf8n.h
new file mode 100644
index 000000000000..0f5fc14d4fd2
--- /dev/null
+++ b/fs/nls/utf8n.h
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2014 SGI.
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#ifndef UTF8NORM_H
+#define UTF8NORM_H
+
+#include <linux/types.h>
+#include <linux/export.h>
+#include <linux/string.h>
+#include <linux/module.h>
+
+/* Encoding a unicode version number as a single unsigned int. */
+#define UNICODE_MAJ_SHIFT (16)
+#define UNICODE_MIN_SHIFT (8)
+
+#define UNICODE_AGE(MAJ, MIN, REV) \
+ (((unsigned int)(MAJ) << UNICODE_MAJ_SHIFT) | \
+ ((unsigned int)(MIN) << UNICODE_MIN_SHIFT) | \
+ ((unsigned int)(REV)))
+
+/* Returns 1 if the unicode version is supported by the data tables, else 0. */
+extern int utf8version_is_supported(u8 maj, u8 min, u8 rev);
+
+/*
+ * Look for the correct const struct utf8data for a unicode version.
+ * Returns NULL if the version requested is too new.
+ *
+ * Two normalization forms are supported: nfkdi and nfkdicf.
+ *
+ * nfkdi:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ *
+ * nfkdicf:
+ * - Apply unicode normalization form NFKD.
+ * - Remove any Default_Ignorable_Code_Point.
+ * - Apply a full casefold (C + F).
+ */
+extern const struct utf8data *utf8nfkdi(unsigned int maxage);
+extern const struct utf8data *utf8nfkdicf(unsigned int maxage);
+
+/*
+ * Determine the maximum age of any unicode character in the string.
+ * Returns 0 if only unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemax(const struct utf8data *data, const char *s);
+extern int utf8nagemax(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the minimum age of any unicode character in the string.
+ * Returns 0 if any unassigned code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern int utf8agemin(const struct utf8data *data, const char *s);
+extern int utf8nagemin(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Determine the length of the normalized form of the string,
+ * excluding any terminating NUL byte.
+ * Returns 0 if only ignorable code points are present.
+ * Returns -1 if the input is not valid UTF-8.
+ */
+extern ssize_t utf8len(const struct utf8data *data, const char *s);
+extern ssize_t utf8nlen(const struct utf8data *data, const char *s, size_t len);
+
+/*
+ * Cursor structure used by the normalizer.
+ */
+struct utf8cursor {
+ const struct utf8data *data;
+ const char *s;
+ const char *p;
+ const char *ss;
+ const char *sp;
+ unsigned int len;
+ unsigned int slen;
+ short int ccc;
+ short int nccc;
+};
+
+/*
+ * Initialize a utf8cursor to normalize a string.
+ * Returns 0 on success.
+ * Returns -1 on failure.
+ */
+extern int utf8cursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s);
+extern int utf8ncursor(struct utf8cursor *u8c, const struct utf8data *data,
+ const char *s, size_t len);
+
+/*
+ * Get the next byte in the normalization.
+ * Returns a value > 0 && < 256 on success.
+ * Returns 0 when the end of the normalization is reached.
+ * Returns -1 if the string being normalized is not valid UTF-8.
+ */
+extern int utf8byte(struct utf8cursor *u8c);
+
+#endif /* UTF8NORM_H */
--
2.20.0.rc2
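
To make the utf8trie_t node layout described in nls_utf8-norm.c more
concrete, here is a small stand-alone sketch that splits the first byte of
an internal node into its fields, using the same masks as the patch.
struct demo_node and demo_decode_node() are hypothetical helpers for
illustration and are not part of the series.

```c
#include <assert.h>

/* Masks copied from nls_utf8-norm.c in this patch. */
#define BITNUM		0x07
#define NEXTBYTE	0x08
#define OFFLEN		0x30
#define OFFLEN_SHIFT	4
#define RIGHTPATH	0x40
#define TRIENODE	0x80
#define RIGHTNODE	0x40
#define LEFTNODE	0x80

/* Hypothetical unpacked view of the first byte of an internal node. */
struct demo_node {
	int bitnum;	/* which bit of the current input byte is tested */
	int nextbyte;	/* advance to the next input byte before testing */
	int offlen;	/* offset bytes that follow; 0 means non-branching */
	int flag40;	/* RIGHTPATH if offlen == 0, RIGHTNODE otherwise */
	int flag80;	/* TRIENODE if offlen == 0, LEFTNODE otherwise */
};

static struct demo_node demo_decode_node(unsigned char b)
{
	struct demo_node n;

	n.bitnum = b & BITNUM;
	n.nextbyte = !!(b & NEXTBYTE);
	n.offlen = (b & OFFLEN) >> OFFLEN_SHIFT;
	n.flag40 = !!(b & RIGHTPATH);
	n.flag80 = !!(b & TRIENODE);
	return n;
}

/*
 * Usage example: 0x97 is a branching node with a one-byte offset that
 * tests bit 7 and whose left-hand child is an internal node.
 */
void demo_node_selfcheck(void)
{
	struct demo_node n = demo_decode_node(0x97);

	assert(n.bitnum == 7 && n.offlen == 1);
	assert(n.nextbyte == 0 && n.flag40 == 0 && n.flag80 == 1);
}
```

The 0x40 and 0x80 bits are deliberately reused: their meaning depends on
whether OFFLEN is zero, which is why utf8nlookup() reads offlen first.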