From: Gabriel Krisman Bertazi
To: viro@ZenIV.linux.org.uk
Cc: jra@google.com, tytso@mit.edu, olaf@sgi.com, darrick.wong@oracle.com,
    kernel@lists.collabora.co.uk, linux-fsdevel@vger.kernel.org,
    david@fromorbit.com, jack@suse.cz, linux-kernel@vger.kernel.org,
    Gabriel Krisman Bertazi
Subject: [PATCH v2 00/15] NLS refactor and UTF-8 normalization
Date: Mon, 21 May 2018 14:36:02 -0300
Message-Id: <20180521173617.31625-1-krisman@collabora.co.uk>

* Archive of previous versions:

https://www.spinics.net/lists/linux-fsdevel/msg125523.html

The goal of this patchset is to adapt the NLS subsystem to support full
UTF-8 normalization and casefold operations for specific versions of
Unicode.
There are many use cases for this feature and, while my specific
interest is in case insensitivity for local filesystems, I am aware this
might be on the wishlist of people working on SMB, NFS and others.

The first part of the patchset refactors the NLS subsystem to allow
multiple tables of the same encoding (differentiated by version) in an
inexpensive way. It also creates hooks for some higher-level operations,
like comparisons and normalization. A new interface is exported to
request a specific version of the charset.

The second part of the patchset introduces the UTF-8 normalization code
as a new NLS charset. It is implemented as a separate charset, called
utf8n, to preserve the current behavior of the nls_utf8 charset. The
normalization core is provided by the SGI patches from 2014, which I
refactored, adapted and updated to Unicode 10.0.0.

The last patch is a test module, which does some sanity checks on the
normalization code when it is loaded.

As usual, the Unicode source files are not part of the patch because
they are too big and would bounce the emails. This means that kbuild
will complain about not being able to generate ucd/*.txt files. Kbuild
will also complain about code that is just being moved, which I'm not
going to fix at this time. Please refer to v1 for more information.
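
As a rough illustration of the shape described for the first part of the
series (several versioned tables for one encoding, with the higher-level
hooks kept behind an ops structure), here is a standalone userspace C
sketch. Every identifier in it is hypothetical and none is taken from
the patches. The "9.0.0" entry is only there to show two versions side
by side, and the fold routine handles ASCII only, as a stand-in for real
Unicode casefolding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; this is not the interface from the patches. */

/* Per-charset hooks, grouped so that callers never touch table fields directly. */
struct demo_charset_ops {
	/* Returns 0 if the two strings are equal under this charset's rules. */
	int (*compare)(const char *a, size_t alen, const char *b, size_t blen);
};

/* One table per (charset, version) pair; version is NULL when unversioned. */
struct demo_table {
	const char *charset;
	const char *version;
	const struct demo_charset_ops *ops;
};

/* Byte-exact comparison, for a charset that does no folding. */
static int demo_cmp_exact(const char *a, size_t alen, const char *b, size_t blen)
{
	if (alen != blen)
		return 1;
	return memcmp(a, b, alen) != 0;
}

/* ASCII-only lowercase: a stand-in for real Unicode casefolding. */
static char demo_ascii_lower(char c)
{
	return (c >= 'A' && c <= 'Z') ? (char)(c - 'A' + 'a') : c;
}

/* Case-insensitive comparison over ASCII letters only. */
static int demo_cmp_fold(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t i;

	if (alen != blen)
		return 1;
	for (i = 0; i < alen; i++)
		if (demo_ascii_lower(a[i]) != demo_ascii_lower(b[i]))
			return 1;
	return 0;
}

static const struct demo_charset_ops demo_exact_ops = { .compare = demo_cmp_exact };
static const struct demo_charset_ops demo_fold_ops = { .compare = demo_cmp_fold };

/* Two versions of the same encoding share one ops structure. */
static const struct demo_table demo_tables[] = {
	{ "utf8",  NULL,     &demo_exact_ops },
	{ "utf8n", "9.0.0",  &demo_fold_ops },
	{ "utf8n", "10.0.0", &demo_fold_ops },
};

/*
 * Look up a table by charset and version. A NULL version returns the
 * first table registered for that charset. Returns NULL if nothing matches.
 */
const struct demo_table *demo_load_table(const char *charset, const char *version)
{
	size_t i;

	assert(charset != NULL);
	for (i = 0; i < sizeof(demo_tables) / sizeof(demo_tables[0]); i++) {
		const struct demo_table *t = &demo_tables[i];

		if (strcmp(t->charset, charset) != 0)
			continue;
		if (version == NULL)
			return t;
		if (t->version != NULL && strcmp(t->version, version) == 0)
			return t;
	}
	return NULL;
}
```

In this sketch a caller would ask for demo_load_table("utf8n", "10.0.0")
and then compare names only through ops->compare, so a filesystem pinned
to one version keeps the same comparison results when a newer table is
added later.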

If you are interested in fetching everything with minimal effort, you
can clone a branch from:

https://gitlab.collabora.com/krisman/linux.git -b nls-merge-int

Changes since v1:
  - Fix nls_base.ko module build

Gabriel Krisman Bertazi (11):
  nls: Wrap uni2char/char2uni callers
  nls: Wrap charset field access
  nls: Wrap charset hooks in ops structure
  nls: Split default charset from NLS core
  nls: Split struct nls_charset from struct nls_table
  nls: Add support for multiple versions of an encoding
  nls: Add new interface for string comparisons
  nls: Let charsets define the behavior of tolower/toupper
  nls: Add optional normalization and casefold hooks
  nls: utf8norm: Integrate utf8norm code with NLS subsystem
  nls: utf8norm: Introduce test module for utf8norm implementation

Olaf Weber (4):
  nls: utf8norm: Add unicode character database files
  scripts: add trie generator for UTF-8
  nls: utf8norm: Introduce code for UTF-8 normalization
  nls: utf8norm: reduce the size of utf8data[]

 drivers/staging/ncpfs/ioctl.c         |   13 +-
 drivers/staging/ncpfs/ncplib_kernel.c |    8 +-
 fs/befs/linuxvfs.c                    |    8 +-
 fs/cifs/cifs_unicode.c                |   15 +-
 fs/cifs/cifsfs.c                      |    2 +-
 fs/cifs/connect.c                     |    2 +-
 fs/cifs/dir.c                         |    7 +-
 fs/fat/dir.c                          |   13 +-
 fs/fat/inode.c                        |    6 +-
 fs/fat/namei_vfat.c                   |    6 +-
 fs/hfs/super.c                        |    6 +-
 fs/hfs/trans.c                        |    9 +-
 fs/hfsplus/options.c                  |    2 +-
 fs/hfsplus/unicode.c                  |    6 +-
 fs/isofs/inode.c                      |    5 +-
 fs/isofs/joliet.c                     |    3 +-
 fs/jfs/jfs_unicode.c                  |    9 +-
 fs/jfs/super.c                        |    3 +-
 fs/nls/Kconfig                        |   13 +
 fs/nls/Makefile                       |   19 +
 fs/nls/mac-celtic.c                   |   34 +-
 fs/nls/mac-centeuro.c                 |   34 +-
 fs/nls/mac-croatian.c                 |   34 +-
 fs/nls/mac-cyrillic.c                 |   34 +-
 fs/nls/mac-gaelic.c                   |   34 +-
 fs/nls/mac-greek.c                    |   34 +-
 fs/nls/mac-iceland.c                  |   34 +-
 fs/nls/mac-inuit.c                    |   34 +-
 fs/nls/mac-roman.c                    |   34 +-
 fs/nls/mac-romanian.c                 |   34 +-
 fs/nls/mac-turkish.c                  |   34 +-
 fs/nls/nls_ascii.c                    |   34 +-
 fs/nls/nls_core.c                     |  137 +
 fs/nls/nls_cp1250.c                   |   34 +-
 fs/nls/nls_cp1251.c                   |   34 +-
 fs/nls/nls_cp1255.c                   |   36 +-
 fs/nls/nls_cp437.c                    |   34 +-
 fs/nls/nls_cp737.c                    |   34 +-
 fs/nls/nls_cp775.c                    |   34 +-
 fs/nls/nls_cp850.c                    |   34 +-
 fs/nls/nls_cp852.c                    |   34 +-
 fs/nls/nls_cp855.c                    |   34 +-
 fs/nls/nls_cp857.c                    |   34 +-
 fs/nls/nls_cp860.c                    |   34 +-
 fs/nls/nls_cp861.c                    |   34 +-
 fs/nls/nls_cp862.c                    |   34 +-
 fs/nls/nls_cp863.c                    |   34 +-
 fs/nls/nls_cp864.c                    |   34 +-
 fs/nls/nls_cp865.c                    |   34 +-
 fs/nls/nls_cp866.c                    |   34 +-
 fs/nls/nls_cp869.c                    |   34 +-
 fs/nls/nls_cp874.c                    |   36 +-
 fs/nls/nls_cp932.c                    |   36 +-
 fs/nls/nls_cp936.c                    |   36 +-
 fs/nls/nls_cp949.c                    |   36 +-
 fs/nls/nls_cp950.c                    |   36 +-
 fs/nls/{nls_base.c => nls_default.c}  |  124 +-
 fs/nls/nls_euc-jp.c                   |   29 +-
 fs/nls/nls_iso8859-1.c                |   34 +-
 fs/nls/nls_iso8859-13.c               |   34 +-
 fs/nls/nls_iso8859-14.c               |   34 +-
 fs/nls/nls_iso8859-15.c               |   34 +-
 fs/nls/nls_iso8859-2.c                |   34 +-
 fs/nls/nls_iso8859-3.c                |   34 +-
 fs/nls/nls_iso8859-4.c                |   34 +-
 fs/nls/nls_iso8859-5.c                |   34 +-
 fs/nls/nls_iso8859-6.c                |   34 +-
 fs/nls/nls_iso8859-7.c                |   34 +-
 fs/nls/nls_iso8859-9.c                |   34 +-
 fs/nls/nls_koi8-r.c                   |   34 +-
 fs/nls/nls_koi8-ru.c                  |   30 +-
 fs/nls/nls_koi8-u.c                   |   34 +-
 fs/nls/nls_utf8.c                     |   34 +-
 fs/nls/nls_utf8n-core.c               |  291 +++
 fs/nls/nls_utf8n-norm.c               |  797 ++++++
 fs/nls/nls_utf8n-selftest.c           |  307 +++
 fs/nls/ucd/README                     |   33 +
 fs/nls/utf8n.h                        |  117 +
 fs/ntfs/inode.c                       |    2 +-
 fs/ntfs/super.c                       |    6 +-
 fs/ntfs/unistr.c                      |   13 +-
 fs/udf/super.c                        |    3 +-
 fs/udf/unicode.c                      |    4 +-
 include/linux/nls.h                   |  126 +-
 scripts/Makefile                      |    1 +
 scripts/mkutf8data.c                  | 3464 +++++++++++++++++++++++++
 86 files changed, 6772 insertions(+), 545 deletions(-)
 create mode 100644 fs/nls/nls_core.c
 rename fs/nls/{nls_base.c => nls_default.c} (89%)
 create mode 100644 fs/nls/nls_utf8n-core.c
 create mode 100644 fs/nls/nls_utf8n-norm.c
 create mode 100644 fs/nls/nls_utf8n-selftest.c
 create mode 100644 fs/nls/ucd/README
 create mode 100644 fs/nls/utf8n.h
 create mode 100644 scripts/mkutf8data.c

--
2.17.0