Date: Tue, 3 Mar 2020 12:22:09 -0500
From: "Theodore Y. Ts'o"
To: lampahome
Cc: Aleksa Sarai, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: why do we need utf8 normalization when compare name?
Message-ID: <20200303172209.GB61444@mit.edu>
References: <20200302103754.nsvtne2vvduug77e@yavin> <20200302104741.b5lypijqlbpq5lgz@yavin> <20200303070928.aawxoyeq77wnc3ts@yavin>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 03, 2020 at 06:13:56PM +0800, lampahome wrote:
> >
> > And yes, once the strings are normalised and encoded as UTF-8 you then
> > do a byte-by-byte comparison (if the comparison is case-insensitive then
> > fs/unicode/... will case-fold the Unicode symbols during normalisation).
>
> What I'm confused is why encoded as utf-8 after normalize finished?
> From above, turn "ñ" (U+00F1) and "n◌̃" (U+006E U+0303) into the same
> Unicode string. Then why should we just compare bytes from normalized.

For the same reason why we don't upcase or downcase all of the letters
in a directory with case-folding.  The term for this is
"case-preserving, case-insensitive" matching.  So that means that if
you save a file as "Makefile", ls will return "Makefile", and not
"MAKEFILE" or "makefile".  Of course, if you delete or truncate
"makefile", it will affect the file stored in the directory as
"Makefile", and the file system will not allow a directory with
case-folding enabled to contain "makefile" and "Makefile" at the same
time.

Similarly, with normalization, we preserve the existing utf-8 form
(both the composed and decomposed forms are valid utf-8), but we
compare without taking the composition form into account.

Cheers,

					- Ted

P.S.  Some people may hate this, but if the goal is interoperability
with how Windows and MacOS do things, this is basically what they do
as well.  (Well, mostly; MacOS is a little weird for historical
reasons.)

P.P.S.  And before you comment on it, as one Internationalization
expert once said, I18N *is* complicated.  It truly would be easier to
teach all of the world to speak a single language and use it as the
"Federation Standard" language, à la Star Trek.  For better or for
worse, that's not happening, and so we deal with the world as it is,
not as we would like it to be.  :-)
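
To make the "preserve the stored form, compare the normalized form" idea
above concrete, here is a minimal user-space sketch in Python.  It is not
the kernel's fs/unicode code: the names lookup_key() and Directory are
hypothetical, and NFD plus Python's str.casefold() are assumed as
stand-ins for whatever normalization and case-folding tables the file
system actually uses.

```python
import unicodedata
from typing import Dict, List


def lookup_key(name: str, casefold: bool = False) -> bytes:
    """Bytes used only for comparison; this is never stored as the name.

    Assumption: NFD and str.casefold() approximate the file system's
    normalization and case-folding.
    """
    text = unicodedata.normalize("NFD", name)
    if casefold:
        # Fold case, then normalize again, since folding can produce
        # sequences that are no longer in normalized form.
        text = unicodedata.normalize("NFD", text.casefold())
    return text.encode("utf-8")


class Directory:
    """Toy directory: keeps names exactly as created, matches on lookup_key()."""

    def __init__(self, casefold: bool = False) -> None:
        self.casefold = casefold
        self._entries: Dict[bytes, str] = {}  # comparison key -> name as created

    def create(self, name: str) -> None:
        key = lookup_key(name, self.casefold)
        if key in self._entries:
            # "makefile" and "Makefile" cannot coexist when case-folding is on.
            raise FileExistsError(self._entries[key])
        self._entries[key] = name

    def lookup(self, name: str) -> str:
        # Byte-by-byte comparison of the normalized keys (dict lookup on bytes).
        return self._entries[lookup_key(name, self.casefold)]

    def listing(self) -> List[str]:
        # Like ls: returns the preserved names, not the comparison keys.
        return list(self._entries.values())
```

With casefold=True, creating "Makefile" and then looking up "makefile"
finds the same entry, while listing() still reports "Makefile".  Without
case-folding, "ñ" (U+00F1) and "n" followed by U+0303 still resolve to
one entry, and the name comes back in whichever form it was created with.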