Date: Mon, 2 Mar 2020 21:47:41 +1100
From: Aleksa Sarai
To: lampahome
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: why do we need utf8 normalization when compare name?
Message-ID: <20200302104741.b5lypijqlbpq5lgz@yavin>
In-Reply-To: <20200302103754.nsvtne2vvduug77e@yavin>
References: <20200302103754.nsvtne2vvduug77e@yavin>

On 2020-03-02, Aleksa Sarai wrote:
> On 2020-03-02, lampahome wrote:
> > According to case insensitive since kernel 5.2, d_compare will
> > transform string into normalized form and then compare.
> >
> > But why do we need this normalization function? Could we just compare
> > by utf8 string?
>
> The problem is that there are multiple ways to represent the same glyph
> in Unicode -- for instance, you can represent Å (the symbol for
> angstrom) as both U+212B and U+0041 U+030A (the latin letter "A"
> followed by the ring-above symbol "°"). Different software may choose to
> represent the same glyphs in different Unicode forms, hence the need for
> normalisation.

Sorry, a better example would've been "ñ" (U+00F1). You can also
represent it as "n" (U+006E) followed by "◌̃" (U+0303 -- "combining
tilde"). Both forms are defined by Unicode to be canonically equivalent
so it would be incorrect to treat the two Unicode strings differently
(that isn't quite the case for "Å").

> [1] is the Wikipedia article that describes this problem and what the
> different kinds of Unicode normalisation are.
>
> [1]: https://en.wikipedia.org/wiki/Unicode_equivalence

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
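
To make the U+00F1 versus U+006E U+0303 example concrete, here is a small
userspace sketch in Python. It is only an illustration and not the kernel's
utf8 code: same_name() is a hypothetical helper, and the choice of NFD via
the standard unicodedata module is an assumption made for the demo. It shows
that a plain string (or byte) comparison treats the two spellings as
different names, while a comparison after normalisation treats them as equal.

```python
import unicodedata


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after canonical normalisation.

    NFD is used here purely for illustration; it is an assumption of this
    sketch, not a statement about what d_compare does internally.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "\u00f1"   # U+00F1 as a single code point
decomposed = "n\u0303"   # U+006E followed by U+0303 (combining tilde)

# The UTF-8 byte sequences differ, so a byte-wise compare says "different".
precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
raw_equal = precomposed_bytes == decomposed_bytes

# After normalisation both spellings map to the same code point sequence.
normalised_equal = same_name(precomposed, decomposed)

print("raw compare:       ", raw_equal)
print("normalised compare:", normalised_equal)
```

Without the normalisation step, both spellings could exist side by side as
distinct directory entries that look identical to the user.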