Date: Mon, 20 Jan 2020 16:12:06 +0000
From: Al Viro
To: David Laight
Cc: Pali Rohár, OGAWA Hirofumi, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, "Theodore Y. Ts'o", Namjae Jeon,
    Gabriel Krisman Bertazi
Subject: Re: vfat: Broken case-insensitive support for UTF-8
Message-ID: <20200120161206.GC8904@ZenIV.linux.org.uk>
References: <20200119221455.bac7dc55g56q2l4r@pali>
    <87sgkan57p.fsf@mail.parknet.co.jp>
    <20200120110438.ak7jpyy66clx5v6x@pali>
    <89eba9906011446f8441090f496278d2@AcuMS.aculab.com>
    <20200120152009.5vbemgmvhke4qupq@pali>
    <1a4c545dc7f14e33b7e59321a0aab868@AcuMS.aculab.com>
In-Reply-To: <1a4c545dc7f14e33b7e59321a0aab868@AcuMS.aculab.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jan 20, 2020 at 03:47:22PM +0000, David Laight wrote:
> From: Pali Rohár
> > Sent: 20 January 2020 15:20
> ...
> > This is not possible. There is 1:1 mapping between UTF-8 sequence and
> > Unicode code point. wchar_t in kernel represent either one Unicode code
> > point (limited up to U+FFFF in NLS framework functions) or 2bytes in
> > UTF-16 sequence (only in utf8s_to_utf16s() and utf16s_to_utf8s()
> > functions).
>
> Unfortunately there is neither a 1:1 mapping of all possible byte sequences
> to wchar_t (or unicode code points), nor a 1:1 mapping of all possible
> wchar_t values to UTF-8.
> Really both need to be defined - even for otherwise 'invalid' sequences.

Who. Cares?

A filename is a sequence of octets, not codepoints. Its interpretation
is entirely up to the userland. Same goes for the notion of "case"
(locale-dependent, etc.); some filesystems impose their (arbitrary)
restrictions on the possible octet sequences (and equally arbitrary
equivalence relations between them) that can be approximated in terms
of upper/lower case in some locale.
It does not matter how arbitrary those are, or what stands behind them:
	* don't do that for any new filesystem designs
	* for existing filesystem types, the actual behaviour of the
	  native implementation IS THE ONE AND ONLY AUTHORITY. It does
	  not matter what misguided thought process it has come from;
	  the absolute requirement is that if you mount a filesystem
	  valid from the native implementation POV, you must leave it
	  in a state that would be valid from the native implementation
	  POV.

That's it. Any talk about normalization, etc. is completely pointless -
for any sane uses it's an opaque stream of octets that filesystem and
VFS should leave the fuck alone. Codepoints, encodings, etc. come into
the game only to the extent they are useful to describe the weird rules
a given filesystem might have. And they are just that - tools to
describe externally imposed mappings.
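
A minimal userland sketch of the two points argued over in this
exchange, written in Python purely for illustration. The helper names
(is_valid_utf8, is_encodable_codepoint, same_name) are hypothetical,
and plain byte equality is only an assumption about what "leave the
octets alone" amounts to; none of this is kernel code or the behaviour
of any particular filesystem.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the octet sequence decodes as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_encodable_codepoint(value: int) -> bool:
    """Return True if the integer has a strict UTF-8 encoding."""
    try:
        chr(value).encode("utf-8")
    except ValueError:
        # Covers out-of-range values and UnicodeEncodeError
        # (lone surrogates), which is a ValueError subclass.
        return False
    return True


def same_name(a: bytes, b: bytes) -> bool:
    """Hypothetical octet-preserving comparison: no case folding,
    no normalization, just byte equality."""
    return a == b


# Not every byte sequence maps to code points ...
valid = is_valid_utf8(b"\xc3\xa9")       # UTF-8 for U+00E9
invalid = is_valid_utf8(b"\xff\xfe")     # 0xFF never appears in UTF-8

# ... and not every 16-bit value maps to UTF-8 (lone surrogate).
euro_ok = is_encodable_codepoint(0x20AC)
surrogate_ok = is_encodable_codepoint(0xD800)

# Two names that render identically are still different octet strings.
precomposed = "\u00e9".encode("utf-8")   # single code point
decomposed = "e\u0301".encode("utf-8")   # 'e' + combining acute
equal = same_name(precomposed, decomposed)
```

Under byte-wise comparison the last two names are distinct; whether
some filesystem treats them as equivalent is exactly the kind of
externally imposed rule that has to be taken from its native
implementation.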