Date: Mon, 13 Apr 2020 12:10:07 +0200
From: Pali Rohár
To: "Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp"
Cc: "viro@zeniv.linux.org.uk", "'linux-fsdevel@vger.kernel.org'", "'linux-kernel@vger.kernel.org'", "'namjae.jeon@samsung.com'", "'sj1557.seo@samsung.com'"
Subject: Re: [PATCH 1/4] exfat: 
Simplify exfat_utf8_d_hash() for code points above U+FFFF
Message-ID: <20200413101007.lbey6q5u6jz3ulmr@pali>

On Monday 13 April 2020 08:13:45 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > On Wednesday 08 April 2020 03:59:06 Kohada.Tetsuhiro@dc.MitsubishiElectric.co.jp wrote:
> > > > So partial_name_hash() like I used it in this patch series is enough?
> > >
> > > I think partial_name_hash() is enough for 8/16/21bit characters.
> >
> > Great!
> >
> > Al, could you please take this patch series?
>
> I think it's good.
>
> > > Another point about the discrimination of 21bit characters:
> > > I think that checking in exfat_toupper () can be more simplified.
> > >
> > > ex: return a < PLANE_SIZE && sbi->vol_utbl[a] ? sbi->vol_utbl[a] : a;
> >
> > I was thinking about it, but it needs more refactoring. Currently
> > exfat_toupper() is used on other places for UTF-16 (u16 array) and
> > therefore it cannot be extended to take more then 16 bit value.
>
> I’m also a little worried that exfat_toupper() is designed for only utf16.
> Currently, it is converting from utf8 to utf32 in some places, and from utf8 to utf16 in others.
> Another way would be to unify to utf16.
>
> > But I agree that this is another step which can be improved.
>
> Yes.

There are two problems with it:

First, we do not know how code points above U+FFFF should be converted
to upper case. From the exfat specification it can be deduced only for
code points U+0000 .. U+FFFF. We asked MS for an answer, but I have not
received any response yet.

The second problem is that all MS filesystems (vfat, ntfs and exfat)
use neither UCS-2 nor UTF-16, but rather a mix of the two. Basically
any sequence of 16-bit values (except those :/<>... vfat chars) is
valid, even an unpaired surrogate half. So a surrogate pair (two 16-bit
values) represents one Unicode code point (as in UTF-16), but a single
unpaired surrogate half is also valid and represents the (invalid)
Unicode code point of its own value. Unicode reserves those values
(U+D800 .. U+DFFF) for surrogates and does not define them as
characters.

Therefore, before we talk about UTF-16 vs UTF-32, we first need to
settle how to handle those non-representable values in the VFS encoding
(iocharset=), as UTF-8 is not able to represent them either.

One option is to extend UTF-8 to the WTF-8 encoding [1] (yes, this is
real and it makes sense!) and then ideally change exfat_toupper() to
take UTF-32 values, without any restriction on surrogate values.

Btw, the same UTF-16 problem also exists in the vfat, ntfs and
iso/joliet kernel drivers.

[1] - https://simonsapin.github.io/wtf-8/
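
To make the WTF-8 idea above a bit more concrete, here is a minimal
userspace sketch in C. It is not kernel code and not taken from the
exfat driver: the names decode_unit(), encode_wtf8() and
utf16ish_to_wtf8() are made up for illustration. It assumes the input is
an arbitrary array of 16-bit values, combines valid surrogate pairs into
one code point, and passes an unpaired surrogate half through as its own
value, which then gets the ordinary three-byte UTF-8 bit layout.

```c
#include <stddef.h>
#include <stdint.h>

/* All helpers below are hypothetical; none of them exist in fs/exfat. */

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Decode one code point from a sequence of 16-bit units.
 * A valid surrogate pair yields one code point above U+FFFF,
 * an unpaired surrogate half yields its own value.
 * Returns the number of units consumed (1 or 2); len must be >= 1.
 */
static size_t decode_unit(const uint16_t *in, size_t len, uint32_t *cp)
{
	if (len >= 2 && is_high_surrogate(in[0]) && is_low_surrogate(in[1])) {
		uint32_t hi = (uint32_t)(in[0] - 0xD800);
		uint32_t lo = (uint32_t)(in[1] - 0xDC00);

		*cp = 0x10000 + ((hi << 10) | lo);
		return 2;
	}
	*cp = in[0];
	return 1;
}

/*
 * Encode one code point using the UTF-8 bit layout.
 * Surrogate values are not rejected here, so they come out as
 * three bytes; that is the difference between WTF-8 and strict UTF-8.
 * Returns the number of bytes written (1..4); out needs room for 4.
 */
static size_t encode_wtf8(uint32_t cp, uint8_t *out)
{
	if (cp < 0x80) {
		out[0] = (uint8_t)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (uint8_t)(0xC0 | (cp >> 6));
		out[1] = (uint8_t)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (uint8_t)(0xE0 | (cp >> 12));
		out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (uint8_t)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (uint8_t)(0xF0 | (cp >> 18));
	out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (uint8_t)(0x80 | (cp & 0x3F));
	return 4;
}

/*
 * Convert a whole 16-bit sequence to WTF-8.
 * Returns the number of bytes written, or (size_t)-1 if out is too small.
 */
static size_t utf16ish_to_wtf8(const uint16_t *in, size_t len,
			       uint8_t *out, size_t cap)
{
	size_t i = 0, n = 0;

	while (i < len) {
		uint8_t tmp[4];
		uint32_t cp;
		size_t k, j;

		i += decode_unit(in + i, len - i, &cp);
		k = encode_wtf8(cp, tmp);
		if (n + k > cap)
			return (size_t)-1;
		for (j = 0; j < k; j++)
			out[n++] = tmp[j];
	}
	return n;
}
```

With this layout a paired surrogate (for example 0xD83D 0xDE00) becomes
the usual four-byte UTF-8 sequence, while a lone 0xD83D becomes the
three bytes ED A0 BD, which a strict UTF-8 decoder would reject. An
upper-case lookup working on the decoded 32-bit value would then see
both cases without special handling.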