Date: Thu, 14 May 2015 17:25:39 -0700
Subject: Re: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks
From: Linus Torvalds
To: Al Viro
Cc: Jeremy Allison, Linux Kernel Mailing List, linux-fsdevel, Christoph Hellwig, Neil Brown
In-Reply-To: <20150514233632.GG7232@ZenIV.linux.org.uk>
References: <20150505052205.GS889@ZenIV.linux.org.uk>
 <20150511180650.GA4147@ZenIV.linux.org.uk>
 <20150513222533.GA24192@ZenIV.linux.org.uk>
 <20150514033040.GF7232@ZenIV.linux.org.uk>
 <20150514220932.GC31808@samba2>
 <20150514233632.GG7232@ZenIV.linux.org.uk>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 14, 2015 at 4:36 PM, Al Viro wrote:
> On Thu, May 14, 2015 at 04:24:13PM -0700, Linus Torvalds wrote:
>
>> So ASCII-only case-insensitivity is sufficient for you guys?
>>
>> Doing case-insensitive lookups at a vfs layer level wouldn't be
>> impossible (add some new lookup flag, so it would *not* be
>> per-filesystem, it would be per-operation!),
>
> ENOPARSE. Either two names are equivalent or they are not; it's not a
> per-operation thing.

What do you mean? We can easily make things per-operation, by adding
another flag. We already have per-operation flags like LOOKUP_FOLLOW,
which decides if we follow the last symlink or not. We could add a
LOOKUP_ICASE, which decides whether we compare case or not.

Obviously, we'd have to add the proper O_ICASE for open (and AT_ICASE
for fstatat() and friends).
Exactly like we do for LOOKUP_FOLLOW.

HOWEVER. The reason ASCII-only matters is two-fold:

 (a) hashing needs to work, and hash all equivalent names to the same
     bucket.

     And we need to hash the same *regardless* of whether the
     operation was done with ICASE or not. With ASCII, this is fairly
     easy: we could easily make the hashing just mask bit 5 in each
     byte, and that wouldn't slow us down at all, and it would hardly
     change the hash effectiveness either.

     In particular, with ASCII, we can trivially still do the
     word-at-a-time hashing. So there's fairly little downside.

 (b) The *compare* needs to work too.

     In particular, right now we very much try to avoid comparing the
     names by checking both the full hash and the name length. Again,
     that's fine with ASCII - two names that differ in case are the
     same length. And again, we can still use the word-at-a-time
     compare, just have a mask (and at compare time, we can make the
     mask depend on ICASE). Sure, you'll still have to do a more
     careful compare (because case-insensitivity is not *just* "same
     except for bit 5", even in ASCII), but we can trivially have an
     ICASE test up front, and keep the fast case exactly the same as
     before.

Now, doing full UTF-8 is *much* harder. Part of it is that outside of
ASCII, you literally have cases that are ambiguous. Part of it is that
outside of ASCII, now the lengths aren't even guaranteed to match. And
part of it is that now you have to do things that are much more
complex than just masking bits in parallel for multiple bytes at the
same time (although you can still have a fast-path that depends on
just masking the high bit, to at least say "this is just the ASCII
subcase").

But doing ASCII ICASE compares wouldn't be that hard, and wouldn't
affect performance.

Btw, don't get me wrong. I'm not saying it's a great idea. I think
icase compares are stupid. Really really stupid. But samba might be
worth jumping through a few hoops for.
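To make (a) and (b) above concrete, here is a minimal userspace sketch,
not kernel code. Everything in it is an assumption made for
illustration: the names name_hash(), name_equal(), ascii_fold() and
icase_selfcheck() are hypothetical, the hash is an FNV-style stand-in
rather than the real dentry hash, and the "icase" argument stands in
for the proposed (not existing) LOOKUP_ICASE flag. It only shows the
shape of the idea: always mask bit 5 when hashing, eight bytes at a
time, and do the careful per-byte compare only when ICASE is asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Clears bit 5 (0x20) in each of the eight bytes of a word. */
#define CASE_MASK_WORD UINT64_C(0xdfdfdfdfdfdfdfdf)

/*
 * Hypothetical stand-in hash (FNV-style mixing, NOT the kernel's).
 * Bit 5 is masked unconditionally, so the hash is the same whether or
 * not the lookup is case-insensitive, and names that differ only in
 * ASCII case always land in the same bucket.
 */
static uint64_t name_hash(const char *name, size_t len)
{
	uint64_t hash = UINT64_C(0xcbf29ce484222325);

	while (len > 0) {
		uint64_t word = 0;	/* short tail is zero-padded */
		size_t n = len < sizeof(word) ? len : sizeof(word);

		memcpy(&word, name, n);	/* one word at a time */
		hash = (hash ^ (word & CASE_MASK_WORD)) * UINT64_C(0x100000001b3);
		name += n;
		len -= n;
	}
	return hash;
}

/* Fold only 'A'..'Z'; every other byte is left alone. */
static unsigned char ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
}

/*
 * Compare two names. The ICASE test is up front, so the ordinary
 * case-sensitive path is a plain length check plus memcmp().
 */
static int name_equal(const char *a, size_t alen,
		      const char *b, size_t blen, int icase)
{
	size_t i;

	if (alen != blen)	/* ASCII case never changes the length */
		return 0;
	if (!icase)
		return memcmp(a, b, alen) == 0;

	/* Careful compare: bit 5 alone would also equate '@' and '`'. */
	for (i = 0; i < alen; i++)
		if (ascii_fold((unsigned char)a[i]) !=
		    ascii_fold((unsigned char)b[i]))
			return 0;
	return 1;
}

/* Small usage example. */
void icase_selfcheck(void)
{
	assert(name_hash("Makefile", 8) == name_hash("MAKEFILE", 8));
	assert(name_equal("Makefile", 8, "MAKEFILE", 8, 1));
	assert(!name_equal("Makefile", 8, "MAKEFILE", 8, 0));
}
```

Note that "a@" and "a`" hash identically here, because the hash masks
bit 5 in every byte, letters or not. That is harmless for correctness:
the hash only has to put equivalent names in the same bucket, and the
per-byte compare still tells those two names apart.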
The real problem is that even with just ASCII, it does make it much
easier to create nasty hash collisions in the dentry hashes (same hash
from 256 variations of aAaAAaaA - just repeat the same letter in
different variations of lower/upper case).

So even plain ASCII icase has some real problems. But it's
conceptually not that hard.

True UTF-8 icase? That's an absolute *nightmare*, and causes serious
problems. OS X got it very very wrong, for example, by messing up the
normalization.

                  Linus