Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932892AbbEEX7w (ORCPT ); Tue, 5 May 2015 19:59:52 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:42822 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932693AbbEEX7v (ORCPT ); Tue, 5 May 2015 19:59:51 -0400 Date: Wed, 6 May 2015 00:59:47 +0100 From: Al Viro To: Linus Torvalds Cc: Linux Kernel Mailing List , linux-fsdevel , Christoph Hellwig , Neil Brown Subject: Re: [RFC][PATCHSET] non-recursive pathname resolution Message-ID: <20150505235947.GV889@ZenIV.linux.org.uk> References: <20150505052205.GS889@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7471 Lines: 159 On Tue, May 05, 2015 at 08:20:16AM -0700, Linus Torvalds wrote: > On Mon, May 4, 2015 at 10:22 PM, Al Viro wrote: > > My apologies for the patchbomb from hell, but I really think > > it needs a public review. > > Aside from the small nits I noted (and I guess I can even live with > the bad label name as long as it never gets resurrected) it looks > good. But I'd obviously like it to get as much testing/beating on as > possible before merging: I'm assuming it will have gone at least > through xfstests etc.. It is passing xfstests and LTP, plus some basic "create a twisted forest of symlinks and walk it" tests, but yes, it obviously needs more beating. I'll push everything up to #76 into -next tonight (with the changes you asked for). As for the next steps... * we want _all_ failure exits from RCU mode to empty the stack immediately - it will be too late to call ->put_link() once we dropped rcu_read_lock(). * we want failure exits from complete_walk() to empty the stack, for the sake of consistency (due to the above it will do so when called in RCU mode). * we want to postpone assignment to nd->seq until after we are sure we haven't found a symlink to follow; I have that in local queue, will post in the next batch for review. * we want to have pick_link() store directly into nd->stack[...].link; no point postponing that to get_link(). nd->link can die at that point. Done in local queue. * we want to store inode and seq along with path in nd->stack[...]. It means extra 4 words in struct nameidata (EMBEDDED_LEVELS * 2), i.e. not too horrible wrt stack footprint. should_follow_link()/pick_link() get those as additional arguments. Done in local queue. * we need a variant of legitimize_mnt() that would _not_ leave RCU mode. What I have in mind is legitimize_mnt(m) { switch (__legitimize_mnt(m, seq)) { case 0: return true; default: rcu_read_unlock(); mntput(m); rcu_read_lock(); case 1: return false; } } with the ability to call __legitimize_mnt(), remember the return value and, in case of failure, do all pending ->put_link() before dropping rcu_read_lock() We'll need that for unlazy_walk() et.al. - do this __legitimize_mnt() on path->nd.mnt and all nd->stack[0..nd->depth - 1].mnt, and if any of them fail, stop right there, do ->put_link() on everything, drop rcu_read_lock(), mntput() everything on stack we'd already legitimized, reset nd->depth to 0 and bugger off the way we currently would. If all of __legitimize_mnt() succeed, we go and acquire dentry references, checking them against nd->seq for nd->path and nd->stack[i].seq for nd->stack[i].link. Any failure => ->put_link() for everything on stack, rcu_read_unlock(), dput() everything we'd managed to grab in the stack, mntput() everything in the stack, reset nd->depth to 0 and bugger off the way we currently would. complete_walk() is similar - it starts with could be if (nd->flags & LOOKUP_RCU) { if (!(nd->flags & LOOKUP_ROOT)) nd->root.mnt = NULL; if (unlikely(unlazy_walk(nd, NULL)) return -ECHILD; } and perhaps it would make sense to try just that and see if unlazy_walk() goes visibly higher in profiles... Note that it *can* be called with non-empty stack and be required to legitimize its (only) element - e.g. creat() on a dangling symlink will step into that. So we can't just start with unconditionally emptying the stack. terminate_walk() is simpler - if we are in RCU mode, just call ->put_link() on everything in stack and reset nd->depth; if we are not, leave as in this patchset. * ->put_link() should get inode + cookie instead of dentry + cookie; sure, right now the only instance that wants more than cookie is in hppfs and that's scheduled for removal, but we need the inode to find the damn method anyway, so passing it is no extra burden. Should be RCU-safe, which is actually true for all instances. * ->follow_link() should take dentry + inode, with NULL/inode meaning "we are in RCU mode, return -ECHILD instead of doing anything you can't do in RCU pathwalk". get_link() should be basically touch atime (might throw out of RCU mode) do security_inode_follow_link(), fuck off if it fails nd->last_type = LAST_BIND; if (inode->i_link) return it, we are done if (in RCU mode) { link = ...->follow_link(NULL, inode); if (link == ERR_PTR(-ECHILD)) try to unlazy, return link if that fails else return link; } link = ...->follow_link(dentry, inode); then as in this patchset. * nd_alloc_stack() should do GFP_ATOMIC allocation in RCU mode and treat a failure as "try to unlazy, fail with ECHILD if can't do that, proceed to GFP_KERNEL allocation otherwise". * may_follow_link() should, in case of failure, check if it's in RCU mode and do terminate_walk()/fail with ECHILD instead of going into audit crap. Sure, it'll cost us a restart that will almost certainly end with hard error on the same symlink, but audit_log_link_denied() isn't usable in RCU mode. We probably could just make it use GFP_ATOMIC for allocations, but that's a separate story... * we need set_root_rcu() to store ->d_seq of root in nameidata, so that traversing an absolute symlink in RCU mode would be able to start with correct nd->seq - fetch nd->root.dentry->d_inode, check nd->root.dentry->d_seq against nd->root_seq, fail with -ECHILD on mismatch and use inode and root_seq for nd->inode / nd->seq otherwise. * put_link() should, as in Neil's patchset, have path_put() conditional on non-RCU mode. * unlazy_walk() is really unpleasant. In addition to the issues around legitimizing many vfsmounts without being allowed to drop rcu_read_lock (discussed above), we have three use cases: 1) unlazy_walk(nd, NULL). legitimize nd->path and everything already in nd->stack, from 0 to nd->depth - 1. On failure do ->put_link() for everything on stack *before* dropping rcu_read_lock(). 2) unlazy_walk() in lookup_fast(). We want to give it correct seq number (and I already have that done in local tree) and legitimize that dentry + what case 1 would do. 3) unlazy between deciding it's a symlink to follow and succeeding with ->follow_link() in get_link(). It's this case that really sucks - we want to legitimize everything we would in case 1 + path/dentry of our link. And it _can_ have a different vfsmount - we can't just rely on getting nd->path.mnt being equal to it. It's actually more like case 1 than case 2 - "legitimize nd->path and everything on stack from 0 to nd->depth, without ->put_link() in case of failure done only from 0 to nd->depth - 1 (as in case 1)". We probably should just do this: m = mnt of pending link res = __legitimize_mnt(m); if (likely(!res)) { d = dentry of pending link try to legitimize d if failed res = 2; } if (unlikely(res)) { terminate_walk(nd); if (res != 1) mntput(m); return -ECHILD; } res = unlazy_walk(nd, NULL); if (unlikely(res)) { dput(d); mntput(m); } return res; AFAICS, the above should suffice, modulo teaching ->follow_link() instances to fail with ECHILD when called in RCU mode and then teaching some of them _not_ to. Note that fast symlinks would be fine without the last part and that covers most of the actual use... -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/