Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754889AbbEKSG7 (ORCPT ); Mon, 11 May 2015 14:06:59 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:46862 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754872AbbEKSGw (ORCPT ); Mon, 11 May 2015 14:06:52 -0400 Date: Mon, 11 May 2015 19:06:50 +0100 From: Al Viro To: Linus Torvalds Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, Christoph Hellwig , Neil Brown Subject: [RFC][PATCHSET v3] non-recursive pathname resolution & RCU symlinks Message-ID: <20150511180650.GA4147@ZenIV.linux.org.uk> References: <20150505052205.GS889@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150505052205.GS889@ZenIV.linux.org.uk> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12778 Lines: 265 > There are several things in that queue; most of it is fs/namei.c > rewrite and closer to the end we get > * non-recursive link_path_walk() > * the only symlink-related limitation is that there should be no > more than 40 symlinks traversals, no matter how nested they are. > * stack footprint independent from the nesting and about on par with > mainline in "no symlinks" case. > * some work towards the symlink traversal without dropping from > RCU mode; not complete, but much better starting point than the current > mainline. > * IMO more readable logics around the pathname resolution in > general, but then I'd spent the last couple of weeks messing with that code, > so my perception might be seriously warped. ... and in addition to that - symlink traversal without dropping out of RCU mode. For now - only the fast symlinks, but extending that to symlinks with non-trivial ->follow_link() is pretty straightforward. > Most of that stuff is mine, with exception of 3 commits from > Neil's series (with some modifications). I've marked those with [N] in > the list below. 6 patches from Neil's series now, plus one from David Howells (marked [D]). > It really needs review and more testing; it *does* pass xfstests, > LTP and local tests with no regressions, but at the very least it'll need > careful profiling, as well as more eyes to read it through. The whole > thing is based at -rc1. Individual patches are in followups; those who > prefer to access it via git can fetch it at > git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git link_path_walk Same branch, rebased on -rc3. Still passes all tests with no regressions. Balance is at about -0.2KLoC... Please, read and comment. Series layout is pretty much unchanged - the first 76 commits (out of 110 ;-/) are fairly close to what had been posted in the previous series. * assorted preliminary patches: [1/110] 9p: don't bother with 4K allocation for 24-byte local array... [2/110] 9p: don't bother with __getname() in ->follow_link() [3/110][N] ovl: rearrange ovl_follow_link to it doesn't need to call ->put_link [4/110] ext4: split inode_operations for encrypted symlinks off the rest * getting rid of boilerplate for inline symlinks [5/110] libfs: simple_follow_link() [6/110] ext2: use simple_follow_link() [7/110] befs: switch to simple_follow_link() [8/110] ext3: switch to simple_follow_link() [9/110] ext4: switch to simple_follow_link() [10/110] jffs2: switch to simple_follow_link() [11/110] shmem: switch to simple_follow_link() [12/110] debugfs: switch to simple_follow_link() [13/110] ufs: switch to simple_follow_link() [14/110] ubifs: switch to simple_follow_link() [15/110] sysv: switch to simple_follow_link() [16/110] jfs: switch to simple_follow_link() [17/110] freevxfs: switch to simple_follow_link() [18/110] exofs: switch to {simple,page}_symlink_inode_operations [19/110] ceph: switch to simple_follow_link() * more assorted preparations: [20/110] logfs: fix a pagecache leak for symlinks [21/110][N] SECURITY: remove nameidata arg from inode_follow_link. [22/110] uninline walk_component() * getting trailing symlinks handling in do_last() into saner shape: [23/110] namei: take O_NOFOLLOW treatment into do_last() [24/110] do_last: kill symlink_ok [25/110] do_last: regularize the logics around following symlinks * getting rid of nameidata on stack during ->rmdir()/->unlink() and double nameidata during ->rename(): [26/110] namei: get rid of lookup_hash() [27/110] name: shift nameidata down into user_path_walk() * single instance of nameidata across the adjacent path_mountpoint() calls: [28/110] namei: lift nameidata into filename_mountpoint() At that point preparations end. The next one changes the calling conventions for ->follow_link/->put_link. We still are passing nameidata to ->follow_link(), but almost all instances do not use it anymore. Moreover, nd->saved_names[] is gone and so's nameidata in generic_readlink(). That one has the biggest footprint outside of fs/namei.c - it's a flagday change. [29/110] new ->follow_link() and ->put_link() calling conventions * some preparations for link_path_walk() massage: [30/110] namei.c: separate the parts of follow_link() that find the link body [31/110] namei: don't bother with ->follow_link() if ->i_link is set [32/110] namei: introduce nameidata->link [33/110] do_last: move path there from caller's stack frame * link_path_walk()/follow_link() rewrite. It's done in small steps, getting the damn thing into form that would allow to get rid of recursion: [34/110] namei: expand nested_symlink() in its only caller [35/110] namei: expand the call of follow_link() in link_path_walk() [36/110] namei: move the calls of may_follow_link() into follow_link() [37/110] namei: rename follow_link to trailing_symlink, move it down [38/110] link_path_walk: handle get_link() returning ERR_PTR() immediately [39/110] link_path_walk: don't bother with walk_component() after jumping link [40/110] link_path_walk: turn inner loop into explicit goto [41/110] link_path_walk: massage a bit more [42/110] link_path_walk: get rid of duplication [43/110] link_path_walk: final preparations to killing recursion * dumb and staightforward "put all variables we needed to be separate in each frame into an array of structs" killing of recursion. Array is still on stack frame of link_path_walk(), we are still with the same depth limits, etc. All of that will change shortly afterwards. [44/110] link_path_walk: kill the recursion * post-operation cleanup: [45/110] link_path_walk: split "return from recursive call" path [46/110] link_path_walk: cleanup - turn goto start; into continue; * array moves into nameidata, with extra (0th) element used instead of link/cookie pairs of local variables we used to have in trailing symlink loops: [47/110] namei: move link/cookie pairs into nameidata * more post-op cleanup - arguments of several functions are always the fields of the same array element or pointers to such: [48/110] namei: trim redundant arguments of trailing_symlink() [49/110] namei: trim redundant arguments of fs/namei.c:put_link() [50/110] namei: trim the arguments of get_link() * trim the array in nameidata to a couple of elements, use it until we need more and allocate a big (41-element) array when needed. Result: no more nesting limitations. [51/110] namei: remove restrictions on nesting depth * the use of array is seriously counterintuitive - for everything except the last component we are _not_ using the first element of array; moreover, way too many places are manipulating nd->depth. I've tried to sanitize all of that in one patch, but after several failures I ended up splitting the massage into smaller steps. The composition of those has ended up fairly small, but convincing yourself that it's correct will mean reporoducing this splitup ;-/ [52/110] link_path_walk: nd->depth massage, part 1 [53/110] link_path_walk: nd->depth massage, part 2 [54/110] link_path_walk: nd->depth massage, part 3 [55/110] link_path_walk: nd->depth massage, part 4 [56/110] trailing_symlink: nd->depth massage, part 5 [57/110] get_link: nd->depth massage, part 6 [58/110] trailing_symlink: nd->depth massage, part 7 [59/110] put_link: nd->depth massage, part 8 [60/110] link_path_walk: nd->depth massage, part 9 [61/110] link_path_walk: nd->depth massage, part 10 [62/110] link_path_walk: end of nd->depth massage [63/110] namei: we never need more than MAXSYMLINKS entries in nd->stack * getting more regular rules for calling terminate_walk()/put_link(). The reason for that part is (besides getting simpler logics in the end) that with RCU-symlink series we'll really need to have terminate_walk() trigger put_link() on everything we have on stack: [64/110] namei: lift (open-coded) terminate_walk() in follow_dotdot_rcu() into callers [65/110] lift terminate_walk() into callers of walk_component() [66/110] namei: lift (open-coded) terminate_walk() into callers of get_link() [67/110] namei: take put_link() into {lookup,mountpoint,do}_last() [68/110] namei: have terminate_walk() do put_link() on everything left [69/110] link_path_walk: move the OK: inside the loop * more put_link() work - basically, we are consolidating the "we decide that we'd found a symlink that needs to be followed" logics. Again, that's one of the areas RCU-symlink will need to modify: [70/110] namei: new calling conventions for walk_component() [71/110] namei: make should_follow_link() store the link in nd->link [72/110] namei: move link count check and stack allocation into pick_link() Next 4 commits are about not passing nameidata to ->follow_link() - it's the same situation as with pt_regs and IRQ handlers and the same kind of solution. Only 2 instances out of 26 use nameidata at all, and only to pass it to nd_jump_link(); seeing that we already have the "begin using this nameidata instance/end using it" pairs bracketing the areas where thread works with given instance (we need that for sane freeing of on-demand allocated nd->stack), the solution is obvious... [73/110] lustre: rip the private symlink nesting limit out [74/110][N] VFS: replace {, total_}link_count in task_struct with pointer to nameidata [75/110] namei: simplify the callers of follow_managed() [76/110] don't pass nameidata to ->follow_link() That's where the parts common with v2 end. The rest is new. get_link() reorganization - make it handle the jump-to-root and return the remaining part of link to follow, with "/" treated as we do treat pure jumps (i.e. return NULL). Adjust callers (sucking a lot of duplication into get_link()). [77/110] namei: simplify failure exits in get_link() [78/110] namei: simpler treatment of symlinks with nothing other that / in the body [79/110] namei: take the treatment of absolute symlinks to get_link() convert complete_walk() to use of unlazy_walk()/terminate_walk(): [80/110] namei: fold put_link() into the failure case of complete_walk() the next bunch takes pushing the link on stack from get_link() to pick_link(): [81/110] namei: move bumping the refcount of link->mnt into pick_link() [82/110] may_follow_link(): trim arguments [83/110] namei: kill nd->link [84/110] namei: take increment of nd->depth into pick_link() Lifting terminate_walk() and link_path_walk() calls up, reorganization and cleanup of top-level loops: [85/110] namei: may_follow_link() - lift terminate_walk() on failures into caller [86/110] namei: split off filename_lookupat() with LOOKUP_PARENT [87/110] namei: get rid of nameidata->base [88/110] namei: path_init() calling conventions change [89/110] namei: lift link_path_walk() call out of trailing_symlink() [90/110] namei: lift terminate_walk() all the way up [91/110] link_path_walk: use explicit returns for failure exits Preparations for RCU work: we need to stop manging nd->seq when finding a symlink and we'll need to keep inode as well as vfsmount/dentry in the stack: [92/110] namei: explicitly pass seq number to unlazy_walk() when dentry != NULL [93/110] namei: don't mangle nd->seq in lookup_fast() [94/110] namei: store inode in nd->stack[] [95/110][D] VFS: Handle lower layer dentry/inode in pathwalk [96/110] namei: pick_link() callers already have inode More prep work: RCU-safe Linux S&M interaction (that one's entirely Neil's) [97/110][N] security/selinux: pass 'flags' arg to avc_audit() and avc_has_perm_flags() [98/110][N] security: make inode_follow_link RCU-walk aware minor ->put_link() API change: [99/110] switch ->put_link() from dentry to inode [100/110] new helper: free_page_put_link() More prep work: put_link() and may_follow_link() made RCU-safe [101/110] namei: make put_link() RCU-safe [102/110] namei: make may_follow_link() safe in RCU mode Legitimizing nd->stack[] on unlazy_walk(): [103/110] new helper: __legitimize_mnt() [104/110] namei: store seq numbers in nd->stack[] [105/110] namei: make unlazy_walk and terminate_walk handle nd->stack, add unlazy_link ... and payoff - shift unlazying down to get_link() and narrow it to the "we don't have ->i_link and have to call the actual method" case: [106/110] namei: don't unlazy until get_link() [107/110][N] VFS/namei: make the use of touch_atime() in get_link() RCU-safe. [108/110] enable passing fast relative symlinks without dropping out of RCU mode [109/110] namei: handle absolute symlinks without dropping out of RCU mode documenting the API changes so far: [110/110] update Documentation/filesystems/ regarding the follow_link/put_link changes -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/