Received: by 2002:ac0:bc90:0:0:0:0:0 with SMTP id a16csp3032800img; Mon, 25 Mar 2019 02:16:56 -0700 (PDT) X-Google-Smtp-Source: APXvYqwkdpPCYAwLiiis95s1sq5nyoDGlnoDvysMvO0lZ8kzXpTKqN+73OTSGfg3pcaPINBhWgIq X-Received: by 2002:a63:2aca:: with SMTP id q193mr22456789pgq.269.1553505416685; Mon, 25 Mar 2019 02:16:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1553505416; cv=none; d=google.com; s=arc-20160816; b=hXQQHhwZBcwVIfIYF5GEkefOto+tLTWc1UU26NvoUOPOLcqNe7dt/CM633Koy5P6ac mr4pWOAkP1i0OsgPMjA2UZ8K4ZmySBnfqvUPoxtJ38FahXtX0tNU3ft5K3XsaY7J6Jeo hYYj0h35PAjn8dAnjY+jS+6ZGbT5y+wl6QctzS/hsiEo8Muzwv2LjZhzVYzh6SY3G9DC x70J7Yj0/7GT99knidEZjvSzwOCof2h7YOJnV22fmPZ6kSIOGCcsaBMpKnLs97rH2UaS VEOpA3heQNTnd9cqExo4RIL/lwZ0Krd/G/lI7eY3UNjxO3eFNEf5qcgX4VnMdsYQU6a4 arxA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=JVz8IfivBc+5yHkOxSZKCfePC+QnjtCd25fyKDv5UQM=; b=F3gxyNj0Jw6Cl/nVdzzgzEV+G5z0O/2aleGUsCI/nXxFkvT3fCBlOrfSBsNpg2/vgZ 0WTMhtMwTCPfE5FniitEaH5XpwKxZmUJz+Lo+SUoIN8wuxDJDs0gAsNTcFnMtVG9vgEN SytBR2oUJz8HXFmwVCc6Y7Us+l4htSHTRb3L5gLTWa4s8o341etEE/awGIjp9VmJ9RKA FnVRwDjHln6WLLhEQq3Hhfx5cixHL+eCrqUANWmDvQStqcFBLK0Ch2DxmYqn/757euQs ySy3+CqqduvwvpTPmf9Y/szuNbfamdTUe/sT/HiJAI6is4tq27TXbYnEHCN5QHZauE2O 9fFA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id f20si430250pfd.51.2019.03.25.02.16.42; Mon, 25 Mar 2019 02:16:56 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730253AbfCYJPr (ORCPT + 99 others); Mon, 25 Mar 2019 05:15:47 -0400 Received: from www62.your-server.de ([213.133.104.62]:36888 "EHLO www62.your-server.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730076AbfCYJPr (ORCPT ); Mon, 25 Mar 2019 05:15:47 -0400 Received: from [78.46.172.3] (helo=sslproxy06.your-server.de) by www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89_1) (envelope-from ) id 1h8Lhm-0003Zi-55; Mon, 25 Mar 2019 10:15:42 +0100 Received: from [178.197.248.24] (helo=linux.home) by sslproxy06.your-server.de with esmtpsa (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384:256) (Exim 4.89) (envelope-from ) id 1h8Lhl-000PoA-U8; Mon, 25 Mar 2019 10:15:42 +0100 Subject: Re: KASAN: use-after-free Read in path_lookupat To: Al Viro , Linus Torvalds Cc: syzbot , Alexei Starovoitov , linux-fsdevel , Linux List Kernel Mailing , syzkaller-bugs References: <0000000000006946d2057bbd0eef@google.com> <20190325045744.GK2217@ZenIV.linux.org.uk> From: Daniel Borkmann Message-ID: <717eaff9-1f49-28f3-bbb6-c8f14ebbe03b@iogearbox.net> Date: Mon, 25 Mar 2019 10:15:40 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <20190325045744.GK2217@ZenIV.linux.org.uk> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Authenticated-Sender: daniel@iogearbox.net X-Virus-Scanned: Clear (ClamAV 0.100.2/25399/Mon Mar 25 08:46:48 2019) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Al, hi Linus, On 03/25/2019 05:57 AM, Al Viro wrote: > On Sun, Mar 24, 2019 at 06:23:24PM -0700, Linus Torvalds wrote: > >> Al, comments? At the very least, if we don't make >> simple_symlink_inode_operations() do that, we should have a *big* >> comment that if it's not part of the inode data, it needs to be >> RCU-delayed. > > simple_symlink_inode_operations is red herring here - what matters > is ->i_link being set; those have ->get_link == simple_get_link, > but note that it is *not* called: > res = inode->i_link; > if (!res) { > const char * (*get)(struct dentry *, struct inode *, > struct delayed_call *); > get = inode->i_op->get_link; > if (nd->flags & LOOKUP_RCU) { > res = get(NULL, inode, &last->done); > if (res == ERR_PTR(-ECHILD)) { > if (unlikely(unlazy_walk(nd))) > return ERR_PTR(-ECHILD); > res = get(dentry, inode, &last->done); > } > } else { > res = get(dentry, inode, &last->done); > } > if (IS_ERR_OR_NULL(res)) > return res; > } > for traversal and similar for readlink(2). And we certainly don't want > to allocate copies in those cases - it would fuck RCU traversals for > all fast symlinks (i.e. for the majority of symlinks out there). > > Actual situation: > > * shmem, erofs: OK, kfree() from the thing ->destroy_inode() is calling via > call_rcu(). > * befs, ext2, ext4, freevxfs, jfs, orangefs, ufs: OK, coallocated with inode > * debugfs: broken > * jffs2: broken, freeing of f->target should be moved to jffs2_i_callback(). > * ubifs: broken, ought to move kfree(ui->data); from ubifs_destroy_inode() to > ubifs_i_callback() > * ceph: broken, needs to move kfree(ci->symlink) from ceph_destroy_inode() > to ceph_i_callback(). > * bpf: broken > > So we have 5 broken cases, all with the same kind of fix: move freeing > into the RCU-delayed part of ->destroy_inode(); for debugfs and bpf > that requires adding ->alloc_inode()/->destroy_inode(), rather than > relying upon the defaults from fs/inode.c I believe I copied that logic from one of the other fs'es back then, sigh. Thanks for the analysis and hints for fixing. Would the below (only compile tested for now) look reasonable to you? I believe there is no other way around than to add our own inode cache, but so be it. The freeing of the i_link is then RCU-deferred in bpf_destroy_inode_deferred(): kernel/bpf/inode.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 52 insertions(+), 5 deletions(-) diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index 2ada5e2..69fcaf6 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -80,6 +80,47 @@ static const struct inode_operations bpf_dir_iops; static const struct inode_operations bpf_prog_iops = { }; static const struct inode_operations bpf_map_iops = { }; +static struct kmem_cache *bpf_inode_cache __read_mostly; + +struct inode *bpf_alloc_inode(struct super_block *sb) +{ + return kmem_cache_alloc(bpf_inode_cache, GFP_KERNEL); +} + +static void bpf_destroy_inode_deferred(struct rcu_head *head) +{ + struct inode *inode = container_of(head, struct inode, i_rcu); + + if (S_ISLNK(inode->i_mode)) + kfree(inode->i_link); + kmem_cache_free(bpf_inode_cache, inode); +} + +static void bpf_destroy_inode(struct inode *inode) +{ + call_rcu(&inode->i_rcu, bpf_destroy_inode_deferred); +} + +static void bpf_inode_init_once(void *inode) +{ + inode_init_once(inode); +} + +static int bpf_init_inode_cache(void) +{ + bpf_inode_cache = kmem_cache_create("bpf_inode_cache", + sizeof(struct inode), 0, + (SLAB_RECLAIM_ACCOUNT| + SLAB_MEM_SPREAD|SLAB_ACCOUNT), + bpf_inode_init_once); + return !bpf_inode_cache ? -ENOMEM : 0; +} + +static void bpf_destroy_inode_cache(void) +{ + kmem_cache_destroy(bpf_inode_cache); +} + static struct inode *bpf_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode) @@ -561,8 +602,6 @@ static void bpf_evict_inode(struct inode *inode) truncate_inode_pages_final(&inode->i_data); clear_inode(inode); - if (S_ISLNK(inode->i_mode)) - kfree(inode->i_link); if (!bpf_inode_type(inode, &type)) bpf_any_put(inode->i_private, type); } @@ -580,6 +619,8 @@ static int bpf_show_options(struct seq_file *m, struct dentry *root) } static const struct super_operations bpf_super_ops = { + .alloc_inode = bpf_alloc_inode, + .destroy_inode = bpf_destroy_inode, .statfs = simple_statfs, .drop_inode = generic_delete_inode, .show_options = bpf_show_options, @@ -671,13 +712,19 @@ static int __init bpf_init(void) { int ret; - ret = sysfs_create_mount_point(fs_kobj, "bpf"); + ret = bpf_init_inode_cache(); if (ret) return ret; - + ret = sysfs_create_mount_point(fs_kobj, "bpf"); + if (ret) { + bpf_destroy_inode_cache(); + return ret; + } ret = register_filesystem(&bpf_fs_type); - if (ret) + if (ret) { sysfs_remove_mount_point(fs_kobj, "bpf"); + bpf_destroy_inode_cache(); + } return ret; } -- 2.9.5 >> Or maybe we could add a final inode callback function for "rcu_free" >> that is called as the RCU-delayed freeing of the inode itself happens? >> And then people could hook into that for freeing the inode->i_link >> data. > > You mean, split ->destroy_inode() into immediate and RCU-delayed parts? > There are filesystems where both parts are non-empty - we can't just > switch all ->destroy_inode() work to call_rcu(). > >> So many choices.. But the current situation seems unnecessarily >> complex for the filesystem, and isn't really documented. >> >> Our documentation currently says for get_link(): "If the body won't go >> away until the inode is gone, nothing else is needed", which is wrong >> (or at least very misleading, since the last "inode is gone" callback >> we have is that evict() function). > > s/inode is gone/struct inode is freed/, but it's obviously not clear > enough. >