Received: by 2002:ab2:7a55:0:b0:1f4:4a7d:290d with SMTP id u21csp608826lqp; Fri, 5 Apr 2024 03:48:02 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCV/L30ZQAePIH6Edl7s+uPhwbZYeh2+MhRazITQhjmp6A9bt7kbMdQ7TdjJ375oNEneOP3SNyZt/5Rr2iumfCRR1pdjlHgBLrSE3ev80Q== X-Google-Smtp-Source: AGHT+IHI7293bcCJzX7y/6m648RSxvdSUeL1iTYVmYQ8ol9TXQO4LHeO/XPVST3GApfqGm0T27BC X-Received: by 2002:a05:6a00:180e:b0:6ea:b1f5:112b with SMTP id y14-20020a056a00180e00b006eab1f5112bmr1216134pfa.21.1712314082453; Fri, 05 Apr 2024 03:48:02 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1712314082; cv=pass; d=google.com; s=arc-20160816; b=hKj7OtV/5jo1mU7++Jb9neAmuY7/J1OlmwywfqgzbSuE8diCsKlJUQTGNJ+eFZ+wz0 qHi39aUbYb4HFQ5UdQkaxU8uc+ZuCIzqJZBQmRyAN+NqV9IRBNTFs6Cf1OD6xrXkRgB/ Jqg2W2xSjmKCph4XmdTwM70FzRyTyYQVfuYhQxbRkbKjtUCPacdPnvwYkhomJtMzErWQ rejoxTxdVmOYwFJH+tapJ9hwRj8jvWIRynSTFqWQ6eeX8BOzUsiNqGIPC4F+wlmYgRgM 7bw7uh92kw4Ll9v5zjCR8eo277TPxnRea4Vu+bM6TJdlsPhbwGLH++wRXtJOccK5F9WP ZP9g== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:message-id:subject:cc:to:from:date:dkim-signature; bh=ds5MPJAlpkYuLiOmIsj74yovjikijrDUwRuhe84qjTs=; fh=BHHAV2YVweMuB9JlP85t9iVcUTyq8s5ul3SZSr+NcOg=; b=Ff4e1VhkMWo+Rg9AUDPILwVNHa9RHf4lLW4u7f4sawEb2Q/Qmbg0tvfEwSSDBF9qgR q3NsCcVHKP7N3cKBcPMMrhd+b/+LkzLU2w2ejs7LkA5ssUSGCqKOQCEoZ8Zmhqz1HMLf 2etnCJgJmq1bLRGcKWRgzvnlne/LDbJxWlCPC1AEAnJ84mR9Uea7lvakaFKFF827wzCQ cuV67FbEExadL0XyfvzcAHMqiNOQI+MLzvsSqSyLOa5DCOVfbjb3hpzkLfMDX0VfR+C7 OaJOx+NSvoDw9FZnxYsWR9EDHU3j4wx75sWdDixXIPndO9N5Pi7QgPTn6/jdIUsrHkK7 UdhQ==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=US0mLwcY; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-132850-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-132850-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id g10-20020a056a000b8a00b006e6cc8f90f5si1174865pfj.274.2024.04.05.03.48.02 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 05 Apr 2024 03:48:02 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-132850-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=US0mLwcY; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-132850-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-132850-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 1C736282A16 for ; Fri, 5 Apr 2024 10:48:02 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 4B09016ABCE; Fri, 5 Apr 2024 10:47:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="US0mLwcY" Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 449891474DC; Fri, 5 Apr 2024 10:47:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712314072; cv=none; b=uw6sJD8nlYoboEO2k41dxlZtEeWOao4TItAeUKvp1U+2m6MOSV/fnLwhs+hX3ApKsxQIC5XBuiUTCBkyV2T10NyRBm/G+qn/fjLZS11SVAN/VdxWYVEXs57fgp8oL9YX5jDeT/hfIXpYOrk/vACH018ZjUUrRvxXZwbtagiwS+s= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712314072; c=relaxed/simple; bh=w3gv16hWhQOTk178aOx4KwMpMU7NKvm5pP8h9qn985s=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=AGzao7Uy5LyTruVXFy2pnQB3CD+Wjy/+dJjMWn+oih9kFCNvEvrRrswt5COingokq9igRSjJOIgY90TLuTCleQaC3HtI7viEf5KR/4VUQpBi3u8OFQN8GrrPT9QIn1hXVPtNhynyV5Kxx4oOvfU7cR6lEpFOBQalnv/c6i8gI8I= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=US0mLwcY; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4617EC433C7; Fri, 5 Apr 2024 10:47:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1712314071; bh=w3gv16hWhQOTk178aOx4KwMpMU7NKvm5pP8h9qn985s=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=US0mLwcYJa8P+1fVrSqw4WDCrnVvlzxpPTq4gXVT/TormCbqLVLFrX8BJflLcbdo1 vO5U3O/Jz1pJlDRYPvbLXxuohtmwp5s71p8JwuPHWErE2sbcZy2AB+r2pfynt2R3+e EhbaN4HHmyftf7S+X0Z773x8E9P2+jdSnt1ZMlrfNwPqRqaSVz2RsmKg/hgXRfLkva WasfArMM9LkyykGedaHFDX/01zUYwuR/Niux/MBGCPmR9Zl6cqUsp+epIaqorfMy1o Z/cyVGdsx3kAp0qiTzSW/HIHQg43AGP0OZlZ8cgog1wwMuYyz7MDhW/OZBHMhnmGik iojWRqhxebaGg== Date: Fri, 5 Apr 2024 12:47:45 +0200 From: Christian Brauner To: Amir Goldstein Cc: Al Viro , syzbot , gregkh@linuxfoundation.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, syzkaller-bugs@googlegroups.com, tj@kernel.org, valesini@yandex-team.ru, Christoph Hellwig , Jan Kara , Miklos Szeredi Subject: Re: [syzbot] [kernfs?] possible deadlock in kernfs_fop_llseek Message-ID: <20240405-speerwerfen-quetschen-d3de254cf830@brauner> References: <00000000000098f75506153551a1@google.com> <0000000000002f2066061539e54b@google.com> <20240404081122.GQ538574@ZenIV> <20240404082110.GR538574@ZenIV> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: On Thu, Apr 04, 2024 at 12:33:40PM +0300, Amir Goldstein wrote: > On Thu, Apr 4, 2024 at 11:21 AM Al Viro wrote: > > > > On Thu, Apr 04, 2024 at 09:11:22AM +0100, Al Viro wrote: > > > On Thu, Apr 04, 2024 at 09:54:35AM +0300, Amir Goldstein wrote: > > > > > > > > In the lockdep dependency chain, overlayfs inode lock is taken > > > > before kernfs internal of->mutex, where kernfs (sysfs) is the lower > > > > layer of overlayfs, which is sane. > > > > > > > > With /sys/power/resume (and probably other files), sysfs also > > > > behaves as a stacking filesystem, calling vfs helpers, such as > > > > lookup_bdev() -> kern_path(), which is a behavior of a stacked > > > > filesystem, without all the precautions that comes with behaving > > > > as a stacked filesystem. > > > > > > No. This is far worse than anything stacked filesystems do - it's > > > an arbitrary pathname resolution while holding a lock. > > > It's not local. Just about anything (including automounts, etc.) > > > can be happening there and it pushes the lock in question outside > > > of *ALL* pathwalk-related locks. Pathname doesn't have to > > > resolve to anything on overlayfs - it can just go through > > > a symlink on it, or walk into it and traverse a bunch of .. > > > afterwards, etc. > > > > > > Don't confuse that with stacking - it's not even close. > > > You can't use that anywhere near overlayfs layers. > > > > > > Maybe isolate it into a separate filesystem, to be automounted > > > on /sys/power. And make anyone playing with overlayfs with > > > sysfs as a layer mount the damn thing on top of power/ in your > > > overlayfs. But using that thing as a part of layer is > > > a non-starter. > > I don't follow what you are saying. > Which code is in non-starter violation? > kernfs for calling lookup_bdev() with internal of->mutex held? > Overlayfs for allowing sysfs as a lower layer and calling > vfs_llseek(lower_sysfs_file,...) during copy up while ovl inode is held > for legit reasons (e.g. from ovl_rename())? > > > > > Incidentally, why do you need to lock overlayfs inode to call > > vfs_llseek() on the underlying file? It might (or might not) > > need to lock the underlying file (for things like ->i_size, > > etc.), but that will be done by ->llseek() instance and it > > would deal with the inode in the layer, not overlayfs one. > > We do not (anymore) lock ovl inode in ovl_llseek(), see: > b1f9d3858f72 ovl: use ovl_inode_lock in ovl_llseek() > but ovl inode is held in operations (e.g. ovl_rename) > which trigger copy up and call vfs_llseek() on the lower file. > > > > > Similar question applies to ovl_write_iter() - why do you > > need to hold the overlayfs inode locked during the call of > > backing_file_write_iter()? > > > > Not sure. This question I need to defer to Miklos. > I see in several places the pattern: > inode_lock(inode); > /* Update mode */ > ovl_copyattr(inode); > ret = file_remove_privs(file); > ... > /* Update size */ > ovl_file_modified(file); > ... > inode_unlock(inode); > > so it could be related to atomic remove privs and update mtime, > but possibly we could convert all of those inode_lock() to > ovl_inode_lock() (i.e. internal lock below vfs inode lock). > > [...] > > Consider the scenario when unlink() is called on that sucker > > during the write() that triggers that pathwalk. We have > > > > unlink: blocked on overlayfs inode of file, while holding the parent directory. > > write: holding the overlayfs inode of file and trying to resolve a pathname > > that contains .../power/suspend_stats/../../...; blocked on attempt to lock > > parent so we could do a lookup in it. > > This specifically cannot happen because sysfs is not allowed as an > upper layer only as a lower layer, so overlayfs itself will not be writing to > /sys/power/resume. I don't understand that part. If overlayfs uses /sys/power/ as a lower layer it can open and write to /sys/power/resume, no? Honestly, why don't you just block /sys/power from appearing in any layer in overlayfs? This seems like such a niche use-case that it's so unlikely that this will be used that I would just try and kill it. If you do it like Al suggested and switch it to an automount you get that for free. But I guess you can also just block it without that. (Frankly, I find it weird that sysfs is allowed as a layer in any case. I completely forgot about this. Imho, both procfs and sysfs should not be usable as a lower layer - procfs is, I know - and then only select parts should be like /sys/fs/cgroup or sm where I can see the container people somehow using this to mess with the cgroup tree or something.) > > > > > No llseek involved anywhere, kernfs of->mutex held, but not contended, > > deadlock purely on ->i_rwsem of overlayfs inodes. > > > > Holding overlayfs inode locked during the call of lookup_bdev() is really > > no-go. > > Yes. I see that, but how can this be resolved? > > If the specific file /sys/power/resume will not have FMODE_LSEEK, > then ovl_copy_up_file() will not try to skip_hole on it at all and the > llseek lock dependency will be averted. > > The question is whether opt-out of FMODE_LSEEK for /sys/power/resume > will break anything or if we should mark seq files in another way so that > overlayfs would not try to seek_hole on any of them categorically. > > Thanks, > Amir.