Received: by 2002:ab2:7b86:0:b0:1f7:5705:b850 with SMTP id q6csp1427240lqh; Mon, 6 May 2024 07:29:23 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCVNR8D2ZHFOAjD0G7GrWxSwVQ5F7w9g1YIr8YO0TSN0VWqZIujgWPptg8UlOOPOGOp2/+pODEDCABZraNsYyNt145cxr2qdUHd5BRA2AQ== X-Google-Smtp-Source: AGHT+IGR/2OByEpPl1bxqSTcI0LV8jnM5S8F+KDAy4WFqYpf8GIkqVyS4d0rNhEpgHKxJFyBTdbq X-Received: by 2002:a17:902:f649:b0:1e2:bf94:487 with SMTP id m9-20020a170902f64900b001e2bf940487mr13023236plg.57.1715005760686; Mon, 06 May 2024 07:29:20 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1715005760; cv=pass; d=google.com; s=arc-20160816; b=ONABWtUNRdxLtZ7dtKbfSf7k+dW7625BWsfwPtEmw9JF2RuCyxBo1DIxfJYwkBpJnF aQMiwhlRn7ID9bYplUd5D4kOolczHvLGXJvLtboHhbe+EbWS9l2a+VgAEEgyAFIeQmDs b1Zgd5JQnZa7LbuC7gxKrZmrDskvMyrOpF+Xc0cg9QPN0dYnVb72jxZgmkn3VJgAhDZj pMa9m8OfKfE0UnmlwVobZ/4PgHzicJi7rudKiO3ddakWwVWvnChXbp4DXk+Na3DZGl3h y68K8q/CNOjMIm1skqgxC5N+Da3cIocqPa9IMKn+Al+wliH+v0d5Yrt6rVAkBYI5B/D8 c+Gw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:message-id:subject:cc :to:from:date:dkim-signature; bh=L6iRPUnNzCkh1rezJ+RK/X3Fo6geMOmfh+umokqmfic=; fh=K+u6k4LeW50YAlFBITh17z6/Kw8sjFmvdXtpOlBV5c8=; b=mQ1lBIlcwn9wXXc33277q5jQgUkwd013GTKEwlFeRhBv01ov4zxta7oGoL0ORs7Irq iFW1J4/KLvNKFjHgYzoJLYj53qWYV47aLrLVBswmwedcr1jFPCZ32abszQt1Pwsey+1r nna0biV83QbPMyEOskYRcbWFxi/QXaXBfm3yACysb9bsBI5NwXKLEZsjd+uIwWbmwbwh ajElh1on9Q7qCe76EgMU82tcsFZRQujRi9BnmNfM1iR9czNSJDIbc4o+DOKhBiG4HUvJ xShN2yhavU3hphEp4csE9pB/1L0/jMstVYm5O/uKcgArsTuhU2wqS9sgLTMHs61g2FXP 7MWQ==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=XQUTHXDJ; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-169974-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-169974-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id p1-20020a170902c70100b001eb5235450fsi4275710plp.298.2024.05.06.07.29.20 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 May 2024 07:29:20 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-169974-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=XQUTHXDJ; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-169974-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-169974-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 7C5822858B7 for ; Mon, 6 May 2024 14:22:19 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 3DDBD13D510; Mon, 6 May 2024 14:19:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XQUTHXDJ" Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3660313D297; Mon, 6 May 2024 14:19:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715005154; cv=none; b=MM+5V3nQ/pqO5pLAYWW2K0yman1OueOO6+V0xldP7/JEESE91g5TJQN4AJXBHjmUF5MMktkGWdxmyOQ3H/0nw90w58zXH2Mu2LGqTVeuq6KkKboH8ASsWUjgc8pcPL5jlB7puo/mOWvzhqmv37LvQLwUKxR+xYCLMVlJSci72L0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715005154; c=relaxed/simple; bh=l39k6V0I7wZACUt8VAIAhxRDJ0/vIbV1E6k1IrSGtK0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=uMgSkPAzvWeK575Mo97/8lGiY4YKM9tLOxGWR4t1/IUWgiEzBc1YMK19Me8eVsUw7XSdyfDtz1Q9yxjyO467xC6pYrWbwmVh3L+AujLm1j3riOE3QdDXy/oCFK5fMnA8qCWMZ3TQEd9wExKfwZ8P5V4SOAShdU3kXyhUQKPz9rM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XQUTHXDJ; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id 70FA1C116B1; Mon, 6 May 2024 14:19:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1715005153; bh=l39k6V0I7wZACUt8VAIAhxRDJ0/vIbV1E6k1IrSGtK0=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=XQUTHXDJ6U4b3MW5pJhdFcgCJ8GRlt0WUo07V6/eFe4FkmlpEPSUBRp9vzJif66tz 48bynhPLwPbGqNMjPjj6OCh966S+vS0L+TkmKifUTt5sqDR6RUpnaGxjEj7U0EF0V9 qXKKSVwP8jp/jZeB3jCnG0uVdSbcj857lvLjY9a3itUWr2kcoFj+c6vnZ8m+ECCUAz kbUkpPgsM/Pww3u4arhcLXvQW5lj1M6fvyJbgQYolNLKm2LesyooMdJvsfHZae8613 tNt1E2bagVx7r2Gbuqaow7JCziJ74q3PJGx+P71AqPMmgTbDhK1CamdAE2gqEK620F lC4wZIfMVgufQ== Date: Mon, 6 May 2024 16:19:06 +0200 From: Christian Brauner To: Linus Torvalds Cc: Al Viro , keescook@chromium.org, axboe@kernel.dk, christian.koenig@amd.com, dri-devel@lists.freedesktop.org, io-uring@vger.kernel.org, jack@suse.cz, laura@labbott.name, linaro-mm-sig@lists.linaro.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, minhquangbui99@gmail.com, sumit.semwal@linaro.org, syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com, syzkaller-bugs@googlegroups.com Subject: Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes Message-ID: <20240506-zerbrach-zehnkampf-80281b1452c8@brauner> References: <20240503211129.679762-2-torvalds@linux-foundation.org> <20240503212428.GY2118490@ZenIV> <20240504-wohngebiet-restwert-6c3c94fddbdd@brauner> <20240505-gelehnt-anfahren-8250b487da2c@brauner> <20240506-injizieren-administration-f5900157566a@brauner> <20240506-zweibeinig-mahnen-daa579a233db@brauner> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20240506-zweibeinig-mahnen-daa579a233db@brauner> On Mon, May 06, 2024 at 11:27:04AM +0200, Christian Brauner wrote: > On Mon, May 06, 2024 at 10:45:35AM +0200, Christian Brauner wrote: > > > The fact is, it's not dma-buf that is violating any rules. It's epoll. > > > > I agree that epoll() not taking a reference on the file is at least > > unexpected and contradicts the usual code patterns for the sake of > > performance and that it very likely is the case that most callers of > > f_op->poll() don't know this. > > > > Note, I cleary wrote upthread that I'm ok to do it like you suggested > > but raised two concerns a) there's currently only one instance of > > prolonged @file lifetime in f_op->poll() afaict and b) that there's > > possibly going to be some performance impact on epoll(). > > > > So it's at least worth discussing what's more important because epoll() > > is very widely used and it's not that we haven't favored performance > > before. > > > > But you've already said that you aren't concerned with performance on > > epoll() upthread. So afaict then there's really not a lot more to > > discuss other than take the patch and see whether we get any complaints. > > Two closing thoughts: > > (1) I wonder if this won't cause userspace regressions for the semantics > of epoll because dying files are now silently ignored whereas before > they'd generated events. > > (2) The other part is that this seems to me that epoll() will now > temporarly pin filesystems opening up the possibility for spurious > EBUSY errors. > > If you register a file descriptor in an epoll instance and then > close it and umount the filesystem but epoll managed to do an fget() > on that fd before that close() call then epoll will pin that > filesystem. > > If the f_op->poll() method does something that can take a while > (blocks on a shared mutex of that subsystem) that umount is very > likely going to return EBUSY suddenly. > > Afaict, before that this wouldn't have been an issue at all and is > likely more serious than performance. > > (One option would be to only do epi_fget() for stuff like > dma-buf that's never unmounted. That'll cover nearly every > driver out there. Only "real" filesystems would have to contend with > @file count going to zero but honestly they also deal with dentry > lookup under RCU which is way more adventurous than this.) > > Maybe I'm barking up the wrong tree though. Sorry, had to step out for an appointment. Under the assumption that I'm not entirely off with this - and I really could be ofc - then one possibility would be that we enable persistence of files from f_op->poll() for SB_NOUSER filesystems. That'll catch everything that's relying on anonymous inodes (drm and all drivers) and init_pseudo() so everything that isn't actually unmountable (pipefs, pidfs, sockfs, etc.). So something like the _completely untested_ diff on top of your proposal above: diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a3f0f868adc4..95968a462544 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1018,8 +1018,24 @@ static struct file *epi_fget(const struct epitem *epi) static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, int depth) { - struct file *file = epi_fget(epi); + struct file *file = epi->ffd.file; __poll_t res; + bool unrefd = false; + + /* + * Taking a reference for anything that isn't mountable is fine + * because we don't have to worry about spurious EBUSY warnings + * from umount(). + * + * File count might go to zero in f_op->poll() for mountable + * filesystems. + */ + if (file->f_inode->i_sb->s_flags & SB_NOUSER) { + unrefd = true; + file = epi_fget(epi); + } else if (file_count(file) == 0) { + file = NULL; + } /* * We could return EPOLLERR | EPOLLHUP or something, @@ -1034,7 +1050,9 @@ static __poll_t ep_item_poll(const struct epitem *epi, poll_table *pt, res = vfs_poll(file, pt); else res = __ep_eventpoll_poll(file, pt, depth); - fput(file); + + if (unrefd) + fput(file); return res & epi->event.events; } Basically, my worry is that we end up with really annoying to debug EBUSYs caused by epoll(). I'd really like to avoid that. But again, I might be wrong and this isn't an issue.