Received: by 2002:ab2:6991:0:b0:1f7:f6c3:9cb1 with SMTP id v17csp754851lqo; Wed, 8 May 2024 13:59:47 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCUj1+Jgm+fHJnGtDdKM1exlo5QSj3meRGyjlhfiu4NtDzF3fn1Og4ftajtjpaC/M9Jz1UC/er1ODsaYFMPshYnwNsRgjV2ecPQ4AzmhWA== X-Google-Smtp-Source: AGHT+IHXJYFdMqX8uDDYDvfVUPMv0XUCBu+nJZlIgGopX6xtZV/BGQOnBP3h7GGZ3w8X+k2RugBX X-Received: by 2002:aa7:9a4d:0:b0:6ea:8793:6d64 with SMTP id d2e1a72fcca58-6f49c2a3a95mr4362010b3a.23.1715201987377; Wed, 08 May 2024 13:59:47 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1715201987; cv=pass; d=google.com; s=arc-20160816; b=nIxXZvKkKNDde4ZAS6psjxNgBKRAXCLIFfO/REIvo2wlXtDIjDHnaHmbh46kq1WH5Q OIJ4G9eslJFFvDikMf3M9XKoNf/T3g07B0IGCokvMw5i23FpSo4hIgNT++RDebkGDnXH +DRXi7f2ShRaUTWjflITEb3fc9NnZBbgooKSXjcCPToRiChPomIqZMzKCXddSLUvdXW6 F+hLYpOyNLLCtkTn/ewdED49aL7TUzUn67TsvmpJNmeTOjGX19KKmbxD0DqxqQAqQjyQ hGxdzh9GxF7KzeIfW2U/2pW7+rNSSBEJCcX8Z0zvmktE8ljo8OXpxVdqpIMoIH7dVLa0 Ovhg== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:mail-followup-to:message-id:subject:cc:to:from:date :dkim-signature; bh=zQT930nNUV1vsHVoX3P+3mBNNldXqcqQMqEjZfBnVWc=; fh=9hNp+pIykD1iPINZIevKGSOQcPKiQyxtJ2cZuNVfOMk=; b=JJl/bJvI2tguiSjgdO3I3o2lujJ03h/xtLxyUSebSZM31shjSd4+Zr6Aw5AELzTMzv BDLDMcj8sWa6hHBK7Io6Qlf+xKP2mpjJWpdLj4dYDX5BE9Z+RMFAfDY1MibvTSK8ZBBO UAFaNQP/9HfMJSOcN+WLJp+/HFE8ffVnkOEiYSWY0r2fCfbYlCeexlSw5oDbUiILiFT8 XvakAI0uVUhu90QJaDGPu52nMdbNkY5hI400JNvdtKbQ1JNHOTCBLa4t2jFpFzVJs3bh 6TtTnz3ydihFfce4jeHIeLJKxVgS9b+qrvprVD3pJlgYIzKfD1wfsHi0GKbM3hDpUN60 dISA==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@ffwll.ch header.s=google header.b=dYcN+kxw; arc=pass (i=1 dkim=pass dkdomain=ffwll.ch); spf=pass (google.com: domain of linux-kernel+bounces-173505-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-173505-linux.lists.archive=gmail.com@vger.kernel.org" Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [2604:1380:45e3:2400::1]) by mx.google.com with ESMTPS id d2e1a72fcca58-6f4d2aecd19si20718b3a.174.2024.05.08.13.59.47 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 08 May 2024 13:59:47 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-173505-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) client-ip=2604:1380:45e3:2400::1; Authentication-Results: mx.google.com; dkim=pass header.i=@ffwll.ch header.s=google header.b=dYcN+kxw; arc=pass (i=1 dkim=pass dkdomain=ffwll.ch); spf=pass (google.com: domain of linux-kernel+bounces-173505-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45e3:2400::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-173505-linux.lists.archive=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id C3FB6289C65 for ; Wed, 8 May 2024 15:45:56 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id B394986626; Wed, 8 May 2024 15:45:44 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b="dYcN+kxw" Received: from mail-lj1-f181.google.com (mail-lj1-f181.google.com [209.85.208.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CED80126F39 for ; Wed, 8 May 2024 15:45:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.208.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715183143; cv=none; b=asoku518VbrtWelv5SV+rldun5KZgKzmT5OMrhvq9B0VNNB7sttdU27AUOkJtfutiXSAFMbYxMdysLyaBLdL+wJ+SLEZGAv4Q6Xn8Nz6HDKBYNrclVN8ebhnzUfjG03LIqWSU+s5d8KmSXb34ACkHcH+PK/JAHmlgT+mWz9y9mo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1715183143; c=relaxed/simple; bh=Y3tAYnJhvWIsSCmSdE6nAn4Ex+kbs5Wn07R+879JUgw=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=OMtp+cX+YoWKiAeQoxZ5m7XWNFgK7AqS4n6J8sinKuLKteUWoig5OfuL6y38b2g7O0JsRDDbWeMNMTqmfCSBZLaXK9tE9ngdeT7y4hmUwSpySvEuBk9V8LKWhQmccPdMftKdFx6G9nFh3q613X8zSrJOJDryBOB+fGlTl2RiCB8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch; spf=none smtp.mailfrom=ffwll.ch; dkim=pass (1024-bit key) header.d=ffwll.ch header.i=@ffwll.ch header.b=dYcN+kxw; arc=none smtp.client-ip=209.85.208.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=ffwll.ch Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=ffwll.ch Received: by mail-lj1-f181.google.com with SMTP id 38308e7fff4ca-2e20088184cso8716941fa.1 for ; Wed, 08 May 2024 08:45:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; t=1715183140; x=1715787940; darn=vger.kernel.org; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:mail-followup-to:message-id:subject:cc:to :from:date:from:to:cc:subject:date:message-id:reply-to; bh=zQT930nNUV1vsHVoX3P+3mBNNldXqcqQMqEjZfBnVWc=; b=dYcN+kxwyl/LDgV8q+OrMUejOE3sF4If55u6Q4DDQuO14gEOV55VGR/v3fLGfCE3rl lw2rtPdcATyCKO5vaNaFY08OhDf16opdJ+PsszPFheeppq49YFnXbBqayOigcfrNHsFH s/B6i7hj0yh0wRUjuBjMjimgYxE6AmkVgfGFc= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1715183140; x=1715787940; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:references:mail-followup-to:message-id:subject:cc:to :from:date:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zQT930nNUV1vsHVoX3P+3mBNNldXqcqQMqEjZfBnVWc=; b=moBzDQLvOZ1fWaVX86sw+5TBLwbuDwaN2OqAvEX/zyKAPwE2WdRKOf5EaL2v74YSTh IRq7FebJazBeSfG1X5TpdcLfe0nU8zcT1lMpDimuctyKeLB3DFZ0Th3dsh0SrtlT5+g8 +5euifWHkVxFSqlipa2XU9/vB/B6s6N9WmNCMflloenz0APU5GSu6lWuy+vfmmsffvpP dlhTxSW8Cx7xtr5YJEeBtbLRn6l46EnF/MKwM32tcTXZZwfr6VpodfFiMe9xwPTb9D++ QnjshG6LNGJ93GDWu+sMN3DKyiJSLOrc1MV4ELSRL83MS8NW+FUtkdq1pAaEpBwednAl lAzA== X-Forwarded-Encrypted: i=1; AJvYcCW8rgIxCQzxKRTIoNxao1qjsMxw6mBUpeHBqf1CvkWcAZ4tmDITIY018vBg9dLpyEmm9SOtFIZeeijAnkTvtchhabEKS4Ia3ATRDbr6 X-Gm-Message-State: AOJu0YyMFY7z7vTIVmZj2hlKt4E1d3Be9dKIy+F15UtdUZPwOxX0C3AM X4dgOhh6xTAfItIHeep18ecSygj2OcUUd4ffO0Z4I9S/5iRL15/hiCKAenuB1pQ= X-Received: by 2002:ac2:5b84:0:b0:51f:6d6d:57b with SMTP id 2adb3069b0e04-5217d6346e2mr2041007e87.6.1715183139855; Wed, 08 May 2024 08:45:39 -0700 (PDT) Received: from phenom.ffwll.local ([2a02:168:57f4:0:efd0:b9e5:5ae6:c2fa]) by smtp.gmail.com with ESMTPSA id cq27-20020a056402221b00b005727eb1ed6asm7672318edb.68.2024.05.08.08.45.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 08 May 2024 08:45:39 -0700 (PDT) Date: Wed, 8 May 2024 17:45:37 +0200 From: Daniel Vetter To: Christian Brauner Cc: Christian =?iso-8859-1?Q?K=F6nig?= , Linus Torvalds , Al Viro , keescook@chromium.org, axboe@kernel.dk, christian.koenig@amd.com, dri-devel@lists.freedesktop.org, io-uring@vger.kernel.org, jack@suse.cz, laura@labbott.name, linaro-mm-sig@lists.linaro.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, minhquangbui99@gmail.com, sumit.semwal@linaro.org, syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com, syzkaller-bugs@googlegroups.com Subject: Re: [Linaro-mm-sig] Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes Message-ID: Mail-Followup-To: Christian Brauner , Christian =?iso-8859-1?Q?K=F6nig?= , Linus Torvalds , Al Viro , keescook@chromium.org, axboe@kernel.dk, christian.koenig@amd.com, dri-devel@lists.freedesktop.org, io-uring@vger.kernel.org, jack@suse.cz, laura@labbott.name, linaro-mm-sig@lists.linaro.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-media@vger.kernel.org, minhquangbui99@gmail.com, sumit.semwal@linaro.org, syzbot+045b454ab35fd82a35fb@syzkaller.appspotmail.com, syzkaller-bugs@googlegroups.com References: <202405031110.6F47982593@keescook> <20240503211129.679762-2-torvalds@linux-foundation.org> <20240503212428.GY2118490@ZenIV> <20240504-wohngebiet-restwert-6c3c94fddbdd@brauner> <20240508-risse-fehlpass-895202f594fd@brauner> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20240508-risse-fehlpass-895202f594fd@brauner> X-Operating-System: Linux phenom 6.6.15-amd64 On Wed, May 08, 2024 at 12:08:57PM +0200, Christian Brauner wrote: > On Mon, May 06, 2024 at 04:29:44PM +0200, Christian K?nig wrote: > > Am 04.05.24 um 20:20 schrieb Linus Torvalds: > > > On Sat, 4 May 2024 at 08:32, Linus Torvalds > > > wrote: > > > > Lookie here, the fundamental issue is that epoll can call '->poll()' > > > > on a file descriptor that is being closed concurrently. > > > Thinking some more about this, and replying to myself... > > > > > > Actually, I wonder if we could *really* fix this by simply moving the > > > eventpoll_release() to where it really belongs. > > > > > > If we did it in file_close_fd_locked(), it would actually make a > > > *lot* more sense. Particularly since eventpoll actually uses this: > > > > > > struct epoll_filefd { > > > struct file *file; > > > int fd; > > > } __packed; > > > > > > ie it doesn't just use the 'struct file *', it uses the 'fd' itself > > > (for ep_find()). > > > > > > (Strictly speaking, it should also have a pointer to the 'struct > > > files_struct' to make the 'int fd' be meaningful). > > > > While I completely agree on this I unfortunately have to ruin the idea. > > > > Before we had KCMP some people relied on the strange behavior of eventpoll > > to compare struct files when the fd is the same. > > > > I just recently suggested that solution to somebody at AMD as a workaround > > when KCMP is disabled because of security hardening and I'm pretty sure I've > > seen it somewhere else as well. > > > > So when we change that it would break (undocumented?) UAPI behavior. > > I've worked on that a bit yesterday and I learned new things about epoll > and ran into some limitations. > > Like, what happens if process P1 has a file descriptor registered in an > epoll instance and now P1 forks and creates P2. So every file that P1 > maintains gets copied into a new file descriptor table for P2. And the > same file descriptors refer to the same files for both P1 and P2. So this is pretty similar to any other struct file that has resources hanging off the struct file instead of the underlying inode. Like drm chardev files, where all the buffers, gpu contexts and everything else hangs off the file and there's no other way to get at them (except when exporting to some explicitly meant-for-sharing file like dma-buf). If you fork() that it's utter hilarity, which is why absolutely everyone should set O_CLOEXEC on these. Or EPOLL_CLOEXEC for epoll_create. For the uapi issue you describe below my take would be that we should just try, and hope that everyone's been dutifully using O_CLOEXEC. But maybe I'm biased from the gpu world, where we've been hammering it in that "O_CLOEXEC or bust" mantra since well over a decade. Really the only valid use-case is something like systemd handing open files to a service, where it drops priviledges even well before the exec() call. But we can't switch around the defaults for any of these special open files with anything more than just a current seek position as state, since that breaks uapi. -Sima > > So there's two interesting cases here: > > (1) P2 explicitly removes the file descriptor from the epoll instance > via epoll_ctl(EPOLL_CTL_DEL). That removal affects both P1 and P2 > since the pair is only registered once and it isn't > marked whether it belongs to P1 and P2 fdtable. > > So effectively fork()ing with epoll creates a weird shared state > where removal of file descriptors that were registered before the > fork() affects both child and parent. > > I found that surprising even though I've worked with epoll quite > extensively in low-level userspace. > > (2) P2 doesn't close it's file descriptors. It just exits. Since removal > of the file descriptor from the epoll instance isn't done during > close() but during last fput() P1's epoll state remains unaffected > by P2's sloppy exit because P1 still holds references to all files > in its fdtable. > > (Sidenote, if one ends up adding every more duped-fds into epoll > instance that one doesn't explicitly close and all of them refer to > the same file wouldn't one just be allocating new epitems that > are kept around for a really long time?) > > So if the removal of the fd would now be done during close() or during > exit_files() when we call close_files() and since there's currently no > way of differentiating whether P1 or P2 own that fd it would mean that > (2) collapses into (1) and we'd always alter (1)'s epoll state. That > would be a UAPI break. > > So say we record the fdtable to get ownership of that file descriptor so > P2 doesn't close anything in (2) that really belongs to P1 to fix that > problem. > > But afaict, that would break another possible use-case. Namely, where P1 > creates an epoll instance and registeres fds and then fork()s to create > P2. Now P1 can exit and P2 takes over the epoll loop of P1. This > wouldn't work anymore because P1 would deregister all fds it owns in > that epoll instance during exit. I didn't see an immediate nice way of > fixing that issue. > > But note that taking over an epoll loop from the parent doesn't work > reliably for some file descriptors. Consider man signalfd(2): > > epoll(7) semantics > If a process adds (via epoll_ctl(2)) a signalfd file descriptor to an epoll(7) instance, > then epoll_wait(2) returns events only for signals sent to that process. In particular, > if the process then uses fork(2) to create a child process, then the child will be able > to read(2) signals that are sent to it using the signalfd file descriptor, but > epoll_wait(2) will not indicate that the signalfd file descriptor is ready. In this > scenario, a possible workaround is that after the fork(2), the child process can close > the signalfd file descriptor that it inherited from the parent process and then create > another signalfd file descriptor and add it to the epoll instance. Alternatively, the > parent and the child could delay creating their (separate) signalfd file descriptors and > adding them to the epoll instance until after the call to fork(2). > > So effectively P1 opens a signalfd and registers it in an epoll > instance. Then it fork()s and creates P2. Now both P1 and P2 call > epoll_wait(). Since signalfds are always relative to the caller and P1 > did call signalfd_poll() to register the callback only P1 can get > events. So P2 can't take over signalfds in that epoll loop. > > Honestly, the inheritance semantics of epoll across fork() seem pretty > wonky and it would've been better if an epoll fd inherited across > would've returned ESTALE or EINVAL or something. And if that inheritance > of epoll instances would really be a big use-case there'd be some > explicit way to enable this. -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch