Received: by 2002:a05:6a10:1d13:0:0:0:0 with SMTP id pp19csp1004278pxb; Wed, 1 Sep 2021 15:03:46 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyHWVD31d1ZG7sTPqKoe0ouN8vjXz5jt5OAkQONiQvmpeaTffN2ym4WW+cPj3hFvq70l5Bt X-Received: by 2002:a5e:d712:: with SMTP id v18mr12308iom.65.1630533826048; Wed, 01 Sep 2021 15:03:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1630533826; cv=none; d=google.com; s=arc-20160816; b=qeOqp/CbU2rfk4UBoIhyYBQkS69gWQuH68yQWGhlqFlMHKYhpOa6IDjEwQ15gpamBw //Mg58aVf4YwquBlmsbaRpyQh6dUq8MvZuRRx+do/FQQcBBKAOzESWeMOqQNFMP301B5 kRBXmOEhOo0sOQP0Wcr1clrUz9LbyfrYBwtuUgR5NG3MW0icg/RJz8DYu1Rs4deeQxC8 QwmJxV+/vweTWUnen6ZgaZU6VU/cQWQYjdszigv60u3RnUI+Inj6cyW9Zr98oYhqFN33 BoiQ3dINRponIJ6c3HROfwwezPTsYUfyFIkzyO87RwaHSj6BZomQHhP957MykMVgkSHP xNYw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:message-id:date:subject:cc:to :from:dkim-signature; bh=LI8w+Rmvkpo91dSVwPvNFWy+Y1gEu+PZ7+yPnULDXrM=; b=hgTi2c7tcOn3XPXMDHPM4L7smg4ljAWzPms2KLbni4IDgq8pncA+zZD4h9U0EkDDCz lq582P7k9LnUatoD6pfgRlG1eWKlBQNmKgaTGEsNop6KNWnutoZaOfO5qB/berh2cKb4 t7NhdMUTEolG3enrbbDDsCTbLmJ96DyKcEdexy+kDW0g4eTW6FAInhKWBCmXnwdJjWuF mtEQCjKBO2UWsPH3N6x6kqU8hZxmR6sKRLVeSNSVOfxzIoFuL9AN6J/mlDLMgLJNgVLr XZGnTUUgkw9YLMrstqpnql7CIO3WZkj5IgtZ774rnCgY6FSHBneCbjPHvce7TIfcEVR7 KEbw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=anyUQDZ1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id c18si710596ioz.59.2021.09.01.15.03.34; Wed, 01 Sep 2021 15:03:46 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=anyUQDZ1; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1343597AbhIAMn4 (ORCPT + 99 others); Wed, 1 Sep 2021 08:43:56 -0400 Received: from mail.kernel.org ([198.145.29.99]:41872 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1344716AbhIAMj5 (ORCPT ); Wed, 1 Sep 2021 08:39:57 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 8A0106112F; Wed, 1 Sep 2021 12:35:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1630499753; bh=lZwH8zxKEBCqM8hgNeOLnFpdLMREYnKK2RrxL4Hem1w=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=anyUQDZ1sXXVZkCAH+q8YVLc6SsxY1Ha/+G63y06nh7cH3diLZGA+CyEZMN1cLVci uh2Nf2ohN0dCYKy9nO0rkJQ74ZgTiCax1V9pKZs4lZqSJ7WTRDMLi1Fbihgj8iLSJb ssJE9LgLdUl+BHOOPVRFAKFJVmGXkmc+C4ctkRPQ= From: Greg Kroah-Hartman To: linux-kernel@vger.kernel.org Cc: Greg Kroah-Hartman , stable@vger.kernel.org, kernel test robot , Eric Biederman , Colin Ian King , Linus Torvalds Subject: [PATCH 5.10 073/103] pipe: do FASYNC notifications for every pipe IO, not just state changes Date: Wed, 1 Sep 2021 14:28:23 +0200 Message-Id: <20210901122303.022351907@linuxfoundation.org> X-Mailer: git-send-email 2.33.0 In-Reply-To: <20210901122300.503008474@linuxfoundation.org> References: <20210901122300.503008474@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Linus Torvalds commit fe67f4dd8daa252eb9aa7acb61555f3cc3c1ce4c upstream. It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation. Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes. But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more. This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against. Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix and clarify pipe read/write wakeup logic"), but at the time it seems nobody noticed. Probably because the stress-ng problem case ends up being timing-dependent too. It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") only to be broken again when by commit 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads"). And at that point the kernel test robot noticed the performance refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag below is somewhat ad hoc, but it matches when the issue was noticed. Fix it for good (knock wood) by simply making the kill_fasync() case separate from the wakeup case. FASYNC is quite rare, and we clearly shouldn't even try to use the "avoid unnecessary wakeups" logic for it. Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/ Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Reported-by: kernel test robot Tested-by: Oliver Sang Cc: Eric Biederman Cc: Colin Ian King Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman --- fs/pipe.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) --- a/fs/pipe.c +++ b/fs/pipe.c @@ -363,10 +363,9 @@ pipe_read(struct kiocb *iocb, struct iov * _very_ unlikely case that the pipe was full, but we got * no data. */ - if (unlikely(was_full)) { + if (unlikely(was_full)) wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - } + kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); /* * But because we didn't read anything, at this point we can @@ -385,12 +384,11 @@ pipe_read(struct kiocb *iocb, struct iov wake_next_reader = false; __pipe_unlock(pipe); - if (was_full) { + if (was_full) wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); - kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); - } if (wake_next_reader) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); + kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT); if (ret > 0) file_accessed(filp); return ret; @@ -565,10 +563,9 @@ pipe_write(struct kiocb *iocb, struct io * become empty while we dropped the lock. */ __pipe_unlock(pipe); - if (was_empty) { + if (was_empty) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); - kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); - } + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); __pipe_lock(pipe); was_empty = pipe_empty(pipe->head, pipe->tail); @@ -591,10 +588,9 @@ out: * Epoll nonsensically wants a wakeup whether the pipe * was already empty or not. */ - if (was_empty || pipe->poll_usage) { + if (was_empty || pipe->poll_usage) wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM); - kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); - } + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN); if (wake_next_writer) wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM); if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {