From: Andy Lutomirski
Date: Fri, 10 Feb 2023 07:14:55 -0800
Subject: Re: copy on write for splice() from file to pipe?
In-Reply-To: <20230210065747.GD2825702@dread.disaster.area>
References: <0cfd9f02-dea7-90e2-e932-c8129b6013c7@samba.org>
 <20230210021603.GA2825702@dread.disaster.area>
 <20230210040626.GB2825702@dread.disaster.area>
 <20230210065747.GD2825702@dread.disaster.area>
To: Dave Chinner
Cc: Matthew Wilcox, Linus Torvalds, Stefan Metzmacher, Jens Axboe,
 linux-fsdevel, Linux API Mailing List, io-uring,
 linux-kernel@vger.kernel.org, Al Viro, Samba Technical

On Thu, Feb 9, 2023 at 10:58 PM Dave Chinner wrote:
>
> On Fri, Feb 10, 2023 at 04:44:41AM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 10, 2023 at 03:06:26PM +1100, Dave Chinner wrote:
> > > So while I was pondering the complexity of this and watching a great
> > > big shiny rocket create lots of heat, light and noise, it occurred
> >
> > That was kind of fun
>
> :)
>
> > > to me that we already have a mechanism for preventing page cache
> > > data from being changed while the folios are under IO:
> > > SB_I_STABLE_WRITES and folio_wait_stable().
> >
> > I thought about bringing that up, but it's not quite right. That works
> > great for writeback, but it only works for writeback. We'd need to track
> > another per-folio counter ... it'd be like the page pinning kerfuffle,
> > only worse.
>
> Hmmm - I didn't think of that. It needs the counter because the
> "stable request" is per folio reference state, not per folio state,
> right? And the single flag works for writeback because we can only
> have one writeback context in progress at a time?
>
> Yeah, not sure how to deal with that easily.
>
> > And for such a rare thing it seems like a poor use of 32
> > bits of per-page state.
>
> Maybe, but zero copy file data -> network send is a pretty common
> workload. Web servers, file servers, remote backup programs, etc.
> Having files being written whilst others are reading them is not as
> common, but that does happen in a wide variety of shared file server
> environments.
>
> Regardless, I just had a couple of ideas - if they don't work, so be
> it.
>
> > Not to mention that you can effectively block
> > all writes to a file for an indefinite time by splicing pages to a pipe
> > that you then never read from.
>
> No, I wasn't suggesting that we make pages in transit stable - they
> only need to be stable while the network stack requires them to be
> stable....

This is exactly where the existing splice API is problematic. You
can't splice from a file to a network socket right now. First you
splice to a pipe, and now that pipe contains some magical stuff. And
it stays there potentially forever. Then you splice it again to a
socket.

Would this be better if user code could splice straight to a socket?
At least in principle, there could be a _limited_ amount of time
during which anything needs to wait, and it's fundamentally entirely
reasonable if a concurrent write to a file affects data being
zero-copied to a socket _during the time after the zero-copy send is
requested and before it reports completion_.

Frankly, I really don't like having non-immutable data in a pipe. A
pipe is supposed to be a thing into which bytes are written and out
from which the *same* bytes emerge, at least to the extent that
anyone can observe it. Do we really want:

$ some_program | tee file1 > file2

to potentially write different data to file1 and file2?

--Andy
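
[Editor's note: for readers following the thread, below is a minimal
userspace sketch of the two-hop zero-copy path being discussed (file
to pipe, then pipe to socket). It is illustrative only: file_fd and
sock_fd are assumed to be already-open descriptors, and error
handling, short transfers, and draining a partially-consumed pipe are
glossed over.]

/* Illustrative sketch of today's two-hop zero-copy send path.
 * The pipe ends up holding references to page cache pages (the
 * "magical stuff" mentioned above) between the two splice() calls.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int send_file_via_pipe(int file_fd, int sock_fd, size_t len)
{
	int pfd[2];

	if (pipe(pfd) < 0)
		return -1;

	while (len > 0) {
		/* Hop 1: move page references from the file into the pipe. */
		ssize_t in = splice(file_fd, NULL, pfd[1], NULL, len,
				    SPLICE_F_MOVE | SPLICE_F_MORE);
		if (in <= 0)
			break;

		/* Hop 2: push those same pages from the pipe to the socket.
		 * Until the network stack is done with them, concurrent
		 * writes to the file can change what goes out on the wire.
		 */
		ssize_t out = splice(pfd[0], NULL, sock_fd, NULL, in,
				     SPLICE_F_MOVE | SPLICE_F_MORE);
		if (out <= 0)
			break;

		len -= out;
	}

	close(pfd[0]);
	close(pfd[1]);
	return len == 0 ? 0 : -1;
}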