Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2A6D5C636D7 for ; Fri, 10 Feb 2023 19:56:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233504AbjBJT4F (ORCPT ); Fri, 10 Feb 2023 14:56:05 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37862 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233051AbjBJT4D (ORCPT ); Fri, 10 Feb 2023 14:56:03 -0500 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8162F5599 for ; Fri, 10 Feb 2023 11:56:02 -0800 (PST) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 17187B825DA for ; Fri, 10 Feb 2023 19:56:01 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id BAD2CC4339C for ; Fri, 10 Feb 2023 19:55:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1676058959; bh=/rbtwGkm6KjNDPlncngu4DRJNGQVNjUWrQ5+Z6JNEmg=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=dkQHuiF8P/dgaBGd5YJx4xzu3r7trh3/L6g9QQyiESG614aE8kO0FveBLQhhbngZc cGqz0YYE/tUmjT5Z7If+FDBH2GXkOmkpN5WxsYX0iyP1MK29A+AHro0b+C6MjvppY+ Ht7bV5kq+1S33oRfw2XZ7R0ZzWHldTquToqimapb+Bl8ynj9Bltw8tb/b39/FKpKqZ b/k5TZdVIY2s71ynKKWU0tm0QWgB/GexCz4DWX753hq69rDox9qbLhyQzrEj4ey6aH 3DlBRy1q/1yPEMO7TN3aElallOhM3pOp+KDzw3JMdKFKavBSJnHTGY5GuNnN82OsOi wwO4sQ8qOtRzQ== Received: by mail-ed1-f45.google.com with SMTP id da9so5645877edb.12 for ; Fri, 10 Feb 2023 11:55:59 -0800 (PST) X-Gm-Message-State: AO0yUKXAAIpe5kAUVHtbuoBVuoJQKKIYDoigtf+ckfy08cj5dMaVIkIw Co11QLKTzOWHjSdqG6D5gFTNy+Cz1vRLLYlH2ZZYcQ== X-Google-Smtp-Source: AK7set8T+tp2IZwVFvcjeL5tlcMarKtooKEDf9AGe53UI8xtD8A+U4cxUUUA2tv3tXXGolmJnsFOG8/Tj0Yb/wcb2uQ= X-Received: by 2002:a50:ba8d:0:b0:4ab:1c64:a9ed with SMTP id x13-20020a50ba8d000000b004ab1c64a9edmr1535489ede.2.1676058958001; Fri, 10 Feb 2023 11:55:58 -0800 (PST) MIME-Version: 1.0 References: <0cfd9f02-dea7-90e2-e932-c8129b6013c7@samba.org> <20230210021603.GA2825702@dread.disaster.area> <20230210040626.GB2825702@dread.disaster.area> <20230210065747.GD2825702@dread.disaster.area> In-Reply-To: From: Andy Lutomirski Date: Fri, 10 Feb 2023 11:55:45 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: copy on write for splice() from file to pipe? To: Linus Torvalds Cc: Andy Lutomirski , Dave Chinner , Matthew Wilcox , Stefan Metzmacher , Jens Axboe , linux-fsdevel , Linux API Mailing List , io-uring , "linux-kernel@vger.kernel.org" , Al Viro , Samba Technical Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Feb 10, 2023 at 11:18 AM Linus Torvalds wrote: > > On Fri, Feb 10, 2023 at 11:02 AM Andy Lutomirski wrote: > > > > Second, either make splice more strict or add a new "strict splice" > > variant. Strict splice only completes when it can promise that writes > > to the source that start after strict splice's completion won't change > > what gets written to the destination. > > The thing ius, I think your "strict splice" is pointless and wrong. > > It's pointless, because it simply means that it won't perform well. > > And since the whole point of splice was performance, it's wrong. > > I really think the whole "source needs to be stable" is barking up the > wrong tree. > > You are pointing fingers at splice(). > > And I think that's wrong. > > We should point the fingers at either the _user_ of splice - as Jeremy > Allison has done a couple of times - or we should point it at the sink > that cannot deal with unstable sources. > > Because that whole "source is unstable" is what allows for that higher > performance. The moment you start requiring stability, you _will_ lose > it. You will have to lock the page, you'll have to umap it from any > shared mappings, etc etc. And even if there are no writers, or no > current mappers, all that effort to make sure that is the case is > actually fairly expensive. ... > Because I really think that your "strict splice" model would just mean > that now the kernel would have to add not just a memcpy, but also a > new allocation for that new stable buffer for the memcpy, and that > would all just be very very pointless. > > Alternatively, it would require some kind of nasty hard locking > together with other limitations on what can be done by non-splice > users. I could be wrong, but I don't think any of this is necessary. My strict splice isn't intended to be any more stable than current splice -- it's intended to complete more slowly and more informatively. Now maybe I'm wrong and the impleentation would be nasty, but I think that the only bookkeeping needed is to arrange strict-splice to not complete until the kernel is done with the source's page cache. The use of the source is refcounted already, and a bit of extra work might be needed to track which strict-splice the reference came from, but unless I've missed something, it's not crazy. Looking at the current splice implementaiton, a splice that isn't "strictly completed" is sort of represented by a struct pipe_buffer (I think). The actual implementation of strict-splice might consist of separating pipe_buffer out from a pipe and adding an io_kiocb* and a refcount to it. Or maybe even just adding an io_kiocb* and making the existing refcouting keep also track the io_kiocb*, but that might be complicated. This all boils down to tracking an actual splice all the way through its lifecycle and not reporting it as done until it's all the way done. Anything else is icing on the cake, no? There is absolutely no need to lock files or make page-cache pages immutable or anything like that. i think this is almost exactly what Jeremy and Stefan are asking for re: notification when the system is done with a zero-copy send: > What might be helpful in addition would be some kind of notification that all pages are no longer used by the network layer, IORING_OP_SENDMSG_ZC already supports such a notification, maybe we can build something similar.