From: Andy Lutomirski
Date: Fri, 10 Feb 2023 11:01:46 -0800
Subject: Re: copy on write for splice() from file to pipe?
To: Linus Torvalds
Cc: Andy Lutomirski, Dave Chinner, Matthew Wilcox, Stefan Metzmacher,
    Jens Axboe, linux-fsdevel, Linux API Mailing List, io-uring,
    "linux-kernel@vger.kernel.org", Al Viro, Samba Technical

On Fri, Feb 10, 2023 at 10:37 AM Linus Torvalds wrote:
>
> On Fri, Feb 10, 2023 at 9:57 AM Andy Lutomirski wrote:
>
> I'm not convinced your suggestion of extending io_uring with new
> primitives is any better in practice, though.

I don't know if I'm really suggesting new primitives. I think I'm
making two change suggestions that go together.

First, let splice() and IORING_OP_SPLICE copy (or zero-copy) data from
a file to a socket.

Second, either make splice more strict or add a new "strict splice"
variant. Strict splice only completes when it can promise that writes
to the source that start after strict splice's completion won't change
what gets written to the destination.

I think that strict splice fixes Stefan's use case. It's also easier
to reason about than regular splice.

The major caveat here is that zero-copy strict splice is fundamentally
a potentially long-running operation in a way that zero-copy splice()
isn't right now.
So the combination of O_NONBLOCK and strict splice() (the
syscall, not necessarily the io_uring operation) to something like a
TCP socket requires complicated locking or change tracking to make
sense. This means that a splice() syscall providing strict semantics
to a TCP socket may just need to do a copy, at least in many cases.
But maybe that's fine -- very-high-performance networking is moving
pretty aggressively to io_uring anyway.

And my possibly-quite-out-there claim is that, if Linux implements
strict splice, maybe non-strict splice could get replaced in a user
ABI-compatible manner with a much simpler non-zero-copy
implementation. And strict splice from a file to a pipe could be
implemented as a copy -- high performance users can, if needed, start
strict-splicing from a file directly to a socket.
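
To make the "implemented as a copy" idea concrete, here is a rough
userspace sketch of the guarantee I mean. It is only a model, not
kernel code and not a proposed ABI: strict_splice_by_copy() and demo()
are hypothetical names, and the copy is done with plain pread()/write().
Because the data lands in a private buffer before the call returns, a
write to the source that starts afterwards cannot change what the pipe
reader sees.

```python
import os
import tempfile


def strict_splice_by_copy(src_fd, dst_fd, offset, length, chunk=65536):
    """Hypothetical helper modelling a copy-based strict splice.

    Copies up to `length` bytes from `src_fd` (starting at `offset`)
    into `dst_fd` and returns the number of bytes transferred. Every
    byte is copied out of the source before this function returns.
    """
    done = 0
    while done < length:
        buf = os.pread(src_fd, min(chunk, length - done), offset + done)
        if not buf:  # reached end of file early
            break
        view = memoryview(buf)
        while view:  # handle short writes to the destination
            written = os.write(dst_fd, view)
            view = view[written:]
        done += len(buf)
    return done


def demo():
    """Transfer file data to a pipe, then overwrite the file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"old contents")
        f.flush()
        src = f.fileno()
        r, w = os.pipe()
        try:
            sent = strict_splice_by_copy(src, w, 0, 12)
            # This write starts after the transfer completed, so under
            # strict semantics it must not affect the pipe contents.
            os.pwrite(src, b"NEW", 0)
            return sent, os.read(r, 64)
        finally:
            os.close(r)
            os.close(w)
```

Calling demo() returns (12, b"old contents"): the later overwrite of
the file is not visible through the pipe. A zero-copy implementation
would have to provide the same observable behavior, which is where the
locking or change tracking mentioned above comes in.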