Date: Mon, 07 Dec 2020 10:55:57 -0800 (PST)
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
In-Reply-To: <20201204103336.GA7374@infradead.org>
CC: dm-devel@redhat.com, agk@redhat.com, snitzer@redhat.com, corbet@lwn.net, song@kernel.org, shuah@kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org, linux-kselftest@vger.kernel.org, kernel-team@android.com
From: Palmer Dabbelt
To: Christoph Hellwig
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?

There's a short bit about that in the cover letter (and in some talks), but I'll expand on it here -- I suppose my most important question is "is this interesting enough to take upstream?", so there should be at least a bit of a description of what it actually enables.

I don't think there are any deep fundamental advantages to doing this as opposed to nbd/iscsi over localhost/unix (or by just writing a kernel implementation, for that matter), at least in terms of anything that was previously impossible now becoming possible. There are a handful of things that are easier and/or faster, though.

dm-user looks a lot like NBD without the networking. The major difference is which side initiates messages: in NBD the kernel initiates messages, while in dm-user userspace initiates messages (via a read that will block if there is no message, but presumably we'd want to add support for non-blocking userspace implementations eventually). The NBD approach certainly makes sense for a networked system, as one generally wants to have a single storage server handling multiple clients, but inverting that makes some things simpler in dm-user.

One specific advantage of this change is that a dm-user target can be transitioned from one daemon to another without any IO errors: just spin up the second daemon, signal the first to stop requesting new messages, and let it exit. We're using that mechanism to replace the daemon launched by early init (which runs before the security subsystem is up, as in our use case dm-user provides the root filesystem) with one that's properly sandboxed (which can only be launched after the root filesystem has come up). There are ways around this (replacing the DM table, for example), but they don't fit as cleanly.
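
The handoff described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not dm-user code: a Python `queue.Queue` stands in for the kernel side, `requests.get()` stands in for the blocking read, and the names `daemon` and `handoff_demo` are hypothetical. The point it shows is that when the consumer pulls messages, the old daemon can simply stop pulling and exit, and nothing it has not pulled is ever lost.

```python
import queue
import threading


def daemon(name, requests, stop_requesting, completed):
    """Pull requests until told to stop; always finish a request once pulled."""
    while not stop_requesting.is_set():
        try:
            # Stand-in for the blocking read that fetches the next message.
            req = requests.get(timeout=0.01)
        except queue.Empty:
            continue
        completed.append((name, req))  # "handle" the request
        requests.task_done()


def handoff_demo(num_requests=200):
    requests = queue.Queue()  # stand-in for the kernel's pending messages
    completed = []
    stop_old = threading.Event()
    stop_new = threading.Event()
    old = threading.Thread(target=daemon, args=("old", requests, stop_old, completed))
    new = threading.Thread(target=daemon, args=("new", requests, stop_new, completed))

    old.start()
    for i in range(num_requests // 2):
        requests.put(i)

    new.start()      # spin up the second daemon
    stop_old.set()   # signal the first to stop requesting new messages
    old.join()       # ...and let it exit

    for i in range(num_requests // 2, num_requests):
        requests.put(i)

    requests.join()  # returns once every queued request has been completed
    stop_new.set()
    new.join()
    return completed
```

Every request ends up completed by exactly one of the two daemons, regardless of how the threads interleave, because a request only leaves the queue when a daemon asks for it.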

Unless I'm missing something, NBD servers aren't capable of that style of transition: soft disconnects can only be initiated by the client (the kernel, in this case), which leaves no way for the server to transition while guaranteeing that no IOs error out. It's usually possible to shoehorn this sort of direction-reversing concept into network protocols, but it's also usually ugly (I'm thinking of IDLE, for example). I didn't try to actually do it, but my guess would be that adding a way for the server to ask the client to stop sending messages until a new server shows up would be at least as much work as doing this.

There are also a handful of possible performance advantages, but I haven't gone through the work to prove any of them out yet as performance isn't all that important for our first use case. For example:

* Cutting out the network stack is unlikely to hurt performance. I'm not sure if it will help performance, though. I think if we really had a workload where the extra copy was likely to be an issue we'd want an explicit ring buffer, but I have a theory that it would be possible to get very good performance out of a stream-style API by using multiple channels and relying on io_uring to plumb through multiple ops per channel.
* There's a comment in the implementation about allowing userspace to insert itself into user_map(), likely by uploading a BPF fragment. There's a whole class of interesting block devices that could be written in this fashion: essentially you keep a cache on a regular block device that handles the common cases by remapping BIOs and passing them along, relegating the more complicated logic to fetch cache misses and watching some subset of the access stream where necessary. We have a use case like this in Android, where we opportunistically store backups in a portion of the TRIM'd space on devices. It's currently implemented entirely in the kernel by the dm-bow target, but IIUC that was deemed too Android-specific to merge.
Assuming we could get good enough performance we could move that logic to userspace, which lets us shrink our diff with upstream. It feels like some other interesting block devices could be written in a similar fashion.

All in all, I've found it a bit hard to figure out what sort of interest people have in dm-user: when I bring this up I seem to run into people who've done similar things before and are vaguely interested, but certainly nobody is chomping at the bit. I'm sending it out in this early state to try and figure out if it's interesting enough to keep going.
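
For readers trying to picture the cache-style device from the second bullet above, here is a toy model of the hit/miss split. It is a sketch under stated assumptions and not the proposed implementation: the source suggests a BPF fragment hooked into user_map(), whereas this is plain Python, and `RemapCache`, `map_sector`, and `slow_path` are hypothetical names. It models only the idea that common-case accesses are remapped directly while misses are handed to more complicated logic.

```python
from dataclasses import dataclass, field


@dataclass
class RemapCache:
    """Toy model: maps a sector to a slot on a regular backing device."""
    table: dict = field(default_factory=dict)
    next_free: int = 0
    hits: int = 0
    misses: int = 0

    def map_sector(self, sector, slow_path):
        """Return the cache slot for a sector, calling slow_path on a miss."""
        if sector in self.table:
            # Common case: remap and pass the request along.
            self.hits += 1
            return self.table[sector]
        # Miss: hand off to the more complicated logic (hypothetically the
        # userspace daemon), then remember where the data was placed.
        self.misses += 1
        slow_path(sector)
        slot = self.next_free
        self.next_free += 1
        self.table[sector] = slot
        return slot
```

A short usage example: repeated accesses to the same sector reach the slow path only once.

```python
cache = RemapCache()
fetched = []
slots = [cache.map_sector(s, fetched.append) for s in [7, 7, 3, 7, 3]]
```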