In-Reply-To: <20201018115524-mutt-send-email-mst@kernel.org>
From: Andy Lutomirski
Date: Sun, 18 Oct 2020 09:14:00 -0700
Subject: Re: [PATCH] drivers/virt: vmgenid: add vm generation id driver
To: "Michael S. Tsirkin"
Cc: Andy Lutomirski, "Jason A. Donenfeld", Jann Horn, Willy Tarreau,
    Colm MacCarthaigh, "Catangiu, Adrian Costin", "Theodore Y. Ts'o",
    Eric Biggers, "open list:DOCUMENTATION", kernel list,
    "open list:VIRTIO GPU DRIVER", "Graf (AWS), Alexander",
    "Woodhouse, David", bonzini@gnu.org, "Singh, Balbir", "Weiss, Radu",
    oridgar@gmail.com, ghammer@redhat.com, Jonathan Corbet,
    Greg Kroah-Hartman, Qemu Developers, KVM list, Michal Hocko,
    "Rafael J. Wysocki", Pavel Machek, Linux API
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Oct 18, 2020 at 8:59 AM Michael S. Tsirkin wrote:
>
> On Sun, Oct 18, 2020 at 08:54:36AM -0700, Andy Lutomirski wrote:
> > On Sun, Oct 18, 2020 at 8:52 AM Michael S. Tsirkin wrote:
> > >
> > > On Sat, Oct 17, 2020 at 03:24:08PM +0200, Jason A. Donenfeld wrote:
> > > > 4c. The guest kernel maintains an array of physical addresses that are
> > > > MADV_WIPEONFORK. The hypervisor knows about this array and its
> > > > location through whatever protocol, and before resuming a
> > > > moved/snapshotted/duplicated VM, it takes the responsibility for
> > > > memzeroing this memory. The huge pro here would be that this
> > > > eliminates all races, and reduces complexity quite a bit, because the
> > > > hypervisor can perfectly synchronize its bringup (and SMP bringup)
> > > > with this, and it can even optimize things like on-disk memory
> > > > snapshots to simply not write out those pages to disk.
> > > >
> > > > A 4c-like approach seems like it'd be a lot of bang for the buck -- we
> > > > reuse the existing mechanism (MADV_WIPEONFORK), so there's no new
> > > > userspace API to deal with, and it'd be race free, and eliminate a lot
> > > > of kernel complexity.
> > >
> > > Clearly this has a chance to break applications, right?
> > > If there's an app that uses this as a non-system-calls way
> > > to find out whether there was a fork, it will break
> > > when wipe triggers without a fork ...
> > > For example, imagine:
> > >
> > > MADV_WIPEONFORK
> > > copy secret data to MADV_DONTFORK
> > > fork
> > >
> > >
> > > used to work, with this change it gets 0s instead of the secret data.
> > >
> > >
> > > I am also not sure it's wise to expose each guest process
> > > to the hypervisor like this. E.g. each process needs a
> > > guest physical address of its own then. This is a finite resource.
> > >
> > >
> > > The mmap interface proposed here is somewhat baroque, but it is
> > > certainly simple to implement ...
> >
> > Wipe of fork/vmgenid/whatever could end up being much more problematic
> > than it naively appears -- it could be wiped in the middle of a read.
> > Either the API needs to handle this cleanly, or we need something more
> > aggressive like signal-on-fork.
> >
> > --Andy
>
>
> Right, it's not on fork, it's actually when process is snapshotted.
>
> If we assume it's CRIU we care about, then I
> wonder what's wrong with something like
> MADV_CHANGEONPTRACE_SEIZE
> and basically say it's X bytes which change the value...

I feel like we may be approaching this from the wrong end. Rather
than saying "what data structure can the kernel expose that might
plausibly be useful", how about we try identifying some specific
userspace needs and see what a good solution could look like. I can
identify two major cryptographic use cases:

1. A userspace RNG. The API exposed by the userspace end is a function
that generates random numbers.
The userspace code in turn wants to know some things from the kernel:
it wants some best-quality-available random seed data from the kernel
(and possibly an indication of how good it is) as well as an
indication of whether the userspace memory may have been cloned or
rolled back, or, failing that, an indication of whether a reseed is
needed. Userspace could implement a wide variety of algorithms on top
depending on its goals and compliance requirements, but the end goal
is for the userspace part to be very, very fast.

2. A userspace crypto stack that wants to avoid shooting itself in the
foot due to inadvertently doing the same thing twice. For example, an
AES-GCM stack does not want to reuse an IV, *especially* if there is
even the slightest chance that it might reuse the IV for different
data. This use case doesn't necessarily involve random numbers, but,
if anything, it needs to be even faster than #1.

The threats here are not really the same.

For #1, a userspace RNG should be able to recover from a scenario in
which an adversary clones the entire process *and gets to own the
clone*. For example, in Android, an adversary can often gain complete
control of a fork of the zygote -- this shouldn't adversely affect the
security properties of other forks. Similarly, a server farm could
operate by having one booted server that is cloned to create more
workers. Those clones could be provisioned with secrets and
permissions post-clone, and an attacker gaining control of a fresh
clone could be considered acceptable.

For #2, in contrast, if an adversary gains control of a clone of an
AES-GCM session, they learn the key outright -- the relevant attack
scenario is that the adversary gets to interact with two clones
without compromising either clone per se.

It's worth noting that, in both cases, there could possibly be more
than one instance of an RNG or an AES-GCM session in the same process.
This means that using signals is awkward but not necessarily impossible.
(This is an area in which Linux, and POSIX in general, is much weaker than Windows.)
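
As a rough illustration of the two use cases above, here is a minimal
Python sketch. Everything in it is an assumption made for the example:
read_generation is a hypothetical callable standing in for whatever
clone/rollback indicator the kernel might eventually expose, the
class names are made up, and the hash-and-counter construction is a
toy, not a vetted DRBG. It only shows the control flow: the RNG
reseeds when the indicator changes and discards output that straddles
a change, while the nonce guard refuses to hand out another IV once
the session may exist twice.

```python
import hashlib
import os


class CloneAwareRng:
    """Toy generator for use case #1: reseed when the generation changes."""

    def __init__(self, read_generation, read_seed=os.urandom):
        # read_generation is HYPOTHETICAL; no such kernel interface is
        # defined in this thread. read_seed(n) returns n bytes of seed data.
        self._read_generation = read_generation
        self._read_seed = read_seed
        self._generation = None  # forces a seed on first use
        self._key = b""
        self._counter = 0

    def _reseed(self, generation):
        # Fold fresh seed material into the old key and restart the counter.
        self._key = hashlib.sha256(self._key + self._read_seed(32)).digest()
        self._generation = generation
        self._counter = 0

    def random_bytes(self, n):
        while True:
            generation = self._read_generation()
            if generation != self._generation:
                self._reseed(generation)
            out = b""
            while len(out) < n:
                block_input = self._key + self._counter.to_bytes(8, "little")
                out += hashlib.sha256(block_input).digest()
                self._counter += 1
            # Re-check after generating, so output produced across a
            # generation change is thrown away and regenerated.
            if self._read_generation() == generation:
                return out[:n]


class NonceGuard:
    """Toy guard for use case #2: never continue an IV sequence after a clone."""

    def __init__(self, read_generation):
        self._read_generation = read_generation
        self._generation = read_generation()
        self._counter = 0

    def next_nonce(self):
        if self._read_generation() != self._generation:
            # The session state may now exist in two places; force a rekey.
            raise RuntimeError("generation changed; rekey before encrypting")
        nonce = self._counter.to_bytes(12, "big")  # 96-bit counter nonce
        self._counter += 1
        return nonce
```

Even with the second check, a clone can still land between the final
comparison and the caller actually using the bytes or the nonce, so a
polled indicator narrows the window rather than closing it. Each
object carries its own copy of the last generation it saw, so several
RNGs or sessions can coexist in one process without sharing a signal
handler.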