Subject: Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
Date: Fri, 1 Dec 2023 15:25:06 +0000
From: Nicolas Saenz Julienne
To: Maxim Levitsky
CC: Anel Orazgaliyeva
References: <20231108111806.92604-1-nsaenz@amazon.com> <20231108111806.92604-3-nsaenz@amazon.com> <98eee37ed7f4b7b9c16bccbe41737e47a116d1f1.camel@redhat.com>
In-Reply-To: <98eee37ed7f4b7b9c16bccbe41737e47a116d1f1.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Maxim,

On Tue Nov 28, 2023 at 6:56 AM UTC, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +0000, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC
> > ids into two. The lower bits, the physical APIC id, represent the part
> > that's exposed to the guest. The higher bits, which are private to KVM,
> > groups APICs together. APICs in different groups are isolated from each
> > other, and IPIs can only be directed at APICs that share the same group
> > as its source. Furthermore, groups are only relevant to IPIs, anything
> > incoming from outside the local APIC complex: from the IOAPIC, MSIs, or
> > PV-IPIs is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source's
> > vCPU APIC group with the ICR's destination ID and use that to resolve
> > the target lAPIC. The APIC physical map is also made group aware in
> > order to speed up this process. For the sake of simplicity, the logical
> > map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> > routing to the slower per-vCPU scan method.
> >
> > This capability serves as a building block to implement virtualisation
> > based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> > introduces a para-virtualised switch that allows for guest CPUs to jump
> > into a different execution context, this switches into a different CPU
> > state, lAPIC state, and memory protections. We model this in KVM by
> > using distinct kvm_vcpus for each context. Moreover, execution contexts
> > are hierarchical and its APICs are meant to remain functional even when
> > the context isn't 'scheduled in'. For example, we have to keep track of
> > timers' expirations, and interrupt execution of lesser priority contexts
> > when relevant. Hence the need to alias physical APIC ids, while keeping
> > the ability to target specific execution contexts.
> >
> A few general remarks on this patch (assuming that we don't go with
> the approach of a VM per VTL, in which case this patch is not needed)
>
> -> This feature has to be done in the kernel because vCPUs sharing same VTL,
>    will have same APIC ID.
>    (In addition to that, APIC state is private to a VTL so each VTL
>    can even change its apic id).
>
>    Because of this KVM has to have at least some awareness of this.
>
> -> APICv/AVIC should be supported with VTL eventually:
>    This is thankfully possible by having separate physid/pid tables per VTL,
>    and will mostly just work but needs KVM awareness.
>
> -> I am somewhat against reserving bits in apic id, because that will limit
>    the number of apic id bits available to userspace. Currently this is not
>    a problem but it might be in the future if for some reason the userspace
>    will want an apic id with high bits set.
>
>    But still things change, and with this being part of KVM's ABI, it might backfire.
>    A better idea IMHO is just to have 'APIC namespaces', which like say PID namespaces,
>    such as each namespace is isolated IPI wise on its own, and let each vCPU belong to
>    a one namespace.
>
>    In fact Intel's PRM has a brief mention of a 'hierarchical cluster' mode in which
>    roughly describes this situation in which there are multiple not interconnected APIC buses,
>    and communication between them needs a 'cluster manager device'
>
>    However I don't think that we need an explicit pairs of vCPUs and VTL awareness in the kernel
>    all of this I think can be done in userspace.
>
>    TL;DR: Lets have APIC namespace. a vCPU can belong to a single namespace, and all vCPUs
>    in a namespace send IPIs to each other and know nothing about vCPUs from other namespace.
>
>    A vCPU sending IPI to a different VTL thankfully can only do this using a hypercall,
>    and thus can be handled in the userspace.
>
>
> Overall though IMHO the approach of a VM per VTL is better unless some show stoppers show up.
> If we go with a VM per VTL, we gain APIC namespaces for free, together with AVIC support and
> such.

Thanks for the thorough review! I took note of all your design comments
(here and in subsequent patches).

I agree that the way to go is the VM per VTL approach. I'll prepare a
PoC as soon as I'm back from the holidays and share my results.

Nicolas
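
For readers following the thread, the physical-destination routing described
in the quoted commit message (OR the sender's APIC group into the ICR
destination ID) can be sketched roughly as below. This is only an
illustration in plain userspace C, not code from the patch: the struct, the
group_shift parameter and all function names are made up here, and the
sketch assumes the guest-supplied destination fits in the low bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout of a full (KVM-internal) APIC id: the low
 * group_shift bits are the guest-visible id, the remaining high bits
 * are the KVM-private group. None of these names come from the patch.
 */
struct apic_id_layout {
	unsigned int group_shift;	/* number of guest-visible id bits */
};

/* Mask selecting the group (high) bits of a full APIC id. */
static inline uint32_t apic_group_mask(const struct apic_id_layout *l)
{
	assert(l->group_shift > 0 && l->group_shift < 32);
	return (uint32_t)~((UINT32_C(1) << l->group_shift) - 1);
}

/* Group bits of a full APIC id, still in their high-bit position. */
static inline uint32_t apic_group_of(const struct apic_id_layout *l,
				     uint32_t full_id)
{
	return full_id & apic_group_mask(l);
}

/* Guest-visible part of a full APIC id. */
static inline uint32_t apic_guest_id(const struct apic_id_layout *l,
				     uint32_t full_id)
{
	return full_id & ~apic_group_mask(l);
}

/*
 * Physical-destination IPI: the target's full id is the sender's group
 * ORed with the destination taken from the ICR. icr_dest is assumed to
 * carry no group bits, since the guest only ever sees the low bits.
 */
static inline uint32_t ipi_phys_target(const struct apic_id_layout *l,
				       uint32_t src_full_id,
				       uint32_t icr_dest)
{
	return apic_group_of(l, src_full_id) | icr_dest;
}
```

With 8 guest-visible bits, a sender whose full id is 0x103 targeting ICR
destination 0x5 resolves to full id 0x105, while a group 0 sender with the
same destination resolves to 0x005, so the two never reach each other's
APICs.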