Subject: Re: [RFC KVM 19/27] kvm/isolation: initialize the KVM page table with core mappings
To: Andy Lutomirski, Dave Hansen
Cc: Paolo Bonzini, Radim Krcmar, Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin", Dave Hansen, Peter Zijlstra, kvm list, X86 ML, Linux-MM, LKML, Konrad Rzeszutek Wilk, jan.setjeeilers@oracle.com, Liran Alon, Jonathan Adams
References: <1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com> <1557758315-12667-20-git-send-email-alexandre.chartre@oracle.com>
From: Alexandre Chartre
Organization: Oracle Corporation
Date: Mon, 13 May 2019 19:00:31 +0200
X-Mailing-List: linux-kernel@vger.kernel.org

On 5/13/19 6:00 PM, Andy Lutomirski wrote:
> On Mon, May 13, 2019 at 8:50 AM Dave Hansen wrote:
>>
>>> +	/*
>>> +	 * Copy the mapping for all the kernel text. We copy at the PMD
>>> +	 * level since the PUD is shared with the module mapping space.
>>> +	 */
>>> +	rv = kvm_copy_mapping((void *)__START_KERNEL_map, KERNEL_IMAGE_SIZE,
>>> +			      PGT_LEVEL_PMD);
>>> +	if (rv)
>>> +		goto out_uninit_page_table;
>>
>> Could you double-check this? We (I) have had some repeated confusion
>> with the PTI code and kernel text vs. kernel data vs. __init.
>> KERNEL_IMAGE_SIZE looks to be 512MB which is quite a bit bigger than
>> kernel text.
>>
>>> +	/*
>>> +	 * Copy the mapping for cpu_entry_area and %esp fixup stacks
>>> +	 * (this is based on the PTI userland address space, but probably
>>> +	 * not needed because the KVM address space is not directly
>>> +	 * entered from userspace). They can both be copied at the P4D
>>> +	 * level since they each have a dedicated P4D entry.
>>> +	 */
>>> +	rv = kvm_copy_mapping((void *)CPU_ENTRY_AREA_PER_CPU, P4D_SIZE,
>>> +			      PGT_LEVEL_P4D);
>>> +	if (rv)
>>> +		goto out_uninit_page_table;
>>
>> cpu_entry_area is used for more than just entry from userspace. The gdt
>> mapping, for instance, is needed everywhere. You might want to go look
>> at 'struct cpu_entry_area' in some more detail.
>>
>>> +#ifdef CONFIG_X86_ESPFIX64
>>> +	rv = kvm_copy_mapping((void *)ESPFIX_BASE_ADDR, P4D_SIZE,
>>> +			      PGT_LEVEL_P4D);
>>> +	if (rv)
>>> +		goto out_uninit_page_table;
>>> +#endif
>>
>> Why are these mappings *needed*? I thought we only actually used these
>> fixup stacks for some crazy iret-to-userspace handling. We're certainly
>> not doing that from KVM context.
>>
>> Am I forgetting something?
>>
>>> +#ifdef CONFIG_VMAP_STACK
>>> +	/*
>>> +	 * Interrupt stacks are vmap'ed with guard pages, so we need to
>>> +	 * copy mappings.
>>> +	 */
>>> +	for_each_possible_cpu(cpu) {
>>> +		stack = per_cpu(hardirq_stack_ptr, cpu);
>>> +		pr_debug("IRQ Stack %px\n", stack);
>>> +		if (!stack)
>>> +			continue;
>>> +		rv = kvm_copy_ptes(stack - IRQ_STACK_SIZE, IRQ_STACK_SIZE);
>>> +		if (rv)
>>> +			goto out_uninit_page_table;
>>> +	}
>>> +
>>> +#endif
>>
>> I seem to remember that the KVM VMENTRY/VMEXIT context is very special.
>> Interrupts (and even NMIs?) are disabled. Would it be feasible to do
>> the switching in there so that we never even *get* interrupts in the KVM
>> context?
>
> That would be nicer.
>
> Looking at this code, it occurs to me that mapping the IRQ stacks
> seems questionable. As it stands, this series switches to a normal
> CR3 in some C code somewhere moderately deep in the APIC IRQ code. By
> that time, I think you may have executed traceable code, and, if that
> happens, you lose. I hate to say this, but any shenanigans like this
> patch does might need to happen in the entry code *before* even
> switching to the IRQ stack. Or perhaps shortly thereafter.
>
> We've talked about moving context tracking to C. If we go that route,
> then this KVM context mess could go there, too -- we'd have a
> low-level C wrapper for each entry that would deal with getting us
> ready to run normal C code.
>
> (We need to do something about terminology. This kvm_mm thing isn't
> an mm in the normal sense. An mm has normal kernel mappings and
> varying user mappings.
> For example, the PTI "userspace" page tables
> aren't an mm. And we really don't want a situation where the vmalloc
> fault code runs with the "kvm_mm" mm active -- it will totally
> malfunction.)
>

One of my next steps is to try to put the KVM page table in the PTI userspace
page tables, and not switch CR3 on the KVM_RUN ioctl. That way, we will run
with a regular mm (but using the userspace page table). Interrupts would then
switch CR3 to the kernel page table (as the paranoid idtentry code currently
does).

alex.