From: Andy Lutomirski
Date: Tue, 14 May 2019 08:23:48 -0700
Subject: Re: [RFC KVM 18/27] kvm/isolation: function to copy page table entries for percpu buffer
To: Alexandre Chartre
Cc: Peter Zijlstra, Andy Lutomirski, Paolo Bonzini, Radim Krcmar, Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin", Dave Hansen, kvm list, X86 ML, Linux-MM, LKML, Konrad Rzeszutek Wilk, jan.setjeeilers@oracle.com, Liran Alon, Jonathan Adams
In-Reply-To: <4e7d52d7-d4d2-3008-b967-c40676ed15d2@oracle.com>
References: <1557758315-12667-1-git-send-email-alexandre.chartre@oracle.com> <1557758315-12667-19-git-send-email-alexandre.chartre@oracle.com> <20190514070941.GE2589@hirez.programming.kicks-ass.net> <4e7d52d7-d4d2-3008-b967-c40676ed15d2@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 14, 2019 at 2:42 AM Alexandre Chartre wrote:
>
> On 5/14/19 10:34 AM, Andy Lutomirski wrote:
> >
> >> On May 14, 2019, at 1:25 AM, Alexandre Chartre wrote:
> >>
> >>> On 5/14/19 9:09 AM, Peter Zijlstra wrote:
> >>>> On Mon, May 13, 2019 at 11:18:41AM -0700, Andy Lutomirski wrote:
> >>>> On Mon, May 13, 2019 at 7:39 AM Alexandre Chartre wrote:
> >>>>>
> >>>>> pcpu_base_addr is already mapped to the KVM address space, but this
> >>>>> represents the first percpu chunk. To access a per-cpu buffer not
> >>>>> allocated in the first chunk, add a function which maps all cpu
> >>>>> buffers corresponding to that per-cpu buffer.
> >>>>>
> >>>>> Also add a function to clear page table entries for a percpu buffer.
> >>>>>
> >>>> This needs some kind of clarification so that readers can tell whether
> >>>> you're trying to map all percpu memory or just map a specific
> >>>> variable.
> >>>> In either case, you're making a dubious assumption that
> >>>> percpu memory contains no secrets.
> >>>
> >>> I'm thinking the per-cpu random pool is a secret. IOW, it demonstrably
> >>> does contain secrets, invalidating that premise.
> >>
> >> The current code unconditionally maps the entire first percpu chunk
> >> (pcpu_base_addr). So it assumes it doesn't contain any secret. That is
> >> mainly a simplification for the POC because a lot of core information
> >> that we need, for example just to switch mm, is stored there (like
> >> cpu_tlbstate, current_task...).
> >
> > I don't think you should need any of this.
> >
>
> At the moment, the current code does need it. Otherwise it can't switch from
> kvm mm to kernel mm: switch_mm_irqs_off() will fault accessing "cpu_tlbstate",
> and then the page fault handler will fail accessing "current" before calling
> the kvm page fault handler. So it will double fault or loop on page faults.
> There are many different places where percpu variables are used, and I have
> experienced many double faults/page fault loops because of that.

Now you're experiencing what working on the early PTI code was like :)

This is why I think you shouldn't touch current in any of this.

> >> If the entire first percpu chunk effectively has secrets then we will
> >> need to individually map only the buffers we need. The
> >> kvm_copy_percpu_mapping() function is added to copy the mapping for a
> >> specified percpu buffer, so this is used to map percpu buffers which
> >> are not in the first percpu chunk.
> >>
> >> Also note that mapping is constrained by PTE (4K), so mapped buffers
> >> (percpu or not) which do not fill a whole set of pages can leak adjacent
> >> data stored on the same pages.
> >>
> >
> > I would take a different approach: figure out what you need and put it in its
> > own dedicated area, kind of like cpu_entry_area.
>
> That's certainly something we can do, like Julian proposed with
> "Process-local memory allocations": https://lkml.org/lkml/2018/11/22/1240
>
> That's fine for buffers allocated from KVM, however, we will still need some
> core kernel mappings so the thread can run and interrupts can be handled.
>
> > One nasty issue you'll have is vmalloc: the kernel stack is in the
> > vmap range, and, if you allow access to vmap memory at all, you'll
> > need some way to ensure that *unmap* gets propagated. I suspect the
> > right choice is to see if you can avoid using the kernel stack at all
> > in isolated mode. Maybe you could run on the IRQ stack instead.
>
> I am currently just copying the task stack mapping into the KVM page table
> (patch 23) when a vcpu is created:
>
>     err = kvm_copy_ptes(tsk->stack, THREAD_SIZE);
>
> And this seems to work. I am clearing the mapping when the VM vcpu is freed,
> so I am making the assumption that the same task is used to create and free
> a vcpu.
>

vCPUs are bound to an mm but not a specific task, right?  So I think
this is wrong in both directions.

Suppose a vCPU is created, then the task exits, the stack mapping gets
freed (the core code tries to avoid this, but it does happen), and a
new stack gets allocated at the same VA with different physical pages.
Now you're toast :)  On the flip side, wouldn't you crash if a vCPU is
created and then run on a different thread?

How important is the ability to enable IRQs while running with the KVM
page tables?