Subject: Re: [PATCH] KVM: arm/arm64: fix unaligned hva start and end in handle_hva_to_gpa
To: Jia He, Marc Zyngier, Christoffer Dall,
 linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu
Cc: linux-kernel@vger.kernel.org, Jia He, li.zhang@hxt-semitech.com,
 hughd@google.com, Andrea Arcangeli, Minchan Kim, Claudio Imbrenda,
 Arvind Yadav, Mike Rapoport, akpm@linux-foundation.org
References: <1525244911-5519-1-git-send-email-hejianet@gmail.com>
 <04e6109f-cbbd-24b0-03bb-9247b930d42c@arm.com>
 <85e04362-05dd-c697-e9c4-ad5824e63819@gmail.com>
 <0f134188-12d5-7184-3fbd-0ec8204cf649@arm.com>
 <695beacb-ff51-bb2c-72ef-d268f7d4e59d@arm.com>
 <1e065e9b-4dad-611d-fc5b-26fe6c031507@gmail.com>
 <14954b3a-6a2a-fe05-47b7-1890375ab8a4@arm.com>
 <5efb2da1-fed8-443b-d06b-cc03152b2c2e@gmail.com>
From: Suzuki K Poulose
Message-ID: <09c30e43-7107-395c-bde3-0310bbdbca91@arm.com>
Date: Tue, 15 May 2018 17:21:22 +0100
In-Reply-To: <5efb2da1-fed8-443b-d06b-cc03152b2c2e@gmail.com>

Hi Jia,

On 15/05/18 14:15, Jia He wrote:
>
> On 5/15/2018 8:38 PM, Jia He wrote:
>> Hi Suzuki,
>>
>> On 5/15/2018 4:36 PM, Suzuki K Poulose wrote:
>>> Hi Jia,
>>>
>>> On 05/15/2018 03:03 AM, Jia He wrote:
>>>> Hi Suzuki,
>>>>
>>>> I will merge the other thread into this one and add the necessary Cc
>>>> list.
>>>>
>>>> That WARN_ON call trace is very easy to reproduce on my armv8a server
>>>> after I start 20 guests and run memhog in the host. Of course, KSM has
>>>> to be enabled.
>>>>
>>>> For your question about my inject-fault debug patch:
>>>
>>> Thanks for the patch, comments below.
>>>
>>> ...
>>>
>>>> index 7f6a944..ab8545e 100644
>>>> --- a/virt/kvm/arm/mmu.c
>>>> +++ b/virt/kvm/arm/mmu.c
>>>> @@ -290,12 +290,17 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
>>>>    * destroying the VM), otherwise another faulting VCPU may come in and mess
>>>>    * with things behind our backs.
>>>>    */
>>>> +extern int trigger_by_ksm;
>>>>   static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>>>>   {
>>>>          pgd_t *pgd;
>>>>          phys_addr_t addr = start, end = start + size;
>>>>          phys_addr_t next;
>>>>
>>>> +       if (trigger_by_ksm) {
>>>> +               end -= 0x200;
>>>> +       }
>>>> +
>>>>          assert_spin_locked(&kvm->mmu_lock);
>>>>          pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>>>          do {
>>>>
>>>> I need to point out that I never reproduced it without this debugging
>>>> patch.
>>>
>>> That could trigger the panic iff your "size" <= 0x200, leading to the
>>> condition (end < start), which can make the loop run forever: the
>>> walkers loop until (addr == end), so we end up accessing something
>>> which may not be a PGD entry and thus get a bad page with bad numbers
>>> all around. This case could be hit only with your change combined with
>>> the bug in KSM, which gives us an address near the page boundary.
>>
>> No, I injected the fault on purpose to simulate the case where size is
>> less than PAGE_SIZE (e.g. PAGE_SIZE - 0x200 = 65024).
>> I did get the panic info [1] *without* the debugging patch, but only
>> once.
>>
>> [1] https://lkml.org/lkml/2018/5/9/992
>>
>>> So, I think we can safely ignore the PANIC().
>>> More below.

I am a bit confused now. Do you mean the panic was triggered *without* the
debugging patch? If that is the case, it is really worrying.

>>>>>> Suzuki, thanks for the comments.
>>>>>>
>>>>>> I proposed another ksm patch: https://lkml.org/lkml/2018/5/3/1042
>>>>>> The root cause is that ksm adds some extra flags to the address to
>>>>>> indicate whether the page is in the stable tree. This makes the
>>>>>> address no longer aligned to PAGE_SIZE.
>>>>>
>>>>> Thanks for the pointer. In the future, please Cc the people relevant
>>>>> to the discussion in the patches.
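[To make the root cause above concrete: in mm/ksm.c, rmap_item->address
keeps flag bits (STABLE_FLAG was 0x200 at the time) in the sub-page bits of
an otherwise page-aligned address, so a consumer that forgets to mask them
off hands an unaligned hva to the MMU notifiers. The following standalone C
sketch is an illustration, not kernel code; struct rmap_item_sketch and the
4K PAGE_SIZE are invented stand-ins. Note that 0x12340000 | STABLE_FLAG
yields exactly the 0x...200 offset discussed in this thread.]

#include <stdio.h>

#define PAGE_SIZE   0x1000UL            /* illustrative 4K page */
#define PAGE_MASK   (~(PAGE_SIZE - 1))
#define STABLE_FLAG 0x200UL             /* value as in mm/ksm.c of that era */

struct rmap_item_sketch {
	unsigned long address;          /* low bits double as flags */
};

int main(void)
{
	struct rmap_item_sketch it = { .address = 0x12340000UL | STABLE_FLAG };

	unsigned long buggy = it.address;             /* 0x12340200, unaligned */
	unsigned long fixed = it.address & PAGE_MASK; /* 0x12340000, aligned */

	printf("unmasked hva: 0x%lx -> %s\n", buggy,
	       (buggy & ~PAGE_MASK) ? "NOT page aligned" : "page aligned");
	printf("masked hva:   0x%lx -> %s\n", fixed,
	       (fixed & ~PAGE_MASK) ? "NOT page aligned" : "page aligned");
	return 0;
}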
>>>>>> From the arm KVM MMU point of view, do you think handle_hva_to_gpa
>>>>>> still needs to handle the unaligned case?
>>>>>
>>>>> I don't think we should do that. Had we done this, we would never have
>>>>> caught this bug in KSM. Eventually, if some other new implementation
>>>>> comes up with a new notifier consumer which doesn't check alignment
>>>>> and doesn't WARN, it could simply do the wrong thing. So I believe
>>>>> what we have is a good measure to make sure that things are in the
>>>>> right order.
>>>>>
>>>>>> IMO, the PAGE_SIZE alignment is still needed, because we should not
>>>>>> let the bottom function kvm_age_hva_handler handle the exception.
>>>>>> Please refer to the implementations of kvm_handle_hva_range() on x86
>>>>>> and powerpc. They both align the hva via hva_to_gfn_memslot.
>>>>>
>>>>> From an API perspective, you are passed a "start" and "end" address.
>>>>> So you could potentially do the wrong thing if you align the "start"
>>>>> and "end". Maybe those handlers should do the same thing as we do.
>>>
>>>> But handle_hva_to_gpa has possibly already partially adjusted the
>>>> alignment:
>>>>
>>>>     1750         kvm_for_each_memslot(memslot, slots) {
>>>>     1751                 unsigned long hva_start, hva_end;
>>>>     1752                 gfn_t gpa;
>>>>     1753
>>>>     1754                 hva_start = max(start, memslot->userspace_addr);
>>>>     1755                 hva_end = min(end, memslot->userspace_addr +
>>>>     1756                             (memslot->npages << PAGE_SHIFT));
>>>>
>>>> At line 1755, let us assume that end = 0x12340200 and
>>>> memslot->userspace_addr + (memslot->npages << PAGE_SHIFT) = 0x12340000.
>>>> Then hva_start is not PAGE_SIZE aligned, hva_end is aligned, and the
>>>> size will be PAGE_SIZE - 0x200, just as in my inject-fault debugging
>>>> patch.
>>>
>>> That's because we want to limit the handling of the hva/gpa range by
>>> memslot. So we make sure we pass on the range within the given memslot
>>> to hva_to_gfn_memslot(). But we do iterate over the next memslot if the
>>> original range falls into the next slot. So, in practice, there is no
>>> alignment/trimming of the range. It's just that we pass on the
>>> appropriate range for each slot.
>>
>> Yes, I understand what the code does in hva_to_gfn_memslot(). What I
>> mean is that hva_end may be changed, so (hva_end - hva_start) will not
>> be the same as the parameter _size_:
>>
>>> ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
>>
>> Anyway, I have to admit that all the exceptions are originally caused by
>> the STABLE_FLAG in the ksm code. What I want to discuss here is how to
>> make arm kvm handle the exception more gracefully.

That's my point. We need the fix in ksm. Once we have the fix in place, do
we hit the WARN_ON() any more?

> Hi Suzuki,
> How about this patch (balancing avoidance of the WARN_ON storm against
> debugging convenience):

The problem with WARN_ON_ONCE() is that it could suppress the different
call paths that lead to the trigger. E.g., unmap_stage2_range() can be
called from several paths, and a single WARN_ON_ONCE() in it could hide
all but the first offending path.
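[To make the "once" semantics concrete before the patch itself, quoted
below: the kernel's WARN_ON_ONCE() latches a static flag per call site, so
one check inside a shared helper reports only the first path that trips it.
A minimal userspace emulation follows; it uses GNU C statement expressions
like the kernel macro, and WARN_ON_ONCE_EMU and shared_helper are invented
names for illustration.]

#include <stdbool.h>
#include <stdio.h>

/* Userspace emulation of WARN_ON_ONCE(): one static flag per call site. */
#define WARN_ON_ONCE_EMU(cond) ({					\
	static bool __warned;						\
	bool __cond = (cond);						\
	if (__cond && !__warned) {					\
		__warned = true;					\
		fprintf(stderr, "WARNING at %s:%d\n", __FILE__, __LINE__); \
	}								\
	__cond;								\
})

/* Shared helper, analogous to unmap_stage2_range(): one check, many callers. */
static void shared_helper(const char *caller, unsigned long size)
{
	/* stands in for WARN_ON_ONCE(size & ~PAGE_MASK) */
	if (WARN_ON_ONCE_EMU(size & 0xfffUL))
		fprintf(stderr, "  unaligned size via %s\n", caller);
}

int main(void)
{
	shared_helper("call path A", 0xe00); /* WARNING printed for this path */
	shared_helper("call path B", 0xe00); /* condition still detected, but no
						WARNING (and, in the kernel, no
						backtrace) for this second path */
	return 0;
}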
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 7f6a944..4033946 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -297,6 +297,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  	phys_addr_t next;
>  
>  	assert_spin_locked(&kvm->mmu_lock);
> +
> +	WARN_ON_ONCE(size & ~PAGE_MASK);
>  	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>  	do {
>  		/*
> @@ -1792,7 +1794,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data
>  {
>  	pte_t *pte = (pte_t *)data;
>  
> -	WARN_ON(size != PAGE_SIZE);
> +	WARN_ON_ONCE(size != PAGE_SIZE);
>  	/*
>  	 * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
>  	 * flag clear because MMU notifiers will have unmapped a huge PMD before
> @@ -1823,7 +1825,7 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *data)
>  	pmd_t *pmd;
>  	pte_t *pte;
>  
> -	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
> +	WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
>  	pmd = stage2_get_pmd(kvm, NULL, gpa);
>  	if (!pmd || pmd_none(*pmd))	/* Nothing there */
>  		return 0;
> @@ -1843,7 +1845,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t gpa, u64 size, void *
>  	pmd_t *pmd;
>  	pte_t *pte;
>  
> -	WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
> +	WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
>  	pmd = stage2_get_pmd(kvm, NULL, gpa);
>  	if (!pmd || pmd_none(*pmd))	/* Nothing there */
>  		return 0;

Cheers,
Suzuki
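[A footnote on the WARN_ON_ONCE(size & ~PAGE_MASK) check added in the patch
above: as discussed earlier in the thread, the stage-2 table walkers advance
addr in fixed page-sized steps and terminate on (addr != end), so an end
that is not page aligned is simply never reached and the walk runs off the
table. The standalone sketch below shows only that failure shape; the loop
body and the iteration cap are illustrative, not the kernel walker.]

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 0x1000UL

int main(void)
{
	uint64_t start = 0x12340000UL;
	uint64_t end   = start + PAGE_SIZE - 0x200; /* unaligned end, as with the KSM bug */
	uint64_t addr  = start;
	unsigned int steps = 0;

	/* Same shape as the pte-level loop: do { ... } while (addr != end). */
	do {
		/* (the real walker would unmap one pte here) */
		addr += PAGE_SIZE;
		steps++;
	} while (addr != end && steps < 8); /* cap added so the sketch terminates */

	printf("addr=0x%llx end=0x%llx after %u steps: %s\n",
	       (unsigned long long)addr, (unsigned long long)end, steps,
	       addr == end ? "terminated"
			   : "overran; the real walker would run off the table");
	return 0;
}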