From: Ben Gardon
Date: Wed, 28 Apr 2021 09:22:12 -0700
Subject: Re: [RFC PATCH v2 2/2] KVM: x86: Not wr-protect huge page with init_all_set dirty log
To: zhukeqian
Cc: LKML, kvm, Paolo Bonzini, Sean Christopherson, "Wanghaibin (D)"
In-Reply-To: <60894846.1c69fb81.6e765.161bSMTPIN_ADDED_BROKEN@mx.google.com>
References: <20210416082511.2856-1-zhukeqian1@huawei.com>
 <20210416082511.2856-3-zhukeqian1@huawei.com>
 <49e6bf4f-0142-c9ea-a8c1-7cfe211c8d7b@huawei.com>
 <4f71baed-544b-81b2-dfa6-f04016966a5a@huawei.com>
 <60894846.1c69fb81.6e765.161bSMTPIN_ADDED_BROKEN@mx.google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 28, 2021 at 4:34 AM zhukeqian wrote:
>
> Oh, I have to correct myself.
>
> Without this opt:
>   first round of dirtying: write fault and split large mapping
>   second round: write fault
>
> With this opt:
>   first round of dirtying: no write fault
>   second round: write fault and split large mapping
>
> The total test time is expected to be reduced.

Oh yeah, good point. So we should really see the savings in the first-round
dirty memory time. Good catch.

>
> On 2021/4/28 0:33, Ben Gardon wrote:
> > On Mon, Apr 26, 2021 at 10:04 PM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>
> >> Hi Ben,
> >>
> >> Sorry for the delayed reply!
> >>
> >> On 2021/4/21 0:30, Ben Gardon wrote:
> >>> On Tue, Apr 20, 2021 at 12:49 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>>>
> >>>> Hi Ben,
> >>>>
> >>>> On 2021/4/20 3:20, Ben Gardon wrote:
> >>>>> On Fri, Apr 16, 2021 at 1:25 AM Keqian Zhu <zhukeqian1@huawei.com> wrote:
> >>>>>>
> >>>>>> Currently, during the start of dirty logging, if we're with init-all-set,
> >>>>>> we write-protect huge pages and leave normal pages untouched, so that
> >>>>>> we can enable dirty logging for these pages lazily.
> >>>>>>
> >>>>>> Actually, enabling dirty logging lazily for huge pages is feasible too,
> >>>>>> which not only reduces the time to start dirty logging but also greatly
> >>>>>> reduces the side effects on the guest when there is a high dirty rate.
> >>>>>>
> >>>>>> Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
> >>>>>> ---
> >>>>>>  arch/x86/kvm/mmu/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++----
> >>>>>>  arch/x86/kvm/x86.c     | 37 +++++++++-----------------------
> >>>>>>  2 files changed, 54 insertions(+), 31 deletions(-)
> >>>>>>
> >>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >>>>>> index 2ce5bc2ea46d..98fa25172b9a 100644
> >>>>>> --- a/arch/x86/kvm/mmu/mmu.c
> >>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
> >>>>>> @@ -1188,8 +1188,7 @@ static bool __rmap_clear_dirty(struct kvm *kvm, struct kvm_rmap_head *rmap_head,
> >>>>>>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> >>>>>>   * @mask: indicates which pages we should protect
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * Used when we do not need to care about huge page mappings.
> >>>>>>   */
> >>>>>>  static void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >>>>>>                                              struct kvm_memory_slot *slot,
> >>>>>> @@ -1246,13 +1245,54 @@ static void kvm_mmu_clear_dirty_pt_masked(struct kvm *kvm,
> >>>>>>   * It calls kvm_mmu_write_protect_pt_masked to write protect selected pages to
> >>>>>>   * enable dirty logging for them.
> >>>>>>   *
> >>>>>> - * Used when we do not need to care about huge page mappings: e.g. during dirty
> >>>>>> - * logging we do not have any such mappings.
> >>>>>> + * We need to care about huge page mappings: e.g. during dirty logging we may
> >>>>>> + * have any such mappings.
> >>>>>>   */
> >>>>>>  void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> >>>>>>                                 struct kvm_memory_slot *slot,
> >>>>>>                                 gfn_t gfn_offset, unsigned long mask)
> >>>>>>  {
> >>>>>> +       gfn_t start, end;
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * Huge pages are NOT write protected when we start dirty log with
> >>>>>> +        * init-all-set, so we must write protect them at here.
> >>>>>> +        *
> >>>>>> +        * The gfn_offset is guaranteed to be aligned to 64, but the base_gfn
> >>>>>> +        * of memslot has no such restriction, so the range can cross two large
> >>>>>> +        * pages.
> >>>>>> +        */
> >>>>>> +       if (kvm_dirty_log_manual_protect_and_init_set(kvm)) {
> >>>>>> +               start = slot->base_gfn + gfn_offset + __ffs(mask);
> >>>>>> +               end = slot->base_gfn + gfn_offset + __fls(mask);
> >>>>>> +               kvm_mmu_slot_gfn_write_protect(kvm, slot, start, PG_LEVEL_2M);
> >>>>>> +
> >>>>>> +               /* Cross two large pages? */
> >>>>>> +               if (ALIGN(start << PAGE_SHIFT, PMD_SIZE) !=
> >>>>>> +                   ALIGN(end << PAGE_SHIFT, PMD_SIZE))
> >>>>>> +                       kvm_mmu_slot_gfn_write_protect(kvm, slot, end,
> >>>>>> +                                                      PG_LEVEL_2M);
> >>>>>> +       }
> >>>>>> +
> >>>>>> +       /*
> >>>>>> +        * RFC:
> >>>>>> +        *
> >>>>>> +        * 1. I don't return early when kvm_mmu_slot_gfn_write_protect() returns
> >>>>>> +        * true, because I am not very clear about the relationship between
> >>>>>> +        * legacy mmu and tdp mmu. AFAICS, the code logic is NOT an if/else
> >>>>>> +        * manner.
> >>>>>> +        *
> >>>>>> +        * The kvm_mmu_slot_gfn_write_protect() returns true when we hit a
> >>>>>> +        * writable large page mapping in legacy mmu mapping or tdp mmu mapping.
> >>>>>> +        * Do we still have normal mapping in that case? (e.g. We have large
> >>>>>> +        * mapping in legacy mmu and normal mapping in tdp mmu).
> >>>>>
> >>>>> Right, we can't return early because the two MMUs could map the page
> >>>>> in different ways, but each MMU could also map the page in multiple
> >>>>> ways independently.
> >>>>> For example, if the legacy MMU was being used and we were running a
> >>>>> nested VM, a page could be mapped 2M in EPT01 and 4K in EPT02, so we'd
> >>>>> still need kvm_mmu_slot_gfn_write_protect calls for both levels.
> >>>>> I don't think there's a case where we can return early here with the
> >>>>> information that the first calls to kvm_mmu_slot_gfn_write_protect
> >>>>> access.
> >>>> Thanks for the detailed explanation.
> >>>>
> >>>>>
> >>>>>> + *
> >>>>>> + * 2. kvm_mmu_slot_gfn_write_protect() doesn't tell us whether the large
> >>>>>> + * page mapping exist. If it exists but is clean, we can return early.
> >>>>>> + * However, we have to do invasive change.
> >>>>>
> >>>>> What do you mean by invasive change?
> >>>> We need kvm_mmu_slot_gfn_write_protect to report whether all mappings are large
> >>>> and clean, so that we can return early. However, that's not part of the semantics
> >>>> of this function.
> >>>>
> >>>> If this is the final code, compared to the old code we have an extra
> >>>> gfn_write_protect(); I don't know whether that's acceptable.
> >>>
> >>> Ah, I see. Please correct me if I'm wrong, but I think that in order
> >>> to check that the only mappings on the GFN range are large, we'd still
> >>> have to go over the rmap for the 4k mappings, at least for the legacy
> >>> MMU. In that case, we're doing about as much work as the extra
> >>> gfn_write_protect and I don't think that we'd get any efficiency gain
> >>> for the change in semantics.
> >>>
> >>> Likewise for the TDP MMU, if the GFN range is mapped both large and
> >>> 4k, it would have to be in different TDP structures, so the efficiency
> >>> gains would again not be very big.
> >> I am not familiar with the MMU virtualization of the x86 arch, but I think
> >> you are right.
> >>
> >>>
> >>> I'm really just guessing about those performance characteristics
> >>> though. It would definitely help to have some performance data to back
> >>> all this up. Even just a few runs of the dirty_log_perf_test (in
> >>> selftests) could provide some interesting results, and I'd be happy to
> >>> help review any improvements you might make to that test.
> >>>
> >>> Regardless, I'd be inclined to keep this change as simple as possible
> >>> for now, and the early return optimization could happen in a follow-up
> >>> patch. I think the extra gfn_write_protect is acceptable, especially
> >>> if you can show that it doesn't cause a big hit in performance when
> >>> running the dirty_log_perf_test with 4k and 2m backing memory.
> >> I tested it using dirty_log_perf_test; the results show that the performance
> >> of clear_dirty_log differs by less than 2%.
> >
> > I think there are a couple of obstacles which make the stock
> > dirty_log_perf_test less useful for measuring this optimization.
> >
> > 1. Variance between runs
> > With only 16 vCPUs and whatever the associated default guest memory
> > size is, random system events and daemons introduce a lot of variance,
> > at least in my testing. I usually try to run the biggest VM I can to
> > smooth that out, but even with a 96 vCPU VM, a 2% difference is often
> > not statistically significant. CPU pinning for the vCPU threads would
> > help a lot to reduce variance. I don't remember if anyone has
> > implemented this yet.
> Yes, this makes sense.
>
> >
> > 2. The guest dirty pattern
> > By default, each guest vCPU will dirty its entire partition of guest
> > memory on each iteration.
> > This means that instead of amortizing out
> > the cost of write-protecting and splitting large pages, we simply move
> > the burden later in the process. I see you didn't include the time for
> > each iteration below, but I would expect this patch to move some of
> > the time from "Enabling dirty logging time" and "Dirtying memory time"
> > for pass 1 to "Clear dirty log time" and "Dirtying memory time" for
> > pass 2. I wouldn't expect the total time over 5 iterations to change
> > for this test.
> If we have large page mappings and use this optimization, the "Enabling dirty logging
> time" and the first-round "Dirtying memory time" will be greatly reduced.
>
> However, I don't think the other times (dirty_memory except for the first round,
> get_log, clear_log) are expected to change compared to w/o the optimization, because
> after the first round of "Dirtying memory" all mappings have been split into normal
> mappings, so the situation is the same as w/o this optimization.
>
> Maybe I missed something?
>
> >
> > It would probably also serve us well to have some kind of "hot" subset
> > of memory for each vCPU, since some of the benefit of lazy large page
> > splitting depends on that access pattern.
> >
> > 3. Lockstep dirtying and dirty log collection
> > While this test is currently great for timing dirty logging
> > operations, it's not great for trickier analysis, especially
> > reductions in guest degradation. In order to measure that, we'd need to
> > change the test to collect the dirty log as quickly as possible,
> > independent of what the guest is doing, and then also record how much
> > "progress" the guest is able to make while all that is happening.
> Yes, makes sense.
>
> Does the "dirty log collection" contain the "dirty log clear"? As I understand it, the
> dirty log collection is very fast, just some memory copying. But for the "dirty log
> clear", we have to modify mappings and perform TLBI, which takes much longer.

Yeah, sorry. By dirty log collection I meant get + clear, since the test
does both before it waits for the guest to dirty all memory again.

> >
> > I'd be happy to help review any improvements to the test which you
> > feel like making.
> Thanks, Ben. Emm... I feel very sorry, but perhaps I don't have enough time to do this;
> a lot of other work is queued...
> On the other hand, I think the "Dirtying memory time" of the first round can show us
> the optimization.

No worries, I think this is a good patch either way. No need to block
on test improvements, from my perspective.

> >
> >>
> >> *Without this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.105203579s
> >> Enabling dirty logging time: 0.000323444s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000595033s. (Avg 0.000119006s/iteration)
> >> Clear dirty log over 5 iterations took 0.713212922s. (Avg 0.142642584s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.922764235s
> >> Enabling dirty logging time: 0.000316473s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000485459s. (Avg 0.000097091s/iteration)
> >> Clear dirty log over 5 iterations took 0.603749670s. (Avg 0.120749934s/iteration)
> >>
> >>
> >> *With this patch*
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244515198s
> >> Enabling dirty logging time: 0.000280207s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000484953s. (Avg 0.000096990s/iteration)
> >> Clear dirty log over 5 iterations took 0.727620114s. (Avg 0.145524022s/iteration)
> >>
> >> ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb
> >>
> >> Testing guest mode: PA-bits:ANY, VA-bits:48, 4K pages
> >> guest physical test memory offset: 0xffbfffff000
> >> Populate memory time: 3.244294061s
> >> Enabling dirty logging time: 0.000273590s
> >> [...]
> >> Get dirty log over 5 iterations took 0.000474244s. (Avg 0.000094848s/iteration)
> >> Clear dirty log over 5 iterations took 0.600593672s. (Avg 0.120118734s/iteration)
> >>
> >>
> >> I faced a problem: there is no huge page mapping when I test with
> >> "-s anonymous_hugetlb", whether TDP is enabled or disabled.
> >
> > Do you mean that even before dirty logging was enabled, KVM didn't
> > create any large mappings? That's odd. I would assume the backing
> > memory allocation would just fail if there aren't enough hugepages
> > available.
> It's odd indeed. I can see large mappings when I do a normal migration, but I
> don't see large mappings when running this test.
>
> I have verified that the time of "clear dirty log" is not affected; what about
> sending a formal patch?

That sounds good to me.

>
> Thanks,
> Keqian
>
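
As an aside on the variance point above, a minimal sketch of the CPU pinning idea
(an assumed approach, not something dirty_log_perf_test implements itself) is to
confine the whole selftest process, and therefore its vCPU threads, to a fixed set
of host CPUs with taskset. This is coarser than per-vCPU-thread pinning, but it
keeps the threads from migrating across the machine between runs. The CPU list is
only an example; the test flags are the ones used above.

  # confine the test's threads to CPUs 0-15 before comparing runs
  taskset -c 0-15 ./dirty_log_perf_test -i 5 -v 16 -s anonymous_hugetlb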