From: Wanpeng Li
To: linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Cc: Paolo Bonzini, Radim Krčmář
Subject: [PATCH RESEND] KVM: MMU: Introduce single thread to zap collapsible sptes
Date: Wed, 9 Jan 2019 08:47:33 +0800
Message-Id: <1546994853-22302-1-git-send-email-wanpengli@tencent.com>

From: Wanpeng Li

Last year, engineers from Huawei reported that a call to
memory_global_dirty_log_start/stop() takes 13 seconds for a guest with
4 TB of memory and freezes the guest for so long that the live
migration downtime becomes unacceptable. [1] [2]

Guangrong pointed out:

| collapsible_sptes zaps 4k mappings to make memory-read happy, it is not
| required by the semanteme of KVM_SET_USER_MEMORY_REGION and it is not
| urgent for vCPU's running, it could be done in a separate thread and use
| lock-break technology.
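The "lock-break technology" mentioned above is the kernel's usual
pattern of dropping and retaking a contended spinlock at safe points so
that a long walk does not stall vCPUs or trip the soft-lockup detector.
A minimal sketch of that pattern follows; it is illustrative only, and
more_sptes_to_zap()/zap_one_batch() are hypothetical helpers standing
in for the real loop in the patch below:

static void walk_with_lock_break(struct kvm *kvm)
{
	spin_lock(&kvm->mmu_lock);
	while (more_sptes_to_zap(kvm)) {	/* hypothetical predicate */
		zap_one_batch(kvm);		/* hypothetical helper */
		/* Drop the lock, reschedule if needed, then retake it. */
		if (need_resched() || spin_needbreak(&kvm->mmu_lock))
			cond_resched_lock(&kvm->mmu_lock);
	}
	spin_unlock(&kvm->mmu_lock);
}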
Guests with several terabytes of memory are common now that NVDIMM is
deployed in cloud environments. This patch uses a delayed worker to zap
the collapsible sptes, so that small sptes are lazily collapsed back
into large sptes during the rollback after a live migration fails.

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-04/msg05249.html
[2] https://www.mail-archive.com/qemu-devel@nongnu.org/msg449994.html

Cc: Paolo Bonzini
Cc: Radim Krčmář
Signed-off-by: Wanpeng Li
---
Note: I considered adding a list of memslots to be zapped, with entries
added in kvm_mmu_zap_collapsible_sptes() and removed by the worker.
However, there are a lot of races there: a memslot can disappear or be
modified underneath before the worker thread starts zapping, even with
a lock protecting the list. This patch instead delays the worker thread
by 60 seconds (by which point memory_global_dirty_log_stop() can be
assumed to have completed) so that all the zap requests issued after a
failed live migration are coalesced into a single pass.
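For context, here is a rough sketch of the rejected list-based design,
using hypothetical names (zap_request, zap_list, zap_list_lock) that do
not appear in the patch, to show where the race sits:

static LIST_HEAD(zap_list);			/* hypothetical */
static DEFINE_SPINLOCK(zap_list_lock);		/* hypothetical */

struct zap_request {
	struct list_head link;
	struct kvm_memory_slot *slot;	/* may be freed/moved underneath */
};

static void queue_zap(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	struct zap_request *req = kzalloc(sizeof(*req), GFP_KERNEL);

	if (!req)
		return;
	req->slot = slot;
	spin_lock(&zap_list_lock);
	list_add_tail(&req->link, &zap_list);
	spin_unlock(&zap_list_lock);
	/*
	 * Race: before the worker drains the list,
	 * KVM_SET_USER_MEMORY_REGION can delete or move *slot, leaving a
	 * dangling pointer queued; the lock protects the list itself,
	 * not the memslot's lifetime.
	 */
}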
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/mmu.c              | 37 +++++++++++++++++++++++++++++-----
 arch/x86/kvm/x86.c              |  4 ++++
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fbda5a9..dde32f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -892,6 +892,8 @@ struct kvm_arch {
 	u64 master_cycle_now;
 	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
+	struct delayed_work kvm_mmu_zap_collapsible_sptes_work;
+	bool zap_in_progress;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
 
@@ -1247,6 +1249,7 @@ void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, struct kvm_memslots *slots);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
 void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages);
+void zap_collapsible_sptes_fn(struct work_struct *work);
 
 int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3);
 bool pdptrs_changed(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 7c03c0f..fe87dd3 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -5679,14 +5679,41 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	return need_tlb_flush;
 }
 
+void zap_collapsible_sptes_fn(struct work_struct *work)
+{
+	struct kvm_memory_slot *memslot;
+	struct kvm_memslots *slots;
+	struct delayed_work *dwork = to_delayed_work(work);
+	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
+					   kvm_mmu_zap_collapsible_sptes_work);
+	struct kvm *kvm = container_of(ka, struct kvm, arch);
+	int i;
+
+	mutex_lock(&kvm->slots_lock);
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		spin_lock(&kvm->mmu_lock);
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(memslot, slots) {
+			slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
+					 kvm_mmu_zap_collapsible_spte, true);
+			if (need_resched() || spin_needbreak(&kvm->mmu_lock))
+				cond_resched_lock(&kvm->mmu_lock);
+		}
+		spin_unlock(&kvm->mmu_lock);
+	}
+	kvm->arch.zap_in_progress = false;
+	mutex_unlock(&kvm->slots_lock);
+}
+
+#define KVM_MMU_ZAP_DELAYED (60 * HZ)
 void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 				   const struct kvm_memory_slot *memslot)
 {
-	/* FIXME: const-ify all uses of struct kvm_memory_slot. */
-	spin_lock(&kvm->mmu_lock);
-	slot_handle_leaf(kvm, (struct kvm_memory_slot *)memslot,
-			 kvm_mmu_zap_collapsible_spte, true);
-	spin_unlock(&kvm->mmu_lock);
+	if (!kvm->arch.zap_in_progress) {
+		kvm->arch.zap_in_progress = true;
+		schedule_delayed_work(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
+			KVM_MMU_ZAP_DELAYED);
+	}
 }
 
 void kvm_mmu_slot_leaf_clear_dirty(struct kvm *kvm,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d029377..c2af289 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9019,6 +9019,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
+	INIT_DELAYED_WORK(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work,
+			  zap_collapsible_sptes_fn);
+	kvm->arch.zap_in_progress = false;
 
 	kvm_hv_init_vm(kvm);
 	kvm_page_track_init(kvm);
@@ -9064,6 +9067,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 {
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
+	cancel_delayed_work_sync(&kvm->arch.kvm_mmu_zap_collapsible_sptes_work);
 	kvm_free_pit(kvm);
 }
 
-- 
2.7.4