From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying, Mel Gorman, Peter Zijlstra, Peter Xu, Johannes Weiner, Vlastimil Babka, "Matthew Wilcox", Will Deacon, Michel Lespinasse, Arjun Roy, "Kirill A. Shutemov"
Subject: [PATCH -V3] NUMA balancing: reduce TLB flush via delaying mapping on hint page fault
Date: Thu, 8 Apr 2021 21:22:36 +0800
Message-Id: <20210408132236.1175607-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.30.2
List-ID: <linux-kernel.vger.kernel.org>

With NUMA balancing, the hint page fault handler migrates the faulting
page to the accessing node if necessary. During the migration, the TLB
is shot down on all CPUs that the process has run on recently, because
the hint page fault handler makes the PTE accessible before the
migration is attempted. The overhead of the TLB shootdown can be high,
so it is better to avoid it if possible. In fact, it can be avoided if
we delay mapping the page until after the migration attempt. That is
what this patch does.

For multi-threaded applications, it is possible that a page is accessed
by multiple threads at almost the same time. In the original
implementation, because the first thread installs the accessible PTE
before migrating the page, the other threads may access the page
directly before the page is made inaccessible again during migration.
With this patch, the second thread goes through the page fault handler
too, and because of the PageLRU() check in the following code path,

  migrate_misplaced_page()
    numamigrate_isolate_page()
      isolate_lru_page()

migrate_misplaced_page() will return 0 and the PTE will be made
accessible in the second thread. This introduces a little more
overhead. But we think the probability of a page being accessed by
multiple threads at the same time is low, and the overhead difference
is not too large. If this becomes a problem in some workloads, we need
to consider how to reduce the overhead.

To test the patch, we ran the following test case on a 2-socket Intel
server (1 NUMA node per socket) with 128GB DRAM (64GB per socket).

1.
Run a memory eater on NUMA node 1 to use 40GB of memory before
running pmbench.

2. Run pmbench (normal access pattern) with 8 processes and 8 threads
per process, so there are 64 threads in total. The working-set size of
each process is 8960MB, so the total working-set size is 8 * 8960MB =
70GB. All pmbench processes are bound to the CPUs of node 1, so the
pmbench processes will access some DRAM on node 0.

3. After the pmbench processes have run for 10 seconds, kill the
memory eater. Now, some pages will be migrated from node 0 to node 1
via NUMA balancing.

Test results show that, with the patch, the pmbench throughput (page
accesses/s) increases by 5.5%. The number of TLB shootdown interrupts
is reduced by 98% (from ~4.7e7 to ~9.7e5), with about 9.2e6 pages
(35.8GB) migrated. From the perf profile, the CPU cycles spent by
try_to_unmap() and its callees decrease from 6.02% to 0.47%. That is,
the CPU cycles spent on TLB shootdown decrease greatly.

Signed-off-by: "Huang, Ying"
Reviewed-by: Mel Gorman
Cc: Peter Zijlstra
Cc: Peter Xu
Cc: Johannes Weiner
Cc: Vlastimil Babka
Cc: "Matthew Wilcox"
Cc: Will Deacon
Cc: Michel Lespinasse
Cc: Arjun Roy
Cc: "Kirill A. Shutemov"
---
 mm/memory.c | 54 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..7e9d4e55089c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4159,29 +4159,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 		goto out;
 	}
 
-	/*
-	 * Make it present again, depending on how arch implements
-	 * non-accessible ptes, some can allow access by kernel mode.
-	 */
-	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	/* Get the normal PTE */
+	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
-	pte = pte_mkyoung(pte);
-	if (was_writable)
-		pte = pte_mkwrite(pte);
-	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
-	update_mmu_cache(vma, vmf->address, vmf->pte);
 
 	page = vm_normal_page(vma, vmf->address, pte);
-	if (!page) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (!page)
+		goto out_map;
 
 	/* TODO: handle PTE-mapped THP */
-	if (PageCompound(page)) {
-		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return 0;
-	}
+	if (PageCompound(page))
+		goto out_map;
 
 	/*
 	 * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
@@ -4191,7 +4179,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	 * pte_dirty has unpredictable behaviour between PTE scan updates,
 	 * background writeback, dirty balancing and application behaviour.
 	 */
-	if (!pte_write(pte))
+	if (!was_writable)
 		flags |= TNF_NO_GROUP;
 
 	/*
@@ -4205,23 +4193,45 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	page_nid = page_to_nid(page);
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
 			&flags);
-	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	if (target_nid == NUMA_NO_NODE) {
 		put_page(page);
-		goto out;
+		goto out_map;
 	}
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
 
 	/* Migrate to the requested node */
 	if (migrate_misplaced_page(page, vma, target_nid)) {
 		page_nid = target_nid;
 		flags |= TNF_MIGRATED;
-	} else
+	} else {
 		flags |= TNF_MIGRATE_FAIL;
+		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+		spin_lock(vmf->ptl);
+		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+			pte_unmap_unlock(vmf->pte, vmf->ptl);
+			goto out;
+		}
+		goto out_map;
+	}
 
 out:
 	if (page_nid != NUMA_NO_NODE)
 		task_numa_fault(last_cpupid, page_nid, 1, flags);
 	return 0;
+out_map:
+	/*
+	 * Make it present again, depending on how arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	pte = pte_modify(old_pte, vma->vm_page_prot);
+	pte = pte_mkyoung(pte);
+	if (was_writable)
+		pte = pte_mkwrite(pte);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	goto out;
 }
 
 static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
-- 
2.30.2