From: Henry Willard
To: akpm@linux-foundation.org
Cc: mgorman@suse.de, kstewart@linuxfoundation.org, zi.yan@cs.rutgers.edu,
    pombredanne@nexb.com, aarcange@redhat.com, gregkh@linuxfoundation.org,
    aneesh.kumar@linux.vnet.ibm.com, kirill.shutemov@linux.intel.com,
    jglisse@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm: numa: Do not trap faults on shared data section pages.
Date: Tue, 23 Jan 2018 15:53:37 -0800
Message-Id: <1516751617-7369-1-git-send-email-henry.willard@oracle.com>

Workloads consisting of a large number of processes running the same
program with a very large shared data segment may experience performance
problems when NUMA balancing attempts to migrate the shared COW pages.
This manifests itself as many processes or tasks in TASK_UNINTERRUPTIBLE
state waiting for the shared pages to be migrated.

The program listed below simulates these conditions. Run with 288
processes on a 144 core/8 socket machine, it gives these results:

Average throughput     Average throughput     Average throughput
with numa_balancing=0  with numa_balancing=1  with numa_balancing=1
                       without the patch      with the patch
---------------------  ---------------------  ---------------------
2118782                2021534                2107979

Complex production environments show less variability and fewer poorly
performing outliers, accompanied by a smaller number of processes waiting
on NUMA page migration, with this patch applied. In some cases, %iowait
drops from 16%-26% to 0.

  // SPDX-License-Identifier: GPL-2.0
  /*
   * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/time.h>
  #include <sys/mman.h>
  #include <sys/wait.h>

  /* Large initialized array: lives in the data section, shared COW after fork() */
  int a[1000000] = {13};

  int main(int argc, const char **argv)
  {
  	int n = 0;
  	int i;
  	pid_t pid;
  	int stat;
  	int *count_array;
  	int cpu_count = 288;
  	long total = 0;

  	/* t2 starts out holding the run time in seconds (default 10) */
  	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

  	if (argc > 2)
  		cpu_count = atoi(argv[2]);

  	/* One iteration counter per child, shared with the parent */
  	count_array = mmap(NULL, cpu_count * sizeof(int),
  			   (PROT_READ|PROT_WRITE),
  			   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

  	if (count_array == MAP_FAILED) {
  		perror("mmap:");
  		return 0;
  	}

  	for (i = 0; i < cpu_count; ++i) {
  		pid = fork();
  		if (pid <= 0)
  			break;
  		if ((i & 0xf) == 0)
  			usleep(2);
  	}

  	if (pid != 0) {
  		/* Parent: reap all children, then sum their counters */
  		if (i == 0) {
  			perror("fork:");
  			return 0;
  		}

  		for (;;) {
  			pid = wait(&stat);
  			if (pid < 0)
  				break;
  		}

  		for (i = 0; i < cpu_count; ++i)
  			total += count_array[i];

  		printf("Total %ld\n", total);
  		munmap(count_array, cpu_count * sizeof(int));
  		return 0;
  	}

  	/* Child: read the whole array repeatedly until the deadline */
  	gettimeofday(&t1, 0);
  	timeradd(&t1, &t2, &t1);
  	while (timercmp(&t2, &t1, <)) {
  		int b = 0;
  		int j;

  		for (j = 0; j < 1000000; j++)
  			b += a[j];
  		gettimeofday(&t2, 0);
  		n++;
  	}
  	count_array[i] = n;
  	return 0;
  }

This patch changes change_pte_range() to skip shared copy-on-write pages
when called from change_prot_numa().
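
As a rough user-space illustration only (this is not the kernel code), the
new condition can be modeled as below. The two flag values and the body of
is_cow_mapping() mirror include/linux/mm.h and are copied here just so the
sketch compiles outside the kernel. numa_skip_shared_cow() is a
hypothetical name for the combined test, and its mapcount argument stands
in for the value page_mapcount(page) would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in include/linux/mm.h, duplicated for user space. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* Same test as the kernel helper: a private mapping that may be written. */
static inline bool is_cow_mapping(unsigned long vm_flags)
{
	return (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/*
 * Hypothetical helper for this sketch: returns true when the NUMA hinting
 * pass would leave the PTE alone under the check added by this patch,
 * i.e. the VMA is copy-on-write and the page is not mapped exactly once.
 */
static inline bool numa_skip_shared_cow(unsigned long vm_flags, int mapcount)
{
	assert(mapcount >= 0);	/* precondition of the sketch */
	return is_cow_mapping(vm_flags) && mapcount != 1;
}
```

Under this model, a data section page still shared by all 288 forked
processes is skipped, while a page that one process has already written
(and therefore maps alone) remains eligible for NUMA hinting faults.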

NOTE: change_prot_numa() is nominally called from task_numa_work() and
queue_pages_test_walk(). task_numa_work() is the auto NUMA balancing path,
and queue_pages_test_walk() is part of explicit NUMA policy management.
However, queue_pages_test_walk() only calls change_prot_numa() when
MPOL_MF_LAZY is specified, which is currently not allowed, so
change_prot_numa() is only called from auto NUMA balancing. In the case of
explicit NUMA policy management, shared pages are not migrated unless
MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL depends on
CAP_SYS_NICE. Currently, there is no way to pass information about
MPOL_MF_MOVE_ALL to change_pte_range(). This will have to be fixed if
MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
migration mode.

task_numa_work() skips the read-only VMAs of programs and shared libraries.

V2:
- Combined patch and cover letter
- Added note about applicability of MPOL_MF_MOVE_ALL

Signed-off-by: Henry Willard
Reviewed-by: Håkon Bugge
Reviewed-by: Steve Sistare steven.sistare@oracle.com
Acked-by: Mel Gorman
---
 mm/mprotect.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec39f730a0bf..fbbb3ab70818 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (!page || PageKsm(page))
 					continue;
 
+				/* Also skip shared copy-on-write pages */
+				if (is_cow_mapping(vma->vm_flags) &&
+				    page_mapcount(page) != 1)
+					continue;
+
 				/* Avoid TLB flush if possible */
 				if (pte_protnone(oldpte))
 					continue;
-- 
1.8.3.1