Date: Tue, 14 Apr 2020 13:06:46 +0100
From: Mel Gorman
To: Huang Ying
Cc: Peter Zijlstra, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Ingo Molnar, Mel Gorman, Rik van Riel, Daniel Jordan, Tejun Heo, Dave Hansen, Tim Chen, Aubrey Li
Subject: Re: [RFC] autonuma: Support to scan page table asynchronously
Message-ID: <20200414120646.GN3818@techsingularity.net>
In-Reply-To: <20200414081951.297676-1-ying.huang@intel.com>

On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> In current AutoNUMA implementation, the page tables of the processes
> are scanned periodically to trigger the NUMA hint page faults. The
> scanning runs in the context of the processes, so will delay the
> running of the processes. In a test with 64 threads pmbench memory
> accessing benchmark on a 2-socket server machine with 104 logical CPUs
> and 256 GB memory, there are more than 20000 latency outliers that are
> > 1 ms in 3600s run time. These latency outliers are almost all
> caused by the AutoNUMA page table scanning. Because they almost all
> disappear after applying this patch to scan the page tables
> asynchronously.
>
> Because there are idle CPUs in system, the asynchronous running page
> table scanning code can run on these idle CPUs instead of the CPUs the
> workload is running on.
>
> So on system with enough idle CPU time, it's better to scan the page
> tables asynchronously to take full advantages of these idle CPU time.
> Another scenario which can benefit from this is to scan the page
> tables on some service CPUs of the socket, so that the real workload
> can run on the isolated CPUs without the latency outliers caused by
> the page table scanning.
>
> But it's not perfect to scan page tables asynchronously too. For
> example, on system without enough idle CPU time, the CPU time isn't
> scheduled fairly because the page table scanning is charged to the
> workqueue thread instead of the process/thread it works for. And
> although the page tables are scanned for the target process, it may
> run on a CPU that is not in the cpuset of the target process.
>
> One possible solution is to let the system administrator to choose the
> better behavior for the system via a sysctl knob (implemented in the
> patch). But it's not perfect too. Because every user space knob adds
> maintenance overhead.
>
> A better solution may be to back-charge the CPU time to scan the page
> tables to the process/thread, and find a way to run the work on the
> proper cpuset. After some googling, I found there's some discussion
> about this as in the following thread,
>
> https://lkml.org/lkml/2019/6/13/1321
>
> So this patch may be not ready to be merged by upstream yet. It
> quantizes the latency outliers caused by the page table scanning in
> AutoNUMA. And it provides a possible way to resolve the issue for
> users who cares about it. And it is a potential customer of the work
> related to the cgroup-aware workqueue or other asynchronous execution
> mechanisms.
>

The caveats you list are the important ones and the reason why it was
not done asynchronously. In an earlier implementation, all the work was
done by a dedicated thread, and that approach was ultimately abandoned.

There is no guarantee that an idle CPU is available, let alone one that
is local to the thread that should be doing the scanning. Even if there
is, it potentially prevents another task from scheduling on an idle CPU
and, similarly, other workqueue tasks may be delayed waiting on the
scanner.

Hiding the cost is also problematic because the CPU cost is mixed with
other unrelated workqueues. It also has the potential to mask bugs.
Let's say, for example, there is a bug whereby a task is scanning
excessively. That can be easily missed when the work is done by a
workqueue.

While it's just an opinion, my preference would be to focus on reducing
the cost and amount of scanning done -- particularly for threads. For
example, all threads operate on the same address space but there can be
significant overlap where all threads are potentially scanning the same
areas or regions that the thread has no interest in. One option would be
to track the highest and lowest pages accessed and only scan within
those regions.
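
As a rough userspace illustration of the window-tracking idea only (this
is not kernel code and not taken from any patch), the sketch below keeps
a per-thread lowest/highest accessed page and clamps a requested scan
range to it. The names AccessWindow, record and clamp are hypothetical,
and 4 KiB pages are assumed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

PAGE_SHIFT = 12  # assumption: 4 KiB pages


@dataclass
class AccessWindow:
    """Lowest and highest page numbers a thread has been seen touching."""
    lo: Optional[int] = None
    hi: Optional[int] = None

    def record(self, addr: int) -> None:
        # Widen the window so it covers the page containing addr.
        page = addr >> PAGE_SHIFT
        self.lo = page if self.lo is None else min(self.lo, page)
        self.hi = page if self.hi is None else max(self.hi, page)

    def clamp(self, start: int, end: int) -> Optional[Tuple[int, int]]:
        # Intersect the page range [start, end) with the window.
        # Returns None when nothing in the range is worth scanning.
        if self.lo is None:
            return None
        lo = max(start, self.lo)
        hi = min(end, self.hi + 1)
        return (lo, hi) if lo < hi else None


w = AccessWindow()
for addr in (0x7f0000005000, 0x7f0000002000, 0x7f0000009fff):
    w.record(addr)

# A scanner asked to cover a 256-page range only walks the tracked window.
scan = w.clamp(0x7f0000000, 0x7f0000100)
print(scan)
```

A single access far outside the usual working set widens the window to
cover everything in between, so the clamp stops saving any work in that
case.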
The tricky part is that library pages may create very wide windows that
render the tracking useless, but it could at least be investigated.

-- 
Mel Gorman
SUSE Labs