Received: by 2002:a25:1985:0:0:0:0:0 with SMTP id 127csp253028ybz; Fri, 17 Apr 2020 00:07:27 -0700 (PDT) X-Google-Smtp-Source: APiQypIbe29IxgYnfIyc8vUIF2ul33wiLyzuAemjokrXniINiSgOban5wy2qroJOtyh1+w8iHa96 X-Received: by 2002:a17:906:1c97:: with SMTP id g23mr1629492ejh.229.1587107246841; Fri, 17 Apr 2020 00:07:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1587107246; cv=none; d=google.com; s=arc-20160816; b=g8QNK7qDvXYwm9lH0O749iyESKdGmX+9nKJ8QZa8nwz/YlM1+ZHuXD+xrs9r7nT4oc oP04HXRRXyBlIqueKaQy0Q838SOA3RMWi0Ifjx66qTNIRMGyVF8d/EkQLho5LQ3MV1mv Vx7q+52rCM9vFX7XouJDtHkfNLEfejlGdXLtmgTEjW9PfRTSQ42yWIF2YCGRfaeihwgk 8hwCgybtOAxEoAxwQwgWOdvsT6LWdafEgElmCpVgBZYEQP0UFBN6h4LO2eEZ0SZInH9v ZNZCgA7ZhGvZ+IM9BEcwIDFJGc7GrwDL3qwLL73PLFr3b1dmDbrDFRQJTAXs7QpPGoDH IJ4w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:in-reply-to:message-id:date :subject:cc:to:from:ironport-sdr:dkim-signature; bh=XOqBKAS6mZdabqMz+4mSgKnmFRPA/0mQeLYHxuXQFyM=; b=Kdl4hkUCXkOl6WJcZDMPK6RNthNEySxU8CKnVN50TQeO8nWfykFj/mwgdkBVtCJuSO 4lS8LIyibY43il4NPFchsdeeLx0vvkETZOfMRGlVTbDhG1NoCs7G2jsU3Wgr+/yT2IVn l1cTjbUwL1i4RGreinjxY5FO4HcrldtATyIC49mOigJJaDuGlpZ2V6cIcWi7codycNbF /K8csRemZw25Nb5Lf1jLplAcVbweaWwzxrWhoJSSrmlVmO6nH3kpmOYfPSo95EvSMi6w 0HPTdWvaIXIl4eBF7KVdm8Vkipd831pj52lc+Vy6pgHRNx/fR/0GrHNsH2Mix8CtPRa5 obLg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b=c3aVoAsw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id b4si8057328ejj.77.2020.04.17.00.07.03; Fri, 17 Apr 2020 00:07:26 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@amazon.com header.s=amazon201209 header.b=c3aVoAsw; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=amazon.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728575AbgDQHGA (ORCPT + 99 others); Fri, 17 Apr 2020 03:06:00 -0400 Received: from smtp-fw-4101.amazon.com ([72.21.198.25]:1830 "EHLO smtp-fw-4101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728338AbgDQHF7 (ORCPT ); Fri, 17 Apr 2020 03:05:59 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1587107159; x=1618643159; h=from:to:cc:subject:date:message-id:in-reply-to: mime-version; bh=XOqBKAS6mZdabqMz+4mSgKnmFRPA/0mQeLYHxuXQFyM=; b=c3aVoAswThsxNL12muIji5Zap0A3sHXX4qzo03WX3hte4K4DCMEisJBL W8AqRytyX8AofuJgAmd5XOsCcka9DqNML/vNGDineVpoU/pVGPzzJbTCR sAk2ExbDpUwRN5KfiElnWLB1x/NJghtDkmP8/mdv52oTxduHEhv2uUmRO c=; IronPort-SDR: 2e3i0skHvYdIwTWJpLGKbaJjT2XI2on6Twzq3YSZF1yb62BIrI+tS31A0OAOzSH6ZcJjYkZaev WD1PR4VMlEog== X-IronPort-AV: E=Sophos;i="5.72,394,1580774400"; d="scan'208";a="26002062" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-2a-1c1b5cdd.us-west-2.amazon.com) ([10.43.8.6]) by smtp-border-fw-out-4101.iad4.amazon.com with ESMTP; 17 Apr 2020 07:05:39 +0000 Received: from EX13MTAUEA002.ant.amazon.com (pdx4-ws-svc-p6-lb7-vlan2.pdx.amazon.com [10.170.41.162]) by email-inbound-relay-2a-1c1b5cdd.us-west-2.amazon.com (Postfix) with ESMTPS id E8F2AA2212; Fri, 17 Apr 2020 07:05:36 +0000 (UTC) Received: from EX13D31EUA001.ant.amazon.com (10.43.165.15) by EX13MTAUEA002.ant.amazon.com (10.43.61.77) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Fri, 17 Apr 2020 07:05:36 +0000 Received: from u886c93fd17d25d.ant.amazon.com (10.43.162.239) by EX13D31EUA001.ant.amazon.com (10.43.165.15) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Fri, 17 Apr 2020 07:05:30 +0000 From: SeongJae Park To: "Huang, Ying" CC: Mel Gorman , Peter Zijlstra , , , Ingo Molnar , Mel Gorman , Rik van Riel , Daniel Jordan , Tejun Heo , Dave Hansen , Tim Chen , Aubrey Li Subject: Re: Re: [RFC] autonuma: Support to scan page table asynchronously Date: Fri, 17 Apr 2020 09:05:08 +0200 Message-ID: <20200417070508.32243-1-sjpark@amazon.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <87eespyxld.fsf@yhuang-dev.intel.com> (raw) MIME-Version: 1.0 Content-Type: text/plain X-Originating-IP: [10.43.162.239] X-ClientProxiedBy: EX13D40UWC003.ant.amazon.com (10.43.162.246) To EX13D31EUA001.ant.amazon.com (10.43.165.15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 15 Apr 2020 16:14:38 +0800 "Huang\, Ying" wrote: > Mel Gorman writes: > > > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote: > >> In current AutoNUMA implementation, the page tables of the processes > >> are scanned periodically to trigger the NUMA hint page faults. The > >> scanning runs in the context of the processes, so will delay the > >> running of the processes. In a test with 64 threads pmbench memory > >> accessing benchmark on a 2-socket server machine with 104 logical CPUs > >> and 256 GB memory, there are more than 20000 latency outliers that are > >> > 1 ms in 3600s run time. These latency outliers are almost all > >> caused by the AutoNUMA page table scanning. Because they almost all > >> disappear after applying this patch to scan the page tables > >> asynchronously. > >> > >> Because there are idle CPUs in system, the asynchronous running page > >> table scanning code can run on these idle CPUs instead of the CPUs the > >> workload is running on. > >> > >> So on system with enough idle CPU time, it's better to scan the page > >> tables asynchronously to take full advantages of these idle CPU time. > >> Another scenario which can benefit from this is to scan the page > >> tables on some service CPUs of the socket, so that the real workload > >> can run on the isolated CPUs without the latency outliers caused by > >> the page table scanning. > >> > >> But it's not perfect to scan page tables asynchronously too. For > >> example, on system without enough idle CPU time, the CPU time isn't > >> scheduled fairly because the page table scanning is charged to the > >> workqueue thread instead of the process/thread it works for. And > >> although the page tables are scanned for the target process, it may > >> run on a CPU that is not in the cpuset of the target process. > >> > >> One possible solution is to let the system administrator to choose the > >> better behavior for the system via a sysctl knob (implemented in the > >> patch). But it's not perfect too. Because every user space knob adds > >> maintenance overhead. > >> > >> A better solution may be to back-charge the CPU time to scan the page > >> tables to the process/thread, and find a way to run the work on the > >> proper cpuset. After some googling, I found there's some discussion > >> about this as in the following thread, > >> > >> https://lkml.org/lkml/2019/6/13/1321 > >> > >> So this patch may be not ready to be merged by upstream yet. It > >> quantizes the latency outliers caused by the page table scanning in > >> AutoNUMA. And it provides a possible way to resolve the issue for > >> users who cares about it. And it is a potential customer of the work > >> related to the cgroup-aware workqueue or other asynchronous execution > >> mechanisms. > >> > > > > The caveats you list are the important ones and the reason why it was > > not done asynchronously. In an earlier implementation all the work was > > done by a dedicated thread and ultimately abandoned. > > > > There is no guarantee there is an idle CPU available and one that is > > local to the thread that should be doing the scanning. Even if there is, > > it potentially prevents another task from scheduling on an idle CPU and > > similarly other workqueue tasks may be delayed waiting on the scanner. The > > hiding of the cost is also problematic because the CPU cost is hidden > > and mixed with other unrelated workqueues. It also has the potential > > to mask bugs. Lets say for example there is a bug whereby a task is > > scanning excessively, that can be easily missed when the work is done by > > a workqueue. > > Do you think something like cgroup-aware workqueue is a solution deserve > to be tried when it's available? It will not hide the scanning cost, > because the CPU time will be charged to the original cgroup or task. > Although the other tasks may be disturbed, cgroup can provide some kind > of management via cpusets. > > > While it's just an opinion, my preference would be to focus on reducing > > the cost and amount of scanning done -- particularly for threads. For > > example, all threads operate on the same address space but there can be > > significant overlap where all threads are potentially scanning the same > > areas or regions that the thread has no interest in. One option would be > > to track the highest and lowest pages accessed and only scan within > > those regions for example. The tricky part is that library pages may > > create very wide windows that render the tracking useless but it could > > at least be investigated. > > In general, I think it's good to reduce the scanning cost. I think the main idea of DAMON[1] might be able to applied here. Have you considered it? [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@amazon.com/ Thanks, SeongJae Park > > Why do you think there will be overlap between the threads of a process? > If my understanding were correctly, the threads will scan one by one > instead of simultaneously. And how to determine whether a vma need to > be scanned or not? For example, there may be only a small portion of > pages been accessed in a vma, but they may be accessed remotely and > consumes quite some inter-node bandwidth, so need to be migrated. > > Best Regards, > Huang, Ying >