Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
IronPort-SDR: qINdq/1g+u4vad8mEQSvJ7SvJYWFoWFM3Qyp8n05sbmCMEGrjl248RS2UhZ58Hf3AXayVPvYfR
 ODrsPe3zxmMw==
IronPort-SDR: vQtDQBeAxJ8iTSsboqe294WKrUKBTv38qRDUS8kY74WNtIPJna71MNpoXdK4HxmF6gXiuz40rn
 pAVwgS8aO3kA==
From:   "Huang\, Ying" <ying.huang@intel.com>
To:     Mel Gorman <mgorman@techsingularity.net>
Cc:     Peter Zijlstra <peterz@infradead.org>, <linux-mm@kvack.org>,
        <linux-kernel@vger.kernel.org>, Ingo Molnar <mingo@redhat.com>,
        Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@surriel.com>,
        Daniel Jordan <daniel.m.jordan@oracle.com>,
        Tejun Heo <tj@kernel.org>, Dave Hansen <dave.hansen@intel.com>,
        Tim Chen <tim.c.chen@intel.com>,
        Aubrey Li <aubrey.li@intel.com>
Subject: Re: [RFC] autonuma: Support to scan page table asynchronously
References: <20200414081951.297676-1-ying.huang@intel.com>
        <20200414120646.GN3818@techsingularity.net>
Date:   Wed, 15 Apr 2020 16:14:38 +0800
In-Reply-To: <20200414120646.GN3818@techsingularity.net> (Mel Gorman's message
        of "Tue, 14 Apr 2020 13:06:46 +0100")
Message-ID: <87eespyxld.fsf@yhuang-dev.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

Mel Gorman <mgorman@techsingularity.net> writes:

> On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
>> In current AutoNUMA implementation, the page tables of the processes
>> are scanned periodically to trigger the NUMA hint page faults.  The
>> scanning runs in the context of the processes, so will delay the
>> running of the processes.  In a test with 64 threads pmbench memory
>> accessing benchmark on a 2-socket server machine with 104 logical CPUs
>> and 256 GB memory, there are more than 20000 latency outliers that are
>> > 1 ms in 3600s run time.  These latency outliers are almost all
>> caused by the AutoNUMA page table scanning.  Because they almost all
>> disappear after applying this patch to scan the page tables
>> asynchronously.
>> 
>> Because there are idle CPUs in system, the asynchronous running page
>> table scanning code can run on these idle CPUs instead of the CPUs the
>> workload is running on.
>> 
>> So on system with enough idle CPU time, it's better to scan the page
>> tables asynchronously to take full advantages of these idle CPU time.
>> Another scenario which can benefit from this is to scan the page
>> tables on some service CPUs of the socket, so that the real workload
>> can run on the isolated CPUs without the latency outliers caused by
>> the page table scanning.
>> 
>> But it's not perfect to scan page tables asynchronously too.  For
>> example, on system without enough idle CPU time, the CPU time isn't
>> scheduled fairly because the page table scanning is charged to the
>> workqueue thread instead of the process/thread it works for.  And
>> although the page tables are scanned for the target process, it may
>> run on a CPU that is not in the cpuset of the target process.
>> 
>> One possible solution is to let the system administrator to choose the
>> better behavior for the system via a sysctl knob (implemented in the
>> patch).  But it's not perfect too.  Because every user space knob adds
>> maintenance overhead.
>> 
>> A better solution may be to back-charge the CPU time to scan the page
>> tables to the process/thread, and find a way to run the work on the
>> proper cpuset.  After some googling, I found there's some discussion
>> about this as in the following thread,
>> 
>> https://lkml.org/lkml/2019/6/13/1321
>> 
>> So this patch may be not ready to be merged by upstream yet.  It
>> quantizes the latency outliers caused by the page table scanning in
>> AutoNUMA.  And it provides a possible way to resolve the issue for
>> users who cares about it.  And it is a potential customer of the work
>> related to the cgroup-aware workqueue or other asynchronous execution
>> mechanisms.
>> 
>
> The caveats you list are the important ones and the reason why it was
> not done asynchronously. In an earlier implementation all the work was
> done by a dedicated thread and ultimately abandoned.
>
> There is no guarantee there is an idle CPU available and one that is
> local to the thread that should be doing the scanning. Even if there is,
> it potentially prevents another task from scheduling on an idle CPU and
> similarly other workqueue tasks may be delayed waiting on the scanner. The
> hiding of the cost is also problematic because the CPU cost is hidden
> and mixed with other unrelated workqueues. It also has the potential
> to mask bugs. Lets say for example there is a bug whereby a task is
> scanning excessively, that can be easily missed when the work is done by
> a workqueue.

Do you think something like cgroup-aware workqueue is a solution deserve
to be tried when it's available?  It will not hide the scanning cost,
because the CPU time will be charged to the original cgroup or task.
Although the other tasks may be disturbed, cgroup can provide some kind
of management via cpusets.

> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads. For
> example, all threads operate on the same address space but there can be
> significant overlap where all threads are potentially scanning the same
> areas or regions that the thread has no interest in. One option would be
> to track the highest and lowest pages accessed and only scan within
> those regions for example. The tricky part is that library pages may
> create very wide windows that render the tracking useless but it could
> at least be investigated. 

In general, I think it's good to reduce the scanning cost.

Why do you think there will be overlap between the threads of a process?
If my understanding were correctly, the threads will scan one by one
instead of simultaneously.  And how to determine whether a vma need to
be scanned or not?  For example, there may be only a small portion of
pages been accessed in a vma, but they may be accessed remotely and
consumes quite some inter-node bandwidth, so need to be migrated.

Best Regards,
Huang, Ying