Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752439AbdHRSzB (ORCPT ); Fri, 18 Aug 2017 14:55:01 -0400
Received: from outbound-smtp09.blacknight.com ([46.22.139.14]:35921 "EHLO
        outbound-smtp09.blacknight.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1751706AbdHRSy5 (ORCPT );
        Fri, 18 Aug 2017 14:54:57 -0400
Date: Fri, 18 Aug 2017 19:54:55 +0100
From: Mel Gorman
To: Linus Torvalds
Cc: "Liang, Kan", Mel Gorman, "Kirill A. Shutemov", Tim Chen,
        Peter Zijlstra, Ingo Molnar, Andi Kleen, Andrew Morton,
        Johannes Weiner, Jan Kara, linux-mm, Linux Kernel Mailing List
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
Message-ID: <20170818185455.qol3st2nynfa47yc@techsingularity.net>
References: <37D7C6CF3E00A74B8858931C1DB2F07753786CE9@SHSMSX103.ccr.corp.intel.com>
        <37D7C6CF3E00A74B8858931C1DB2F0775378761B@SHSMSX103.ccr.corp.intel.com>
        <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
        <37D7C6CF3E00A74B8858931C1DB2F077537879BB@SHSMSX103.ccr.corp.intel.com>
        <20170818144622.oabozle26hasg5yo@techsingularity.net>
        <37D7C6CF3E00A74B8858931C1DB2F07753787AE4@SHSMSX103.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To:
User-Agent: NeoMutt/20170421 (1.8.2)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3870
Lines: 78

On Fri, Aug 18, 2017 at 10:48:23AM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote:
> >
> >> On Fri, Aug 18, 2017 Mel Gorman wrote:
> >>
> >> That indicates that it may be a hot page and it's possible that the page is
> >> locked for a short time but waiters accumulate. What happens if you leave
> >> NUMA balancing enabled but disable THP?
> >
> > No, disabling THP doesn't help the case.
>
> Interesting. That particular code sequence should only be active for
> THP.
> What does the profile look like with THP disabled but with NUMA
> balancing still enabled?
>

While that specific code sequence is active in the example, the problem is
fundamental to what NUMA balancing does. If many threads share a single
page, base page or THP, then any thread accessing the data during migration
will block on the page lock. The symptoms will be different but I am willing
to bet it'll be a wake on a page lock either way. NUMA balancing is somewhat
unique in that it's relatively easy to have lots of threads depend on a
single page's lock.

> Just asking because maybe that different call chain could give us some
> other ideas of what the commonality here is that triggers out
> behavioral problem.
>

I am reasonably confident that the commonality is multiple threads sharing
a page. Maybe it's a single hot structure that is shared between threads.
Maybe it's parallel buffers that are aligned on a sub-page boundary.
Multiple threads accessing buffers aligned by cache lines would do it, which
is reasonable behaviour for a parallelised compute load, for example. If I'm
right, writing a test case for it is straightforward and I'll get to it on
Monday when I'm back near my normal work machine.

The initial guess that it may be allocation latency was obviously way off.
I didn't follow through properly, but the fact that it's not THP-specific
means the length of time it takes to migrate is possibly irrelevant. If the
page is hot enough, threads will block once migration starts even if the
migration completes quickly.

> I was really hoping that we'd root-cause this and have a solution (and
> then apply Tim's patch as a "belt and suspenders" kind of thing), but
> it's starting to smell like we may have to apply Tim's patch as a
> band-aid, and try to figure out what the trigger is longer-term.
>

I believe the trigger is simply that a hot page gets unmapped and then
threads lock on it.
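
A minimal userspace sketch of the access pattern described above: several
threads each hammer their own cache-line-sized slot, with all slots packed
into one page. The names (run_shared_page, hammer, struct worker) and the
64-byte CACHE_LINE are assumptions made for illustration. This only
generates the sharing pattern; whether the page is actually migrated depends
on running it on a multi-node machine with NUMA balancing enabled and the
threads spread across nodes.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CACHE_LINE 64	/* assumed cache-line size, not probed */

struct worker {
	pthread_t tid;
	volatile uint64_t *slot;	/* this thread's private cache line */
	uint64_t iterations;
};

/* Each thread only ever touches its own slot, so there is no data race. */
static void *hammer(void *arg)
{
	struct worker *w = arg;

	for (uint64_t i = 0; i < w->iterations; i++)
		(*w->slot)++;
	return NULL;
}

/*
 * Run nr_threads threads that each increment a cache-line-aligned slot
 * inside a single shared page. Returns the sum of all slots, or 0 if the
 * arguments are invalid or setup fails.
 */
static uint64_t run_shared_page(int nr_threads, uint64_t iterations)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct worker *workers;
	uint64_t total = 0;
	int started = 0;
	void *page;

	if (page_size <= 0 || nr_threads <= 0 ||
	    (long)nr_threads * CACHE_LINE > page_size)
		return 0;

	/* One page-aligned page holds every thread's slot. */
	if (posix_memalign(&page, (size_t)page_size, (size_t)page_size) != 0)
		return 0;
	memset(page, 0, (size_t)page_size);

	workers = calloc((size_t)nr_threads, sizeof(*workers));
	if (!workers) {
		free(page);
		return 0;
	}

	for (int i = 0; i < nr_threads; i++) {
		size_t offset = (size_t)i * CACHE_LINE;

		/* Every slot must lie inside the one shared page. */
		assert(offset + sizeof(uint64_t) <= (size_t)page_size);
		workers[i].slot = (volatile uint64_t *)((char *)page + offset);
		workers[i].iterations = iterations;
		if (pthread_create(&workers[i].tid, NULL, hammer,
				   &workers[i]) != 0)
			break;
		started++;
	}

	/* Only join the threads that were actually created. */
	for (int i = 0; i < started; i++) {
		pthread_join(workers[i].tid, NULL);
		total += *workers[i].slot;
	}

	free(workers);
	free(page);
	return total;
}
```

Calling run_shared_page(4, 100000) from a main() and checking that it
returns 400000 confirms the threads ran to completion. A real reproducer
would want far more threads and iterations, plus explicit CPU placement.
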
One option to mitigate (but not eliminate) the problem is to record when
the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
task_numa_fault(). For each time it's passed in, shift numa_scan_period << 1,
which will slow the scanner and reduce how often contention occurs. If it's
heavily contended, the period will quickly reach numa_scan_period_max. That
is simple, with the caveat that a single hot contended page will slow all
balancing. The main problem is that this mitigates rather than eliminates
the problem. No matter how slow the scanner is, it'll still hit the case
where many threads contend on a single page.

An alternative is to set a VMA flag on VMAs if many contentions are
detected and stop scanning that VMA entirely. It would need a VMA flag,
which right now might mean making vm_flags u64 and increasing the size of
vm_area_struct on 32-bit. The downside is that it is permanent. If heavy
contention happens then scanning that VMA is disabled for the lifetime of
the process because there would no longer be a way to detect that
re-enabling is appropriate.

A variation would be to record contentions in struct numa_group and return
false in should_numa_migrate_memory if contentions are high, scaling the
count down over time. It wouldn't be perfect as processes sharing hot pages
are not guaranteed to have a numa_group in common, but it may be sufficient.

-- 
Mel Gorman
SUSE Labs
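
The scan-period backoff and the decaying contention counter described above
can be modelled in userspace roughly as follows. This is a sketch, not
kernel code: struct numa_model, the model_* helpers, the threshold parameter
and the 0x80 value given to TNF_PAGE_CONTENDED are all invented for
illustration. Only the doubling of the scan period, the clamp at the
maximum, and the idea of refusing migration while contention is high come
from the mail.

```c
#include <stdbool.h>

/* Hypothetical flag named in the mail; the value is made up here. */
#define TNF_PAGE_CONTENDED 0x80

/* Userspace stand-in for the per-task and per-group NUMA state. */
struct numa_model {
	unsigned int scan_period;	/* ms, models numa_scan_period */
	unsigned int scan_period_max;	/* ms, models numa_scan_period_max */
	unsigned int contentions;	/* models a counter in numa_group */
};

/*
 * Models the proposed handling in task_numa_fault(): every contended
 * fault doubles the scan period, clamped to the maximum, and bumps the
 * contention counter.
 */
static void model_numa_fault(struct numa_model *m, int flags)
{
	if (!(flags & TNF_PAGE_CONTENDED))
		return;

	m->contentions++;

	/* Double the period without overshooting the ceiling. */
	if (m->scan_period > m->scan_period_max / 2)
		m->scan_period = m->scan_period_max;
	else
		m->scan_period <<= 1;
}

/* Periodic decay so that a past burst of contention is forgotten. */
static void model_decay(struct numa_model *m)
{
	m->contentions >>= 1;
}

/*
 * Models the proposed early exit in should_numa_migrate_memory():
 * refuse to migrate while recent contention is at or above a threshold.
 */
static bool model_should_migrate(const struct numa_model *m,
				 unsigned int threshold)
{
	return m->contentions < threshold;
}
```

Starting from a 1000 ms period with a 60000 ms ceiling, six contended faults
in a row are enough to reach the ceiling in this model, which shows how
quickly a single hot page would slow all balancing for the task. The decay
step is what distinguishes the numa_group variation from the permanent VMA
flag: once contention stops, halving the counter eventually lets migration
resume.
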