Date: Mon, 21 Aug 2017 19:32:34 +0100
From: Mel Gorman
To: Linus Torvalds
Cc: "Liang, Kan", Mel Gorman, "Kirill A. Shutemov", Tim Chen,
    Peter Zijlstra, Ingo Molnar, Andi Kleen, Andrew Morton,
    Johannes Weiner, Jan Kara, linux-mm, Linux Kernel Mailing List
Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
Message-ID: <20170821183234.kzennaaw2zt2rbwz@techsingularity.net>

On Fri, Aug 18, 2017 at 12:14:12PM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman
> wrote:
> >
> > One option to mitigate (but not eliminate) the problem is to record when
> > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
> > task_numa_fault().
>
> Well, finding it contended is fairly easy - just look at the page wait
> queue, and if it's not empty, assume it's due to contention.
>

Yes.

> I also wonder if we could be even *more* hacky, and in the whole
> __migration_entry_wait() path, change the logic from:
>
>  - wait on page lock before retrying the fault
>
> to
>
>  - yield()
>
> which is hacky, but there's a rationale for it:
>
>  (a) avoid the crazy long wait queues ;)
>
>  (b) we know that migration is *supposed* to be CPU-bound (not IO
> bound), so yielding the CPU and retrying may just be the right thing
> to do.
>

Potentially. I spent a few hours trying to construct a test case that
would migrate constantly and could be used as a basis for evaluating a
patch or an alternative. Unfortunately it was not as easy as I thought,
and I still have to construct a case that causes migration storms severe
enough to leave multiple threads waiting on a single page.

> Because that code sequence doesn't actually depend on
> "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who
> does "migration_entry_wait()" _has_ to retry anyway, since the page
> table contents may have changed by waiting.
>
> So I'm not proud of the attached patch, and I don't think it's really
> acceptable as-is, but maybe it's worth testing? And maybe it's
> arguably no worse than what we have now?
>
> Comments?
>

The transhuge migration path for NUMA balancing doesn't go through the
migration_entry_wait() path, despite similarly named functions that
suggest it does, so this may only have a noticeable effect when THP is
disabled. It's worth trying anyway. Covering both paths would be
something like the patch below, which spins until the page is unlocked
or the task should reschedule.
It's not even boot tested as I spent what time I had on the test case
that I hoped would be able to prove it really works.

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 79b36f57c3ba..31cda1288176 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -517,6 +517,13 @@ static inline void wait_on_page_locked(struct page *page)
 		wait_on_page_bit(compound_head(page), PG_locked);
 }
 
+void __spinwait_on_page_locked(struct page *page);
+static inline void spinwait_on_page_locked(struct page *page)
+{
+	if (PageLocked(page))
+		__spinwait_on_page_locked(page);
+}
+
 static inline int wait_on_page_locked_killable(struct page *page)
 {
 	if (!PageLocked(page))
diff --git a/mm/filemap.c b/mm/filemap.c
index a49702445ce0..c9d6f49614bc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1210,6 +1210,15 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 	}
 }
 
+void __spinwait_on_page_locked(struct page *page)
+{
+	do {
+		cpu_relax();
+	} while (PageLocked(page) && !cond_resched());
+
+	wait_on_page_locked(page);
+}
+
 /**
  * page_cache_next_hole - find the next hole (not-present entry)
  * @mapping: mapping
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 90731e3b7e58..c7025c806420 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1443,7 +1443,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
+		spinwait_on_page_locked(page);
 		put_page(page);
 		goto out;
 	}
@@ -1480,7 +1480,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 		if (!get_page_unless_zero(page))
 			goto out_unlock;
 		spin_unlock(vmf->ptl);
-		wait_on_page_locked(page);
+		spinwait_on_page_locked(page);
 		put_page(page);
 		goto out;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index e84eeb4e4356..9b6c3fc5beac 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -308,7 +308,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
 	if (!get_page_unless_zero(page))
 		goto out;
 	pte_unmap_unlock(ptep, ptl);
-	wait_on_page_locked(page);
+	spinwait_on_page_locked(page);
 	put_page(page);
 	return;
 out:
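
For reference, the kind of reproducer I was attempting is roughly the
userspace sketch below: pin threads to two different NUMA nodes and have
them all hammer the same buffer so that automatic NUMA balancing keeps
migrating its pages back and forth, ideally leaving several threads
faulting on the same page mid-migration. This is only a sketch of the
idea, not something that has been shown to trigger the storm; it assumes
libnuma and at least two nodes, and the buffer size, thread count and
node numbers are arbitrary.

/*
 * migrate-storm.c: stress automatic NUMA balancing by having threads on
 * two nodes repeatedly touch the same buffer. Runs until interrupted.
 * Build: gcc -O2 -pthread migrate-storm.c -lnuma
 */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE	(64UL << 20)	/* 64MB shared buffer */
#define NR_THREADS	16
#define PAGE_SIZE	4096UL

static char *buf;

static void *worker(void *arg)
{
	long node = (long)arg;
	size_t off;

	/* Pin this thread to one node so hinting faults come from both sides */
	numa_run_on_node(node);

	for (;;) {
		/* Touch every page to keep generating NUMA hinting faults */
		for (off = 0; off < BUF_SIZE; off += PAGE_SIZE)
			buf[off]++;
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	long i;

	if (numa_available() < 0 || numa_max_node() < 1) {
		fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
		return 1;
	}

	buf = numa_alloc_onnode(BUF_SIZE, 0);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(buf, 0, BUF_SIZE);

	/* Alternate threads between node 0 and node 1 */
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)(i % 2));

	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}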