From: "Liang, Kan"
To: Mel Gorman, Linus Torvalds
CC: Mel Gorman, "Kirill A. Shutemov", Tim Chen, Peter Zijlstra,
 Ingo Molnar, Andi Kleen, Andrew Morton, Johannes Weiner, Jan Kara,
 linux-mm, Linux Kernel Mailing List
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Fri, 18 Aug 2017 14:20:38 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F077537879BB@SHSMSX103.ccr.corp.intel.com>
References: <84c7f26182b7f4723c0fe3b34ba912a9de92b8b7.1502758114.git.tim.c.chen@linux.intel.com>
 <37D7C6CF3E00A74B8858931C1DB2F07753786CE9@SHSMSX103.ccr.corp.intel.com>
 <37D7C6CF3E00A74B8858931C1DB2F0775378761B@SHSMSX103.ccr.corp.intel.com>
 <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>
In-Reply-To: <20170818122339.24grcbzyhnzmr4qw@techsingularity.net>

> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
> > >
> > > Here is the call stack of wait_on_page_bit_common when the queue is
> > > long (entries >1000).
> > >
> > > # Overhead  Trace output
> > > # ........  ..................
> > > #
> > >    100.00%  (ffffffff931aefca)
> > >             |
> > >             ---wait_on_page_bit
> > >                __migration_entry_wait
> > >                migration_entry_wait
> > >                do_swap_page
> > >                __handle_mm_fault
> > >                handle_mm_fault
> > >                __do_page_fault
> > >                do_page_fault
> > >                page_fault
> >
> > Hmm. Ok, so it does seem to very much be related to migration. Your
> > wake_up_page_bit() profile made me suspect that, but this one seems
> > to pretty much confirm it.
> >
> > So it looks like that wait_on_page_locked() thing in
> > __migration_entry_wait(), and what probably happens is that your
> > load ends up triggering a lot of migration (or just migration of a
> > very hot page), and then *every* thread ends up waiting for whatever
> > page ended up getting migrated.
> >
>
> Agreed.
>
> > And so the wait queue for that page grows hugely long.
> >
>
> It's basically only bounded by the maximum number of threads that can
> exist.
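
[Editor's note: for context, a sketch of __migration_entry_wait() as it
appears in mm/migrate.c around v4.13 (paraphrased, not quoted from this
thread). It shows why the queue is bounded only by the thread count:
every thread that faults on a page under migration pins the page, drops
the page-table lock, and sleeps in wait_on_page_locked(), so all of
them pile onto that one page's wait queue:

	void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
				    spinlock_t *ptl)
	{
		pte_t pte;
		swp_entry_t entry;
		struct page *page;

		spin_lock(ptl);
		pte = *ptep;
		/* Only wait if the PTE still holds a migration entry. */
		if (!is_swap_pte(pte))
			goto out;

		entry = pte_to_swp_entry(pte);
		if (!is_migration_entry(entry))
			goto out;

		page = migration_entry_to_page(entry);

		/* Pin the page so it cannot vanish while we sleep. */
		if (!get_page_unless_zero(page))
			goto out;
		pte_unmap_unlock(ptep, ptl);
		/* Every faulting thread queues on this one page here. */
		wait_on_page_locked(page);
		put_page(page);
		return;
	out:
		pte_unmap_unlock(ptep, ptl);
	}

End note; the quoted discussion continues below.]
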
> > Looking at the other profile, the thing that is locking the page
> > (that everybody then ends up waiting on) would seem to be
> > migrate_misplaced_transhuge_page(), so this is _presumably_ due to
> > NUMA balancing.
> >
>
> Yes, migrate_misplaced_transhuge_page requires NUMA balancing to be
> part of the picture.
>
> > Does the problem go away if you disable the NUMA balancing code?
> >
> > Adding Mel and Kirill to the participants, just to make them aware
> > of the issue, and just because their names show up when I look at
> > blame.
> >
>
> I can't see a way of dealing with this that would reliably detect when
> there are a large number of waiters without adding a mess. We could
> reduce the scanning rate to lessen the problem, but that would be
> difficult to target properly and wouldn't prevent the problem from
> occurring, with the added hassle that it would now be intermittent.
>
> Assuming the problem goes away when NUMA balancing is disabled, it
> would be nice to determine whether the page lock holder is trying to
> allocate a page while the queue is huge. That is the part of the
> operation that can take a long time and may be why so many callers
> are stacking up. If so, I would suggest clearing __GFP_DIRECT_RECLAIM
> from the GFP flags in migrate_misplaced_transhuge_page and assuming
> that a remote hit is always going to be cheaper than compacting
> memory to successfully allocate a THP. That may be worth doing
> unconditionally, because we would have to save a *lot* of remote
> misses to offset the compaction cost.
>
> Nothing fancy other than needing a comment if it works.
>

No, the patch doesn't work.

Thanks,
Kan

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 627671551873..87b0275ddcdb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1926,7 +1926,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  		goto out_dropref;
>  
>  	new_page = alloc_pages_node(node,
> -		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> +		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
>  		HPAGE_PMD_ORDER);
>  	if (!new_page)
>  		goto out_fail;
>
> --
> Mel Gorman
> SUSE Labs
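
[Editor's note: for reference, a sketch of the relevant flag
definitions from include/linux/gfp.h in kernels of this vintage
(v4.13-era, paraphrased). GFP_TRANSHUGE_LIGHT is the THP allocation
mask with the reclaim bits masked off, while full GFP_TRANSHUGE adds
direct reclaim (and hence compaction) back in:

	/* Reclaim bits: enter direct reclaim and/or wake kswapd. */
	#define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

	/* THP allocation that fails fast: no reclaim bits set. */
	#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
					  __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)

	/* THP allocation that may enter direct reclaim/compaction. */
	#define GFP_TRANSHUGE		(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

End note.]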