From: "Liang, Kan" <kan.liang@intel.com>
To: Linus Torvalds <torvalds@linux-foundation.org>,
        Mel Gorman <mgorman@suse.de>,
        "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
CC: Tim Chen <tim.c.chen@linux.intel.com>,
        Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@elte.hu>,
        Andi Kleen <ak@linux.intel.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        "Johannes Weiner" <hannes@cmpxchg.org>, Jan Kara <jack@suse.cz>,
        linux-mm <linux-mm@kvack.org>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Fri, 18 Aug 2017 13:06:04 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F07753787920@SHSMSX103.ccr.corp.intel.com>
References: <84c7f26182b7f4723c0fe3b34ba912a9de92b8b7.1502758114.git.tim.c.chen@linux.intel.com>
 <CA+55aFznC1wqBSfYr8=92LGqz5-F6fHMzdXoqM4aOYx8sT1Dhg@mail.gmail.com>
 <37D7C6CF3E00A74B8858931C1DB2F07753786CE9@SHSMSX103.ccr.corp.intel.com>
 <CA+55aFwzTMrZwh7TE_VeZt8gx5Syoop-kA=Xqs56=FkyakrM6g@mail.gmail.com>
 <37D7C6CF3E00A74B8858931C1DB2F0775378761B@SHSMSX103.ccr.corp.intel.com>
 <CA+55aFy_RNx5TQ8esjPPOKuW-o+fXbZgWapau2MHyexcAZtqsw@mail.gmail.com>
In-Reply-To: <CA+55aFy_RNx5TQ8esjPPOKuW-o+fXbZgWapau2MHyexcAZtqsw@mail.gmail.com>
Accept-Language: zh-CN, en-US
Content-Language: en-US
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Sender: linux-kernel-owner@vger.kernel.org
Content-Transfer-Encoding: 8bit
Content-Length: 1525
Lines: 46


> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan <kan.liang@intel.com> wrote:
> >
> > Here is the call stack of wait_on_page_bit_common when the queue is
> > long (entries >1000).
> >
> > # Overhead  Trace output
> > # ........  ..................
> > #
> >    100.00%  (ffffffff931aefca)
> >             |
> >             ---wait_on_page_bit
> >                __migration_entry_wait
> >                migration_entry_wait
> >                do_swap_page
> >                __handle_mm_fault
> >                handle_mm_fault
> >                __do_page_fault
> >                do_page_fault
> >                page_fault
> 
> Hmm. Ok, so it does seem to very much be related to migration. Your
> wake_up_page_bit() profile made me suspect that, but this one seems to
> pretty much confirm it.
> 
> So it looks like that wait_on_page_locked() thing in __migration_entry_wait(),
> and what probably happens is that your load ends up triggering a lot of
> migration (or just migration of a very hot page), and then *every* thread
> ends up waiting for whatever page that ended up getting migrated.
> 
> And so the wait queue for that page grows hugely long.
> 
> Looking at the other profile, the thing that is locking the page (that everybody
> then ends up waiting on) would seem to be
> migrate_misplaced_transhuge_page(), so this is _presumably_ due to NUMA
> balancing.
> 
> Does the problem go away if you disable the NUMA balancing code?
> 

Yes, the problem goes away when NUMA balancing is disabled.
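
For anyone who wants to repeat the check: this message does not record how
balancing was switched off, so the following is only a minimal sketch. It
assumes the usual kernel.numa_balancing sysctl, exposed at
/proc/sys/kernel/numa_balancing; the helper names are made up for
illustration, and writing the file needs root.

```python
from pathlib import Path

# Assumed location of the automatic NUMA balancing sysctl
# (kernel.numa_balancing); not stated in the thread itself.
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")


def numa_balancing_state(path=NUMA_BALANCING):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        # File missing (no NUMA balancing support), unreadable, or empty.
        return None


def set_numa_balancing(enabled, path=NUMA_BALANCING):
    """Write 1 (enable) or 0 (disable) to the sysctl file."""
    path.write_text("1\n" if enabled else "0\n")


if __name__ == "__main__":
    print("kernel.numa_balancing =", numa_balancing_state())
```

Calling set_numa_balancing(False) before the run and checking that
numa_balancing_state() reports 0 would be one way to confirm the setting
took effect before looking at the wait queue length again.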


Thanks,
Kan