From: "Liang, Kan"
To: Linus Torvalds, Andi Kleen
CC: Christopher Lameter, Peter Zijlstra, Mel Gorman, "Kirill A. Shutemov",
    Tim Chen, Ingo Molnar, Andrew Morton, Johannes Weiner, Jan Kara,
    linux-mm, Linux Kernel Mailing List
Subject: RE: [PATCH 1/2] sched/wait: Break up long wake list walk
Date: Wed, 23 Aug 2017 14:51:13 +0000
Message-ID: <37D7C6CF3E00A74B8858931C1DB2F0775378A8BB@SHSMSX103.ccr.corp.intel.com>

> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
>
> On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote:
> >
> > I believe in this case it's used by threads, so a reference count
> > limit wouldn't help.
>
> For the first migration try, yes. But if it's some kind of "try and
> try again" pattern, the second time you try and there are people
> waiting for the page, the page count (not the map count) would be
> elevated.
>
> So it's possible that, depending on exactly what the deeper problem
> is, the "this page is very busy, don't migrate" case might be
> discoverable, and the page count might be part of it.
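
(Not part of Linus's mail, just to make sure I follow the "page count"
idea above: a rough, untested sketch of that kind of "too busy to
migrate" check could look like the helper below. The helper name is
made up here; it only uses the page_count()/page_mapcount() helpers
from <linux/mm.h>, and a THP would presumably want total_mapcount()
instead.)

/*
 * Hypothetical sketch, not from any posted patch: treat a page as too
 * busy to NUMA-migrate when its refcount is higher than its mappings
 * alone would explain, i.e. waiters, GUP or I/O hold extra references.
 */
static bool numa_page_too_busy(struct page *page)
{
	/* one reference per mapping, plus the one the caller holds */
	int expected = page_mapcount(page) + 1;

	return page_count(page) > expected;
}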
> However, after PeterZ made that comment that page migration should
> have that should_numa_migrate_memory() filter, I am looking at that
> mpol_misplaced() code.
>
> And honestly, that MPOL_PREFERRED / MPOL_F_LOCAL case really looks
> like complete garbage to me.
>
> It looks like garbage exactly because it says "always migrate to the
> current node", but that's crazy - if it's a group of threads all
> running together on the same VM, that obviously will just bounce the
> page around for absolutely zero good reason.
>
> The *other* memory policies look fairly sane. They basically have a
> fairly well-defined preferred node for the policy (although the
> "MPOL_INTERLEAVE" looks wrong for a hugepage). But
> MPOL_PREFERRED/MPOL_F_LOCAL really looks completely broken.
>
> Maybe people expected that anybody who uses MPOL_F_LOCAL will also
> bind all threads to one single node?
>
> Could we perhaps make that "MPOL_PREFERRED / MPOL_F_LOCAL" case just
> do the MPOL_F_MORON policy, which *does* use that "should I migrate
> to the local node" filter?
>
> IOW, we've been looking at the waiters (because the problem shows up
> due to the excessive wait queues), but maybe the source of the
> problem comes from the numa balancing code just insanely bouncing
> pages back-and-forth if you use that "always balance to local node"
> thing.
>
> Untested (as always) patch attached.

The patch doesn't work.

Thanks,
Kan
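
P.S. Just to make sure we are talking about the same idea: my reading
of the suggestion is roughly the fragment below (my own paraphrase of
the MPOL_PREFERRED case in mpol_misplaced(), not the literal attached
patch, and untested in this form). polnid, thisnid, curnid, thiscpu
and pol are the existing locals of mpol_misplaced(); the change is to
run the MPOL_F_LOCAL case through should_numa_migrate_memory(), the
same way the MPOL_F_MORON path does, instead of unconditionally
picking the faulting node:

	case MPOL_PREFERRED:
		if (pol->flags & MPOL_F_LOCAL) {
			/*
			 * was: polnid = numa_node_id();
			 * instead, behave like MPOL_F_MORON and let the
			 * numa-balancing filter veto the move.
			 */
			polnid = thisnid;
			if (!should_numa_migrate_memory(current, page,
							curnid, thiscpu))
				goto out;	/* leave the page where it is */
		} else {
			polnid = pol->v.preferred_node;
		}
		break;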