Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit
From: Tim Chen
To: Linus Torvalds
Cc: "Liang, Kan", Mel Gorman, Peter Zijlstra, Ingo Molnar, Andi Kleen,
    Andrew Morton, Johannes Weiner, Jan Kara, Christopher Lameter,
    "Eric W . Biederman", Davidlohr Bueso, linux-mm,
    Linux Kernel Mailing List
Date: Wed, 13 Sep 2017 19:12:23 -0700

On 08/29/2017 09:24 AM, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 9:13 AM, Tim Chen wrote:
>>
>> It is affecting not a production use, but the customer's acceptance
>> test for their systems. So I suspect it is a stress test.
>
> Can you gently poke them and ask if they might make their stress test
> code available?
>
> Tell them that we have a fix, but right now it's delayed into 4.14
> because we have no visibility into what it is that it actually fixes,
> and whether it's all that critical or just some microbenchmark.
>

Linus,

Here's what the customer thinks happened and is willing to tell us.

They have a parent process that spawns 10 children per core and kicks
them off to run. The child processes all access a common library. We
have 384 cores, so 3840 child processes are running. When migration
occurs on a page in the common library, the first child that accesses
the page will page fault and lock the page. The other children also
page fault quickly and pile up in the page wait list until the first
child is done.

Probably some kind of access pattern on the common library induces the
page migration.

BTW, will you be merging these 2 patches in 4.14?

Thanks.

Tim
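
A rough sketch of the workload shape described above, for anyone who
wants to experiment. This is not the customer's test, which has not been
shared. It assumes a plain file mapped read-only stands in for the
common library; the names touch_pages and run_stress and the pipe used
as a start gate are hypothetical. It does not trigger page migration by
itself, so that would have to come from elsewhere (for example NUMA
balancing or compaction on the test machine).

```python
import mmap
import os
import tempfile


def touch_pages(path, passes):
    """Map the file read-only and read one byte from every page, `passes` times."""
    total = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            for _ in range(passes):
                for off in range(0, len(m), mmap.PAGESIZE):
                    total += m[off]
    return total


def run_stress(path, children_per_core=10, cores=None, passes=100):
    """Fork children_per_core * cores readers, release them together, reap them.

    Returns (children spawned, children that exited with status 0).
    """
    cores = cores or os.cpu_count() or 1
    gate_r, gate_w = os.pipe()  # hypothetical start gate for the children
    pids = []
    for _ in range(children_per_core * cores):
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                os.close(gate_w)
                # Blocks until every write end of the pipe has been closed.
                os.read(gate_r, 1)
                touch_pages(path, passes)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)

    # Closing the parent's ends releases all children at roughly the same time.
    os.close(gate_r)
    os.close(gate_w)

    ok = 0
    for pid in pids:
        _, wait_status = os.waitpid(pid, 0)
        if os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0:
            ok += 1
    return len(pids), ok


if __name__ == "__main__":
    # Stand-in for the shared library: 64 pages of dummy data.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"\x01" * (mmap.PAGESIZE * 64))
        lib_path = f.name
    try:
        spawned, ok = run_stress(lib_path, children_per_core=10, passes=100)
        print(f"spawned {spawned} children, {ok} exited cleanly")
    finally:
        os.unlink(lib_path)
```

With the defaults this forks 10 readers per online CPU, which on a
384-core machine gives the 3840 processes mentioned above. Every child
maps the same file, so all of them fault on the same page cache pages.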