Date: Thu, 14 Sep 2017 11:39:53 -0500 (CDT)
From: Christopher Lameter
To: Tim Chen
cc: Linus Torvalds, "Liang, Kan", Mel Gorman, Peter Zijlstra, Ingo Molnar, Andi Kleen, Andrew Morton, Johannes Weiner, Jan Kara, "Eric W . Biederman", Davidlohr Bueso, linux-mm, Linux Kernel Mailing List
Subject: Re: [PATCH 2/2 v2] sched/wait: Introduce lock breaker in wake_up_page_bit

On Wed, 13 Sep 2017, Tim Chen wrote:

> Here's what the customer thinks happened and is willing to tell us.
> They have a parent process that spawns off 10 children per core and
> kicks them off to run. The child processes all access a common library.
> We have 384 cores, so 3840 child processes are running. When migration
> occurs on a page in the common library, the first child that accesses
> the page will page fault and lock the page, with the other children
> also page faulting quickly and piling up on the page wait list until
> the first child is done.

I think we need some way to avoid migration in cases like this. This is
crazy. Page migration was not written to deal with something like this.
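
For scale, here is a rough user-space sketch of the pile-up described in the quoted report. It is a toy model under stated assumptions, not kernel code: the names simulate_pileup and wake_all and the batch knob are hypothetical, and the only numbers taken from the report are 384 cores and 10 children per core. It shows how many waiters end up behind the first faulting child, and how long a single uninterrupted walk of the wait list would be compared with a walk that drops the lock every few entries, which is roughly the idea suggested by the "lock breaker" in the patch subject.

```python
from collections import deque

CORES = 384              # from the quoted report
CHILDREN_PER_CORE = 10   # from the quoted report


def simulate_pileup(num_children):
    """Toy model: the first child to fault locks the page, the rest queue up."""
    lock_holder = None
    wait_list = deque()
    for child in range(num_children):
        if lock_holder is None:
            lock_holder = child        # first faulting child takes the page lock
        else:
            wait_list.append(child)    # later faults pile up on the wait list
    return lock_holder, wait_list


def wake_all(wait_list, batch=None):
    """Drain the wait list; return the longest run handled in one lock hold.

    batch=None models one uninterrupted walk of the whole list.
    An integer batch models releasing the lock every `batch` entries
    (a hypothetical knob for illustration, not a kernel parameter).
    """
    longest = 0
    while wait_list:
        limit = len(wait_list) if batch is None else min(batch, len(wait_list))
        for _ in range(limit):
            wait_list.popleft()        # "wake" one waiter
        longest = max(longest, limit)
    return longest


children = CORES * CHILDREN_PER_CORE
holder, waiters = simulate_pileup(children)
print(children, len(waiters))                    # 3840 3839
print(wake_all(deque(waiters)))                  # 3839
print(wake_all(deque(waiters), batch=64))        # 64
```

The model ignores timing, CPU topology, and the migration path itself; it only makes the arithmetic concrete: one lock holder, 3839 waiters, and a wake-up walk whose length grows with the number of children unless it is broken up.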