Date: Sun, 11 Jun 2017 16:28:11 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Matthew Wilcox, Vlastimil Babka, Larry Finger, Andrew Morton, LKML, linux-mm@kvack.org
Subject: Re: Sleeping BUG in khugepaged for i586
In-Reply-To: <20170610080941.GA12347@dhcp22.suse.cz>

On Sat, 10 Jun 2017, Michal Hocko wrote:

> > > I would just pull the cond_resched out of __collapse_huge_page_copy
> > > right after pte_unmap. But I am not really sure why this cond_resched is
> > > really needed because the changelog of the patch which adds it is quite
> > > terse on details.
> >
> > I'm not sure what could possibly be added to the changelog. We have
> > encountered need_resched warnings during the iteration.
>
> Well, the part the changelog is not really clear about is whether the
> HPAGE_PMD_NR loop itself is the source of the stall. This would be
> quite surprising because doing 512 iterations taking up to 20+s sounds
> way too much.

I have no idea where you come up with 20+ seconds. These are not soft
lockups, these are need_resched warnings. We monitor how long
need_resched has been set and when a thread takes an excessive amount of
time to reschedule after it has been set. A loop of 512 pages with ptl
contention and doing {clear,copy}_user_highpage() shows that
need_resched can sit without scheduling for an excessive amount of time.

> So is it possible that we are missing a cond_resched
> somewhere up the __collapse_huge_page_copy call path?

No.
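
To make the need_resched point concrete, here is a toy userspace model,
not kernel code. It assumes a fixed, made-up cost per page and a
hypothetical helper name (resched_latency); the only figure taken from
the thread is the 512-page loop. It shows roughly how long a set
need_resched flag would wait for a reschedule point, with and without a
check inside the loop.

```c
#include <stdbool.h>

#define HPAGE_PMD_NR 512 /* loop length quoted in the thread */

/*
 * Hypothetical model: every page copy costs cost_per_page time units and
 * need_resched becomes set at time set_at. Returns how long the flag
 * stays set before the code reaches a reschedule point.
 */
long resched_latency(long cost_per_page, long set_at, bool resched_in_loop)
{
    long now = 0;

    for (int i = 0; i < HPAGE_PMD_NR; i++) {
        now += cost_per_page; /* stand-in for copy_user_highpage() */

        /* stand-in for a cond_resched() call after each page */
        if (resched_in_loop && now >= set_at)
            return now - set_at;
    }

    /* without an in-loop check, the first reschedule point is after the loop */
    return now >= set_at ? now - set_at : 0;
}
```

In this model, with a cost of 10 units per page and the flag set at time
1, resched_latency(10, 1, false) is 5119 units while
resched_latency(10, 1, true) is 9. Real per-page costs vary with ptl
contention, so the numbers are illustrative only.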