Message-ID: <553BE2A9.2090500@gentoo.org>
Date: Sat, 25 Apr 2015 14:53:29 -0400
From: Joshua Kinard
To: LKML, Linux MIPS List
Subject: Re: MIPS: BUG() in isolate_lru_pages in mm/vmscan.c?
References: <553BB91C.3010308@gentoo.org>
In-Reply-To: <553BB91C.3010308@gentoo.org>

On 04/25/2015 11:56, Joshua Kinard wrote:
> I keep tripping up a BUG() in isolate_lru_pages in mm/vmscan.c:1345:
>
> 		switch (__isolate_lru_page(page, mode)) {
> 		case 0:
> 			nr_pages = hpage_nr_pages(page);
> 			mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> 			list_move(&page->lru, dst);
> 			nr_taken += nr_pages;
> 			break;
>
> 		case -EBUSY:
> 			/* else it is being freed elsewhere */
> 			list_move(&page->lru, src);
> 			continue;
>
> 		default:
> 			BUG();
> 		}
>
> This is on an SGI Onyx2 platform (MIPS, IP27), two node boards (4x R14000
> CPUs), and 8G of RAM.  The problem appears tied to heavy disk I/O, typically
> writes.  I can reproduce sometimes with a long bonnie++ run, but I haven't
> gotten a recent panic() message under 4.0 yet.  Most of the time, it silently
> hardlocks.  I only have serial console access at 9600bps, so it may lock too
> fast before the serial driver can dump the panic.
>
> Is there any information behind the purpose or triggers of this BUG()?
> I went back in git all the way to the initial 2006 commit that added this
> function, but could not find any comments or explanation of just what it's
> protecting against.  That makes it hard to know where to start debugging.
>
> I've already tried switching filesystems, first ext4, now XFS.  Enabling
> CONFIG_NUMA seems to make it harder to trigger, but that's not an objective
> observation.  An md RAID resync doesn't appear to trigger it either.

This patch seems to explain things a little bit (from 2007-03-16):
http://marc.info/?l=linux-mm-commits&m=117401513810763&w=2

> Subject: lumpy: back out removal of active check in isolate_lru_pages
> From: Andy Whitcroft
>
> As pointed out by Christop Lameter it should not be possible for a page to
> change its active/inactive state without taking the lru_lock.  Reinstate this
> safety net.
>
> Signed-off-by: Andy Whitcroft
> Acked-by: Mel Gorman
> Signed-off-by: Andrew Morton
> ---
>
>  mm/vmscan.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff -puN mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages mm/vmscan.c
> --- a/mm/vmscan.c~lumpy-back-out-removal-of-active-check-in-isolate_lru_pages
> +++ a/mm/vmscan.c
> @@ -686,10 +686,13 @@ static unsigned long isolate_lru_pages(u
>  			nr_taken++;
>  			break;
>
> -		default:
> -			/* page is being freed, or is a missmatch */
> +		case -EBUSY:
> +			/* else it is being freed elsewhere */
>  			list_move(&page->lru, src);
>  			continue;
> +
> +		default:
> +			BUG();
>  		}
>
>  		if (!order)

So if my reading is correct, the BUG() is being triggered because a page might
be changing its active/inactive state w/o taking the lru_lock.  Given that the
SGI IP27 platform is an early NUMA machine and nodes can have a bit of physical
distance between them (thus some latency), could this be a sign of some kind of
SMP race condition specific to this platform?
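
To make that reading concrete, here is a rough user-space model of the three-way switch. It is only a sketch, not kernel code: struct model_page, model_isolate_lru_page(), model_isolate_one() and the outcome enum are hypothetical names invented for illustration, and the assumption that the helper reports a page that is unexpectedly off the LRU with -EINVAL is mine, not something taken from the patch above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: just the state this model needs. */
struct model_page {
	bool on_lru;	/* page is still linked on an LRU list */
	int refcount;	/* 0 means the page is being freed elsewhere */
};

/* What the caller's switch would do with the page. */
enum model_outcome { MODEL_TAKEN, MODEL_PUT_BACK, MODEL_WOULD_BUG };

/* Assumed helper behaviour: 0 on success, -EBUSY if the page is being
 * freed, -EINVAL if its state is not what the caller expects. */
static int model_isolate_lru_page(struct model_page *page)
{
	if (!page->on_lru)
		return -EINVAL;		/* state changed behind our back */
	if (page->refcount == 0)
		return -EBUSY;		/* being freed elsewhere */
	page->refcount++;		/* take a reference */
	page->on_lru = false;		/* page now belongs to the caller */
	return 0;
}

/* Mirrors the shape of the switch in isolate_lru_pages(). */
static enum model_outcome model_isolate_one(struct model_page *page)
{
	switch (model_isolate_lru_page(page)) {
	case 0:
		return MODEL_TAKEN;	/* moved to the dst list */
	case -EBUSY:
		return MODEL_PUT_BACK;	/* moved back to the src list */
	default:
		return MODEL_WOULD_BUG;	/* the kernel calls BUG() here */
	}
}

/* Small self-check exercising each branch once. */
static void model_self_check(void)
{
	struct model_page ok = { true, 1 };
	struct model_page dying = { true, 0 };
	struct model_page stale = { false, 1 };

	assert(model_isolate_one(&ok) == MODEL_TAKEN);
	assert(model_isolate_one(&dying) == MODEL_PUT_BACK);
	assert(model_isolate_one(&stale) == MODEL_WOULD_BUG);
}
```

If the model is anywhere near right, only the third case matters here: a page that sits on the list being scanned but whose state no longer matches what the scanner expects.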
--J