Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753317Ab3FQQyR (ORCPT ); Mon, 17 Jun 2013 12:54:17 -0400 Received: from mail-la0-f46.google.com ([209.85.215.46]:56628 "EHLO mail-la0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752256Ab3FQQyP (ORCPT ); Mon, 17 Jun 2013 12:54:15 -0400 Date: Mon, 17 Jun 2013 20:54:10 +0400 From: Glauber Costa To: Michal Hocko Cc: Dave Chinner , Andrew Morton , linux-mm@kvack.org, LKML Subject: Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92 Message-ID: <20130617165409.GA10764@localhost.localdomain> References: <20130617141822.GF5018@dhcp22.suse.cz> <20130617151403.GA25172@localhost.localdomain> <20130617153302.GI5018@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20130617153302.GI5018@dhcp22.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2878 Lines: 66 On Mon, Jun 17, 2013 at 05:33:02PM +0200, Michal Hocko wrote: > On Mon 17-06-13 19:14:12, Glauber Costa wrote: > > On Mon, Jun 17, 2013 at 04:18:22PM +0200, Michal Hocko wrote: > > > Hi, > > > > Hi, > > > > > I managed to trigger: > > > [ 1015.776029] kernel BUG at mm/list_lru.c:92! > > > [ 1015.776029] invalid opcode: 0000 [#1] SMP > > > with Linux next (next-20130607) with https://lkml.org/lkml/2013/6/17/203 > > > on top. > > > > > > This is obviously BUG_ON(nlru->nr_items < 0) and > > > ffffffff81122d0b: 48 85 c0 test %rax,%rax > > > ffffffff81122d0e: 49 89 44 24 18 mov %rax,0x18(%r12) > > > ffffffff81122d13: 0f 84 87 00 00 00 je ffffffff81122da0 > > > ffffffff81122d19: 49 83 7c 24 18 00 cmpq $0x0,0x18(%r12) > > > ffffffff81122d1f: 78 7b js ffffffff81122d9c > > > [...] > > > ffffffff81122d9c: 0f 0b ud2 > > > > > > RAX is -1UL. > > > > Yes, fearing those kind of imbalances, we decided to leave the counter > > as a signed quantity and BUG, instead of an unsigned quantity. > > > > > I assume that the current backtrace is of no use and it would most > > > probably be some shrinker which doesn't behave. > > > > > There are currently 3 users of list_lru in tree: dentries, inodes and xfs. > > Assuming you are not using xfs, we are left with dentries and inodes. > > > > The first thing to do is to find which one of them is misbehaving. You > > can try finding this out by the address of the list_lru, and where it > > lays in the superblock. > > I am not sure I understand. Care to prepare a debugging patch for me? > > > Once we know each of them is misbehaving, then we'll have to figure > > out why. > > > > Any special filesystem workload ? > > This is two parallel kernel builds with separate kernel trees running > under 2 hard unlimitted groups (with 0 soft limit) followed by rm -rf > source trees + drop caches. Sometimes I have to repeat this multiple > times. I can also see some timer specific crashes which are most > probably not related so I am getting back to my mm tree and will hope > the tree is healthy. > > I have seen some other traces as well (mentioning ext3 dput paths) but I > cannot reproduce them anymore. > Do you have those traces? If there is a bug in the ext3 dput, then it is most likely the culprit. dput() is when we insert things into the LRU. So if we are not fully inserting an element that we should have - and later on try to remove it, we'll go negative. Can we see those traces? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/