Message-ID: <4DEFAB15.2060905@fnarfbargle.com>
Date: Thu, 09 Jun 2011 01:02:13 +0800
From: Brad Campbell
To: Eric Dumazet
CC: Patrick McHardy, Bart De Schuymer, kvm@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, netfilter-devel@vger.kernel.org
Subject: Re: KVM induced panic on 2.6.38[2367] & 2.6.39
In-Reply-To: <1307505541.3102.12.camel@edumazet-laptop>

On 08/06/11 11:59, Eric Dumazet wrote:

> Well, a bisection definitely should help, but needs a lot of time in
> your case.

Yes. compile, test, crash, walk out to the other building to press reset, lather, rinse, repeat.

I need a reset button on the end of a 50M wire, or a hardware watchdog!

Actually it's not so bad. If I turn off slub debugging the kernel panics and reboots itself.

This.. :

[ 2.913034] netconsole: remote ethernet address 00:16:cb:a7:dd:d1
[ 2.913066] netconsole: device eth0 not up yet, forcing it
[ 3.660062] Refined TSC clocksource calibration: 3213.422 MHz.
[ 3.660118] Switching to clocksource tsc
[ 63.200273] r8169 0000:03:00.0: eth0: unable to load firmware patch rtl_nic/rtl8168e-1.fw (-2)
[ 63.223513] r8169 0000:03:00.0: eth0: link down
[ 63.223556] r8169 0000:03:00.0: eth0: link down

..is slowing down reboots considerably. 3.0-rc does _not_ like some timing hardware in my machine. Having said that, at least it does not randomly panic on SCSI like 2.6.39 does.

Ok, I've ruled out TCPMSS. Found out where it was being set and neutered it. I've replicated it with only the single DNAT rule.

> Could you try following patch, because this is the 'usual suspect' I had
> yesterday :
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 46cbd28..9f548f9 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -792,6 +792,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		fastpath = atomic_read(&skb_shinfo(skb)->dataref) == delta;
>  	}
>
> +#if 0
>  	if (fastpath &&
>  	    size + sizeof(struct skb_shared_info) <= ksize(skb->head)) {
>  		memmove(skb->head + size, skb_shinfo(skb),
> @@ -802,7 +803,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
>  		off = nhead;
>  		goto adjust_others;
>  	}
> -
> +#endif
>  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
>  	if (!data)
>  		goto nodata;

Nope.. that's not it. That might have changed the characteristic of the fault slightly, but unfortunately I got caught with a couple of fsck's, so I only got to test it 3 times tonight.

It's unfortunate that this is a production system, so I can only take it down between about 9pm and 1am. That would normally be pretty productive, except that an fsck of a 14TB ext4 can take 30 minutes if it panics at the wrong time.

I'm out of time tonight, but I'll have a crack at some bisection tomorrow night. Now I just have to go back far enough that it works, and be near enough not to have to futz around with /proc /sys or drivers.

I really, really, really appreciate you guys helping me with this. It has been driving me absolutely bonkers. If I'm ever in the same town as any of you, dinner and drinks are on me.
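
For anyone following the quoted patch: the `#if 0` block disables the in-place fast path in pskb_expand_head(), so every call falls through to the kmalloc-and-copy path. The block below is only a rough userspace sketch of that idea, not the kernel code. `struct buf`, `buf_expand()`, `TRAILER_SIZE` and the `allow_fastpath` flag are hypothetical names made up for illustration, and the opaque trailer bytes merely stand in for struct skb_shared_info. Passing `allow_fastpath = false` is meant to model the effect of the `#if 0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for sizeof(struct skb_shared_info). */
#define TRAILER_SIZE 16

/* Hypothetical userspace model of an skb head buffer. */
struct buf {
    unsigned char *head; /* start of the allocation */
    size_t capacity;     /* bytes really allocated; models ksize(skb->head) */
    size_t size;         /* data area length; the trailer sits at head + size */
    size_t used;         /* bytes in use, counted from head */
};

enum expand_path { EXPAND_FAILED = -1, EXPAND_IN_PLACE = 0, EXPAND_REALLOC = 1 };

/*
 * Grow the data area by nhead bytes of headroom and ntail bytes of tailroom.
 * With allow_fastpath set and the buffer unshared, reuse slack in the
 * existing allocation; otherwise allocate a new buffer and copy.
 */
static enum expand_path buf_expand(struct buf *b, size_t nhead, size_t ntail,
                                   bool shared, bool allow_fastpath)
{
    size_t new_size = b->size + nhead + ntail;
    unsigned char *data;

    assert(b->used <= b->size);

    if (allow_fastpath && !shared &&
        new_size + TRAILER_SIZE <= b->capacity) {
        /* Move the trailer first: the data shift may overwrite its old slot. */
        memmove(b->head + new_size, b->head + b->size, TRAILER_SIZE);
        memmove(b->head + nhead, b->head, b->used);
        b->size = new_size;
        b->used += nhead;
        return EXPAND_IN_PLACE;
    }

    /* Slow path: fresh allocation, then copy the data and the trailer over. */
    data = malloc(new_size + TRAILER_SIZE);
    if (!data)
        return EXPAND_FAILED;
    memcpy(data + nhead, b->head, b->used);
    memcpy(data + new_size, b->head + b->size, TRAILER_SIZE);
    free(b->head);
    b->head = data;
    b->capacity = new_size + TRAILER_SIZE;
    b->size = new_size;
    b->used += nhead;
    return EXPAND_REALLOC;
}
```

In this model the two paths leave the caller with the same layout, and only the fast path touches memory beyond the old data area of the existing allocation. Since the panic still happened with the real fast path compiled out, that path does not look like the culprit.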