Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S941400AbcLVUmq (ORCPT ); Thu, 22 Dec 2016 15:42:46 -0500
Received: from ipmail05.adl6.internode.on.net ([150.101.137.143]:13980 "EHLO ipmail05.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756096AbcLVUmp (ORCPT ); Thu, 22 Dec 2016 15:42:45 -0500
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: A2DeIgDUOVxYIOyiLHlYBhoBAQEBAgEBAQEIAQEBAYM1AQEBAQEfgWaCfoN5h0yUOAEBAQEBB4EcjDiED4ROggmGHAICAQECgWtBEwECAQEBAQEBAQYBAQEBAQE5RUIShBUBBTocIxAIAxgJJQ8FJQMHGhOIYAytIIp/AQEBBwIBJSCFY4UngT8BgnAMAYVuBZp4kS2QX0mNXYQPIAE1gQgWDYQjJ4FZKjSCLIQCgjsBAQE
Date: Fri, 23 Dec 2016 07:42:40 +1100
From: Dave Chinner
To: Linus Torvalds
Cc: Thomas Gleixner, Ingo Molnar, Peter Anvin, Linux Kernel Mailing List, the arch/x86 maintainers
Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0
Message-ID: <20161222204240.GJ4758@dastard>
References: <20161214222411.GH4326@dastard>
 <20161214222953.GI4326@dastard>
 <20161216185906.t2wmrr6wqjdsrduw@straylight.hirudinean.org>
 <20161221221638.GD4758@dastard>
 <20161222001303.nvrtm22szn3hgxar@straylight.hirudinean.org>
 <20161222051322.GF4758@dastard>
 <20161222062858.GG4758@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2990
Lines: 74

On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [ 0.000000] Zone ranges:
> > [ 0.000000]   DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.000000]   DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > [ 0.000000]   Normal [mem 0x0000000100000000-0x000000083fffffff]
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000]   node 0: [mem 0x0000000000001000-0x000000000009efff]
> > [ 0.000000]   node 0: [mem 0x0000000000100000-0x00000000bffdefff]
> > [ 0.000000]   node 0: [mem 0x0000000100000000-0x00000003bfffffff]
> > [ 0.000000]   node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > [ 0.000000]   node 0: [mem 0x0000000800000000-0x000000083fffffff]
> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> >
> > the numa=fake=4 CLI option is broken.
>
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
>
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
>
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all for x86 numa since 4.9, so it must be some indirect breakage.
>
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
>
> Dave, if you're still around, can you send out the kernel config file
> you used...

Looking at this fresh this morning (i.e.
not pissed off by having everything I tried to do fail in different
ways all afternoon) I found this:

$ grep NUMA .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA is not set
$

The .config I was using for 4.9 got 'make oldconfig' upgraded, and
looking at it there's a bunch of stuff that has been turned off that
I know was set:

# CONFIG_EXPERT is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_COMPACTION is not set

and stuff I never use so don't set was set, like kernel crash dump,
a bunch of stuff for AMD CPUs, susp/resume and power management
debug, every partition type and filesystem under the sun was
selected, heaps of network devices enabled, etc.

So it looks like the problem has occurred during oldconfig, meaning
I have no idea exactly WTF I was testing. Rebuilding now with a
saner config, see what happens.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
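
A minimal sketch of how the kind of 'make oldconfig' drift described above could be spotted before booting a test kernel: parse the old and the upgraded .config and list every option whose value differs. This is an illustration only, not what was actually run in this thread. The function names `parse_config` and `config_drift` are hypothetical, and the sketch assumes the usual .config line forms (`CONFIG_FOO=value` and `# CONFIG_FOO is not set`). The kernel tree also ships scripts/diffconfig, which may be a better fit for real use.

```python
import re

# The two line forms a kernel .config normally contains (assumed here):
#   CONFIG_FOO=y            (set, built in, module, or a string/number value)
#   # CONFIG_FOO is not set (explicitly disabled)
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value}; options marked 'is not set' map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def config_drift(old_text, new_text):
    """Return {option: (old, new)} for options whose value changed.

    None means the option does not appear in that file at all.
    """
    old = parse_config(old_text)
    new = parse_config(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }
```

Reading the saved 4.9 .config and the upgraded one into strings and passing both to `config_drift` would give a dictionary to print or grep; an entry such as `CONFIG_NUMA: ('y', 'n')` would flag a silently disabled option of the sort found above.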