Return-Path:
Received: from mail-io0-f177.google.com ([209.85.223.177]:45109 "EHLO
        mail-io0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750736AbdKIUEU (ORCPT ); Thu, 9 Nov 2017 15:04:20 -0500
MIME-Version: 1.0
In-Reply-To: <40ad7c6e-f0d7-959a-bf29-d3e3843f5d31@gentoo.org>
References: <20171109193715.GB21978@ZenIV.linux.org.uk>
        <40ad7c6e-f0d7-959a-bf29-d3e3843f5d31@gentoo.org>
From: Linus Torvalds
Date: Thu, 9 Nov 2017 12:04:19 -0800
Message-ID:
Subject: Re: [nfsd4] potentially hardware breaking regression in 4.14-rc and 4.13.11
To: Patrick McLean
Cc: Al Viro, Bruce Fields, "Darrick J. Wong", Linux Kernel Mailing List,
        Linux NFS Mailing List, stable, Thorsten Leemhuis
Content-Type: text/plain; charset="UTF-8"
Sender: linux-nfs-owner@vger.kernel.org
List-ID:

On Thu, Nov 9, 2017 at 11:51 AM, Patrick McLean wrote:
>
> We do have CONFIG_GCC_PLUGIN_STRUCTLEAK and
> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL enabled on these boxes as well as
> CONFIG_GCC_PLUGIN_RANDSTRUCT as you pointed out before.

It might be worth just verifying without RANDSTRUCT in particular.

That case has probably not gotten a huge amount of testing. As Al
points out, it can cause absolutely horrendous cache access pattern
changes, but it might also be triggering some corruption in case
there's a problem with the plugin, or with some piece of kernel code
that gets confused by it.

And most obviously: if there is some module or part of the kernel that
got compiled with a different seed for the randstruct hashing, that
will break in nasty nasty ways.

Your out-of-kernel module is the obvious suspect for something like
that, but honestly, it could be some missing build dependency, or
simply a missing special case in the plugin itself, a missing
__no_randomize_layout, or any number of things.

We've hit gcc bugs many times before - and the plugins are just new
opportunities to hit cases that have gotten a lot less testing than
the "normal" code flow has.

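The mismatched-seed case mentioned above can be illustrated with a
minimal user-space sketch. This is not kernel code and not what the
plugin emits: the struct names, field names and the helper function are
hypothetical, and the two layouts are written out by hand to stand in
for what two different randstruct seeds might produce for the same
structure.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical field order as the core kernel build sees it (seed A). */
struct obj_core {
	void *ops;
	unsigned long flags;
	int refcount;
};

/* The same fields as a module built with another seed (seed B) sees them. */
struct obj_module {
	int refcount;
	unsigned long flags;
	void *ops;
};

/*
 * What the module would read as "refcount" from an object that the core
 * allocated and filled in: it uses its own offset for the field, so it
 * picks up bytes that really belong to a different member.
 */
int module_view_of_refcount(const struct obj_core *core)
{
	int seen;

	memcpy(&seen,
	       (const char *)core + offsetof(struct obj_module, refcount),
	       sizeof(seen));
	return seen;
}
```

With a zero-filled object whose refcount the core then sets to 3, the
helper returns 0 in this sketch, because it reads the start of the ops
member instead. Nothing is diagnosed at build or load time; both sides
simply disagree about where each field lives.
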
The structleak plugin is much less likely to be a problem (simply
because it's a much simpler plugin), but hey, something being NULL
when it shouldn't possibly be might be a stray "leak initialization".

So since you seem to be able to reproduce this _reasonably_ easily,
it's definitely worth checking that it still reproduces even without
the gcc plugins. Just to narrow it down a bit.

               Linus