Date: Tue, 20 Dec 2016 11:59:40 -0500
From: lsorense@csclub.uwaterloo.ca (Lennart Sorensen)
To: linux-kernel@vger.kernel.org
Subject: Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101
Message-ID: <20161220165940.GC17367@csclub.uwaterloo.ca>
In-Reply-To: <20161219180939.GA17367@csclub.uwaterloo.ca>
References: <20161219180939.GA17367@csclub.uwaterloo.ca>

On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occasionally for
> years on some of our systems running the 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
>
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep too many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
>
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
>
> The system is x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
>
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, and the dump is simply saved
as core.%e.%p, then the problem does not happen.

-- 
Len Sorensen
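
For anyone wanting to reproduce the setup, here is a rough sketch of the
kind of pipe handler described above.  It is not the actual script (that
one is a shell script); the path /usr/local/bin/save-core, the /var/crash
directory and the limit of 5 dumps are all made up for illustration.
Only the %e, %p and %t specifiers and the leading '|' are standard
core_pattern syntax.

```python
#!/usr/bin/env python3
"""Hypothetical core_pattern pipe handler (illustrative sketch only).

Assumed to be installed with something like:
  echo '|/usr/local/bin/save-core %e %p %t' > /proc/sys/kernel/core_pattern
The script path and the values below are assumptions, not the real setup.
"""
import gzip
import os
import shutil
import sys

CORE_DIR = "/var/crash"   # assumed storage directory
MAX_CORES = 5             # assumed retention limit


def core_name(exe, pid, timestamp):
    """Build the file name from executable name, pid and timestamp."""
    return "core.%s.%s.%s.gz" % (exe, pid, timestamp)


def save_core(stream, directory, exe, pid, timestamp):
    """Compress the dump read from `stream` into `directory`."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, core_name(exe, pid, timestamp))
    with gzip.open(path, "wb") as out:
        shutil.copyfileobj(stream, out)
    return path


def prune(directory, keep):
    """Delete the oldest compressed dumps so that at most `keep` remain."""
    cores = [os.path.join(directory, name)
             for name in os.listdir(directory)
             if name.startswith("core.") and name.endswith(".gz")]
    cores.sort(key=os.path.getmtime)          # oldest first
    doomed = cores[:-keep] if keep > 0 else cores
    for old in doomed:
        os.remove(old)


# The kernel starts the handler with the dump on stdin and passes the
# expanded %e, %p and %t values as arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    exe, pid, timestamp = sys.argv[1:4]
    save_core(sys.stdin.buffer, CORE_DIR, exe, pid, timestamp)
    prune(CORE_DIR, MAX_CORES)
```

With a pipe in core_pattern the kernel runs the handler as a user-mode
helper and feeds it the dump, whereas the plain core.%e.%p pattern writes
the file directly without starting a helper process.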