From: Satoru Takeuchi
Date: Wed, 26 Jul 2017 06:54:01 +0900
Subject: [FYI] GCC segfaults under heavy multithreaded compilation with AMD Ryzen
To: LKML
Cc: x86@kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

# I'm an LKML subscriber, but not an x86 list subscriber.

I found the following new Linux kernel bugzilla entry about a Ryzen-related
problem. Since many developers don't check this bugzilla, and I've also
encountered this problem myself, I decided to bring it up here.

https://bugzilla.kernel.org/show_bug.cgi?id=196481:
> I am running Ubuntu and installed the mainline kernel from the mainline PPA.
> It seems like the Ryzen processor has some bug that leads to gcc crashing
> when compiling a very large program under heavy load. This is easily reproduced
> in my system using the script from
>
> https://github.com/suaefar/ryzen-test
>
> (It assumes that you are running Ubuntu, maybe Debian also works. Just clone it and run the
> script kill_ryzen.sh. It downloads the gcc 7.1 code and start multiple compilations of it. If any
> compilations fails its warns the user giving the time to detect failure).
>
> There is already a bug report about this in the FreeBSD bugzilla
> (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399#c89).
> There is also a thread on the subject in AMD community forum
> (https://community.amd.com/thread/215773?start=300&tstart=0)
> and Phoronix (https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads).
>
> This is probably a processor bug. But I thought that I should try to call the attention of
> the kernel developers to this issue as it may be possible to workaround it in the kernel.
>
> Obs: If I disable SMT in BIOS the problem gets much better moving from failures
> after a couple of minute to one failure in 3 to 4 hours)

What I want here is for this problem to become known to many people,
especially x86 experts, in the hope of getting hints toward the root cause
and of producing a reliable workaround patch.

Summary of this problem from my point of view:

- gcc sometimes fails with SEGV at random
- at least part of this problem is caused by instructions being run at
  "RIP - 0x40"
- tens of people have encountered this problem
- it is probably a hardware problem: other OSes (WSL, NetBSD, and FreeBSD)
  have encountered a very similar problem. In addition, this problem happens
  with ECC memory and with memory that passes memtest86
- the root cause has not been found yet. AMD seems to have been trying to
  find it for several months, but there has been no update from AMD yet
- there is a workaround patch in FreeBSD, but it is not clear that it is
  reliable, since the root cause is unknown

For more detail, please refer to the links in the above-mentioned bugzilla.

Regards,
Satoru
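
P.S. To illustrate the reproduction approach described in the quoted report (several compilations running in parallel, with the time to the first failure reported), here is a minimal Python sketch. It is not the kill_ryzen.sh script itself. The function names (run_once, run_until_failure) and the placeholder workload are hypothetical, chosen only for this example, and any nonzero exit status is simply treated as a failure.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one command and return its exit status.

    On POSIX, a negative value means the child was killed by a signal.
    """
    done = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return done.returncode


def run_until_failure(cmd, jobs, max_rounds):
    """Run `jobs` copies of `cmd` in parallel, round after round.

    Returns (elapsed_seconds, returncode) for the first failing job seen,
    or None if every job in every round exited with status 0.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(max_rounds):
            # pool.map yields the exit statuses in submission order.
            for returncode in pool.map(run_once, [cmd] * jobs):
                if returncode != 0:
                    return time.monotonic() - start, returncode
    return None


if __name__ == "__main__":
    # Placeholder workload (an assumption): substitute a real compile command.
    workload = [sys.executable, "-c", "pass"]
    result = run_until_failure(workload, jobs=4, max_rounds=2)
    if result is None:
        print("no failure observed")
    else:
        elapsed, code = result
        print(f"failure after {elapsed:.1f}s, exit status {code}")
```

With a real, long-running compile command as the workload and `jobs` set to at least the number of hardware threads, the elapsed time would correspond to the "time to detect failure" mentioned in the report.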