Message-ID: <59ADC86C.4040307@connotech.com>
Date: Mon, 04 Sep 2017 21:41:00 +0000
From: Thierry Moreau
To: Linux Kernel Mailing List
Subject: Spurious Fatal exception in interrupt, 4.1.43
X-Mailing-List: linux-kernel@vger.kernel.org

Let me report a difficulty with the Linux kernel -- unknown root cause. Basically, this is an FYI message, unless someone sees fit to investigate -- thanks in advance.

This is an on-line server with services like DNS, postfix mail, Apache, local wifi, IP forwarding. No GUI. Distribution is Crux, which means a customized installation (e.g. kernel manually configured).

The system ran for two years without significant problems (except for an episode of instability in disk access). Then it started to crash about two months ago and does so once or twice a week. It is difficult to pinpoint any environmental factor from the occurrence pattern.

Here is the trace I get from the console.
blk_done_softirq+0x73/0x90
__do_softirq+0xd4/0x1e0
irq_exit+0x7e/0xa0
do_IRQ+0x4b/0xe0
common_interrupt+0x6e/0x6e
lapic_next_deadline+0x2b/0x40
cpuidle_enter_state+0x9e/0x150
cpuidle_enter_state+0x94/0x150
cpu_startup_entry+0x221/0x2b0
start_kernel+0x405/0x410
set_init_arg+0x4e/0x4e
early_init_idt_handler_array+0x120/0x120
early_init_idt_handler_array+0x120/0x120
x86_64_start_kernel+0xe5/0xf2
Code: ff e0 0f 1f 80 00 00 00 00 48 8b 5f 50 89 74 24 0c e8 c3 fc ff ff 8b 74 24 84 00 00 00 00 00 f0 ff 47 44 e9 57 ff
RIP [] bio_endio+0x92/0xa0
RSP
---[ end trace b293c5209809c889 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt

I upgraded the kernel from 4.1.3 to 4.1.43 and got exactly the same trace after a few days of up-time.

Here are some options I see for the next step:

(A) Upgrade once more to 4.4.x, 4.9.x, or 4.12.x (or 4.13).

(B) Investigate kernel configuration for reliability-impacting options.

(C) Review the system BIOS configuration (e.g. under the hypothesis that interrupt processing calls for a special memory/cache access cycle that is borderline with the current BIOS configuration).

(D) Remove the wifi service (the only "option" from an operational perspective).

(E) Provision a replacement system and remove other services in order to isolate environmental factors (likely to confirm that a useless system is working fine generally!).

(F) Give up on this hardware (and thus deprive the Linux community of a possible improvement if some kernel issue were to be identified in troubleshooting?).

Any suggestions?

- Thierry Moreau