Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751054AbcDQVaT (ORCPT );
        Sun, 17 Apr 2016 17:30:19 -0400
Received: from gherkin.frus.com ([192.158.254.49]:58550 "EHLO
        gherkin.frus.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750860AbcDQVaR (ORCPT );
        Sun, 17 Apr 2016 17:30:17 -0400
X-Greylist: delayed 1479 seconds by postgrey-1.27 at vger.kernel.org;
        Sun, 17 Apr 2016 17:30:16 EDT
Date: Sun, 17 Apr 2016 16:05:32 -0500
From: Bob Tracy
To: linux-kernel@vger.kernel.org
Cc: debian-alpha@lists.debian.org, mcree@orcon.net.nz,
        jay.estabrook@gmail.com, mattst88@gmail.com
Subject: [BUG] machine check Oops on Alpha
Message-ID: <20160417210532.GA27208@gherkin.frus.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="fdj2RfSjLxBAspz7"
Content-Disposition: inline
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 5949
Lines: 94

--fdj2RfSjLxBAspz7
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Apologies in advance for the "poor" quality of this bug report.  No idea
how to proceed, because the issue historically has been intermittent to
non-existent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops which typically will occur
during a period of high disk activity (for example, during an "apt-get
update / upgrade").  If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever --
no further Oopses will occur, regardless of how hard I flog the machine.
A "bad" Oops will cause an immediate system lockup if any process
attempts to access the region of disk that was active at the time the
Oops occurred.

While a "machine check" is normally indicative of an underlying hardware
issue, the fact this is a one-time-per-boot issue has me thinking
otherwise.  I suspect a code path being traversed prior to the Oops that
gets bypassed afterward.  As previously mentioned, there have been
months-long intervals in the past where the issue has either been masked
or non-existent.  Currently, the issue has persisted through several 4.X
kernel release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops.  It occurred within a day of the last reboot, and the
machine has been running fine since.  Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system.  In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob

--fdj2RfSjLxBAspz7
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename=good_oops

Apr 9 21:40:15 smirkin kernel: Unable to handle kernel paging request at virtual address 0000000000000010
Apr 9 21:40:15 smirkin kernel: dpkg-deb(19404): Oops 0
Apr 9 21:40:15 smirkin kernel: pc = [] ra = [] ps = 0007 Not tainted
Apr 9 21:40:15 smirkin kernel: pc is at process_mcheck_info+0x54/0x370
Apr 9 21:40:15 smirkin kernel: ra is at cia_machine_check+0x98/0xb0
Apr 9 21:40:15 smirkin kernel: v0 = 0000000000000004 t0 = 0000000000000000 t1 = 0000000000000001
Apr 9 21:40:15 smirkin kernel: t2 = 0000000000000630 t3 = fffffc0000d405f0 t4 = fffffc0000acf166
Apr 9 21:40:15 smirkin kernel: t5 = 00000000001fffff t6 = 00000000ffffffff t7 = fffffc005cf38000
Apr 9 21:40:15 smirkin kernel: s0 = 0000000000000000 s1 = fffffc0000c61750 s2 = 0000000000000000
Apr 9 21:40:15 smirkin kernel: s3 = 0000000000000000 s4 = fffffc0000cbcef0 s5 = fffffc0000d405d0
Apr 9 21:40:15 smirkin kernel: s6 = fffffc0000c7ef70
Apr 9 21:40:15 smirkin kernel: a0 = 0000000000000630 a1 = fffffc0000aca965 a2 = 0000000000000630
Apr 9 21:40:15 smirkin kernel: a3 = 0000000000000000 a4 = 0000000000000000 a5 = 0000000000000000
Apr 9 21:40:15 smirkin kernel: t8 = 000000000000001f t9 = fffffc0000acbb38 t10= fffffc0000d40608
Apr 9 21:40:15 smirkin kernel: t11= 0000000000000000 pv = fffffc0000316120 at = 0000000000800000
Apr 9 21:40:15 smirkin kernel: gp = fffffc0000cabb38 sp = fffffc005cf3b978
Apr 9 21:40:15 smirkin kernel: Disabling lock debugging due to kernel taint
Apr 9 21:40:15 smirkin kernel: Trace:
Apr 9 21:40:15 smirkin kernel: [] cia_machine_check+0x98/0xb0
Apr 9 21:40:15 smirkin kernel: [] do_entInt+0x1c0/0x1e0
Apr 9 21:40:15 smirkin kernel: [] ret_from_sys_call+0x0/0x10
Apr 9 21:40:15 smirkin kernel: [] get_page_from_freelist+0x504/0xa10
Apr 9 21:40:15 smirkin kernel: [] clear_page+0x0/0xc4
Apr 9 21:40:15 smirkin kernel: [] clear_page+0x18/0xc4
Apr 9 21:40:15 smirkin kernel: [] __alloc_pages_nodemask+0xec/0xa00
Apr 9 21:40:15 smirkin kernel: [] wp_page_copy.isra.100+0x3c0/0x620
Apr 9 21:40:15 smirkin kernel: [] wp_page_copy.isra.100+0x5c/0x620
Apr 9 21:40:15 smirkin kernel: [] do_wp_page.isra.102+0x128/0x640
Apr 9 21:40:15 smirkin kernel: [] do_wp_page.isra.102+0x58/0x640
Apr 9 21:40:15 smirkin kernel: [] current_fs_time+0x4c/0x70
Apr 9 21:40:15 smirkin kernel: [] handle_mm_fault+0x73c/0x1180
Apr 9 21:40:15 smirkin kernel: [] handle_mm_fault+0xfc8/0x1180
Apr 9 21:40:15 smirkin kernel: [] timekeeping_update+0x130/0x200
Apr 9 21:40:15 smirkin kernel: [] hrtimer_run_queues+0x50/0x210
Apr 9 21:40:15 smirkin kernel: [] do_page_fault+0x150/0x500
Apr 9 21:40:15 smirkin kernel: [] find_vma+0x28/0xc0
Apr 9 21:40:15 smirkin kernel: [] do_page_fault+0xd4/0x500
Apr 9 21:40:15 smirkin kernel: [] tick_periodic.constprop.17+0x3c/0xc0
Apr 9 21:40:15 smirkin kernel: [] do_page_fault+0xbc/0x500
Apr 9 21:40:15 smirkin kernel: [] __do_softirq+0x184/0x310
Apr 9 21:40:15 smirkin kernel: [] entMM+0x9c/0xc0
Apr 9 21:40:15 smirkin kernel: [] handle_irq+0x8c/0xf0
Apr 9 21:40:15 smirkin kernel: [] do_entInt+0x5c/0x1e0
Apr 9 21:40:15 smirkin kernel:
Apr 9 21:40:15 smirkin kernel: Code: a53e0008 a55e0010 23de0020 6bfa8001 a55de018 47f00412 261dffe2

--fdj2RfSjLxBAspz7--
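
The report separates a "good" Oops from a "bad" one by whether the trace contains calls to ext3/4 functions. A minimal sketch of how that check might be automated over saved syslog output follows. It is not part of the original report. The names `parse_frames` and `classify` are hypothetical, and the `ext3_`/`ext4_` symbol prefixes are an assumption about how those functions appear in a trace.

```python
import re

# Matches the "symbol+0xOFFSET/0xSIZE" part of a kernel trace entry,
# including compiler-suffixed names such as wp_page_copy.isra.100.
FRAME_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)")

# Assumption: filesystem write activity shows up as ext3_/ext4_ symbols.
FS_PREFIXES = ("ext3_", "ext4_")


def parse_frames(log_text):
    """Return (symbol, offset, size) for every frame found in log_text."""
    return [(m.group(1), int(m.group(2), 16), int(m.group(3), 16))
            for m in FRAME_RE.finditer(log_text)]


def classify(log_text):
    """Return "bad" if any frame is an ext3/ext4 function, else "good"."""
    symbols = [sym for sym, _, _ in parse_frames(log_text)]
    return "bad" if any(s.startswith(FS_PREFIXES) for s in symbols) else "good"


# A few lines taken from the attached trace, with the syslog prefix shortened.
sample = """\
kernel: pc is at process_mcheck_info+0x54/0x370
kernel: ra is at cia_machine_check+0x98/0xb0
kernel: [] wp_page_copy.isra.100+0x3c0/0x620
kernel: [] clear_page+0x18/0xc4
"""
frames = parse_frames(sample)
print(frames[0])
print(classify(sample))
```

Run against the full attachment, a check like this would only confirm what the report already states by inspection; it is mainly useful for sorting through many saved traces at once.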