Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754695AbdLOBye (ORCPT ); Thu, 14 Dec 2017 20:54:34 -0500
Received: from szxga06-in.huawei.com ([45.249.212.32]:41181 "EHLO huawei.com"
        rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP
        id S1754571AbdLOBy3 (ORCPT ); Thu, 14 Dec 2017 20:54:29 -0500
Subject: [consult the suggestion]: Avoid kernel panic when killing an application if happen RAS page table error
From: gengdongjiu
CC: James Morse, "linux-mm@kvack.org", Linux Kernel Mailing List,
        Huangshaoyu, Wuquanming, "linux-arm-kernel@lists.infradead.org"
References: <0184EA26B2509940AA629AE1405DD7F2019C8B36@DGGEMA503-MBS.china.huawei.com>
        <20171205165727.GG3070@tassilo.jf.intel.com>
        <0276f3b3-94a5-8a47-dfb7-8773cd2f99c5@huawei.com>
Message-ID: <0b7bb7b3-ae39-0c97-9c0a-af37b0701ab4@huawei.com>
Date: Fri, 15 Dec 2017 09:54:01 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 7bit
X-Originating-IP: [10.142.68.147]
X-CFilter-Loop: Reflected
To: unlisted-recipients:; (no To-header on input)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 3894
Lines: 67

Hi James/All,

If a user-space application hits a RAS error in one of its page tables,
the memory error handler (memory_failure()) does nothing except set the
poisoned-page flag, and the fault handler in arch/arm64/mm/fault.c
delivers a signal to kill the application. When the application exits,
it calls unmap_vmas() to release its VMA resources, but this touches the
faulty page table again and triggers another RAS error. As a result the
application cannot be killed and the system panics; the log is shown in
[2].

As the call chain in [1] shows, unmap_page_range() touches the faulty
page table, so the system panics. There are some simple ways to avoid
this panic without changing much in memory management:

1. Put the task into a dead state and never run it again.
2. Do not release the page tables for this task.

Of course, both methods may leak memory. Do you have a good suggestion
for how to solve this, or do you think this panic is expected behavior?

Thanks.

[1]:
get_signal()
do_group_exit()
mmput()
exit_mmap()
unmap_vmas()
unmap_single_vma()
unmap_page_range()

[2]:
[ 676.669053] Synchronous External Abort: level 0 (translation table walk) (0x82000214) at 0x0000000033ff7008
[ 676.686469] Memory failure: 0xcd4b: already hardware poisoned
[ 676.700652] Synchronous External Abort: synchronous external abort (0x96000410) at 0x0000000033ff7008
[ 676.723301] Internal error: : 96000410 [#1] PREEMPT SMP
[ 676.723616] Modules linked in: inject_memory_error(O)
[ 676.724601] CPU: 0 PID: 1506 Comm: mca-recover Tainted: G O 4.14.0-rc8-00019-g5b5c6f4-dirty #109
[ 676.724844] task: ffff80000cd41d00 task.stack: ffff000009b30000
[ 676.726616] PC is at unmap_page_range+0x78/0x6fc
[ 676.726960] LR is at unmap_single_vma+0x88/0xdc
[ 676.727122] pc : [] lr : [] pstate: 80400149
[ 676.727227] sp : ffff000009b339b0
[ 676.727348] x29: ffff000009b339b0 x28: ffff80000cd41d00
[ 676.727653] x27: 0000000000000000 x26: ffff80000cd42410
[ 676.727919] x25: ffff80000cd41d00 x24: ffff80000cd1e180
[ 676.728161] x23: ffff80000ce22300 x22: 0000000000000000
[ 676.728407] x21: ffff000009b33b28 x20: 0000000000400000
[ 676.728642] x19: ffff80000cd1e180 x18: 000000000000016d
[ 676.728875] x17: 0000000000000190 x16: 0000000000000064
[ 676.729117] x15: 0000000000000339 x14: 0000000000000000
[ 676.729344] x13: 00000000000061a8 x12: 0000000000000339
[ 676.729582] x11: 0000000000000018 x10: 0000000000000a80
[ 676.729829] x9 : ffff000009b33c60 x8 : ffff80000cd427e0
[ 676.730065] x7 : ffff000009b33de8 x6 : 00000000004a2000
[ 676.730287] x5 : 0000000000400000 x4 : ffff80000cd4b000
[ 676.730517] x3 : 00000000004a1fff x2 : 0000008000000000
[ 676.730741] x1 : 0000007fffffffff x0 : 0000008000000000
[ 676.731101] Process mca-recover (pid: 1506, stack limit = 0xffff000009b30000)
[ 676.731281] Call trace:
[ 676.734196] [] unmap_page_range+0x78/0x6fc
[ 676.734539] [] unmap_single_vma+0x88/0xdc
[ 676.734892] [] unmap_vmas+0x68/0xb4
[ 676.735456] [] exit_mmap+0x90/0x140
[ 676.736468] [] mmput+0x60/0x118
[ 676.736791] [] do_exit+0x240/0x9cc
[ 676.736997] [] do_group_exit+0x38/0x98
[ 676.737384] [] get_signal+0x1ec/0x548
[ 676.738313] [] do_signal+0x7c/0x668
[ 676.738617] [] do_notify_resume+0xcc/0x114
[ 676.740983] [] work_pending+0x8/0x10
[ 676.741360] Code: f94043a4 f9404ba2 f94037a3 d1000441 (f9400080)
[ 676.741745] ---[ end trace e42d453027313552 ]---
[ 676.804174] Fixing recursive fault but reboot is needed!
[ 677.462082] Memory failure: 0xcd4b: already hardware poisoned
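
To make option 2 above a bit more concrete, here is a rough user-space
model of a teardown walk that refuses to read a poisoned table and leaks
it instead. This is only a sketch, not kernel code: toy_table, toy_stats,
toy_unmap and ENTRIES are made-up names, and the poison flag is kept
inside the toy struct for simplicity, whereas a real implementation would
presumably have to check it in separate metadata before touching the
table memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ENTRIES 4  /* toy fan-out, not a real arm64 value */

/* Hypothetical stand-in for one page-table page; not a kernel type. */
struct toy_table {
    bool poisoned;                   /* models the hardware-poison flag */
    bool freed;                      /* set once the walk released it */
    int reads;                       /* times the walk read its entries */
    struct toy_table *next[ENTRIES]; /* lower-level tables, or NULL */
};

/* Counters so the caller can see the cost of skipping. */
struct toy_stats {
    size_t freed;   /* tables released normally */
    size_t leaked;  /* poisoned tables that were skipped */
};

/*
 * Tear down the tree rooted at t. A poisoned table is never read:
 * it is counted as one leaked table, and whatever hangs below it
 * stays allocated because the walk cannot reach it.
 */
static void toy_unmap(struct toy_table *t, struct toy_stats *st)
{
    assert(st != NULL);

    if (t == NULL)
        return;

    if (t->poisoned) {
        st->leaked++;   /* skip instead of faulting again */
        return;
    }

    t->reads++;         /* this is the access that would fault */
    for (size_t i = 0; i < ENTRIES; i++) {
        toy_unmap(t->next[i], st);
        t->next[i] = NULL;
    }

    t->freed = true;
    st->freed++;
}
```

In this model the exit path completes, at the cost of leaking the
poisoned table and everything below it, which is the memory leak
mentioned above.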