Subject: Re: [PATCH] arm64: panic on synchronous external abort in kernel context
To: James Morse
CC: Mark Rutland
References: <20200410015245.23230-1-xiexiuqi@huawei.com>
 <20200414105923.GA2486@C02TD0UTHF1T.local>
 <21997719-c521-c39a-f521-54feae16fc45@huawei.com>
 <60131aba-4e09-7824-17b2-a6fc711c150b@arm.com>
From: Xie XiuQi
Message-ID: <483792fd-5b01-8cf7-7687-abe513cd3474@huawei.com>
Date: Thu, 30 Apr 2020 17:44:00 +0800
In-Reply-To: <60131aba-4e09-7824-17b2-a6fc711c150b@arm.com>
Hi James,

Sorry for the late reply.

On 2020/4/14 22:53, James Morse wrote:
> Hi Xie,
>
> On 14/04/2020 13:39, Xie XiuQi wrote:
>> On 2020/4/14 20:16, James Morse wrote:
>>> On 14/04/2020 11:59, Mark Rutland wrote:
>>>> On Fri, Apr 10, 2020 at 09:52:45AM +0800, Xie XiuQi wrote:
>>>>> We should panic even panic_on_oops is not set, when we can't recover
>>>>> from synchronous external abort in kernel context.
>>>
>>> Hmm, fault-from-kernel-context doesn't mean the fault affects the kernel. If the kernel is
>>> reading or writing from user-space memory for a syscall, it's the user-space memory that is
>>> affected. This thread can't make progress, so we kill it.
>>> If it's a kernel thread or we were in irq context, we panic().
>>>
>>> I don't think you really want all faults that happen as a result of a kernel access to be
>>> fatal!
>
>> Yes, you're right. I just want to fix a hang hit during RAS error-injection testing.
>>
>> panic_on_oops is not set in the kernel image for testing. When receiving a SEA in kernel
>> context, the PE trap in do_exit(), and can't return any more.
>
> trap? gets trapped, (or gets stuck, to prevent confusion with the architecture's use of the
> word 'trap'!)
>
>
>> I analyze the source code, the call trace might look like this:
>> do_mem_abort
>>   -> arm64_notify_die
>>      -> die              # kernel context, call die() directly;
>>         -> do_exit       # kernel process context, call do_exit(SIGSEGV);
>>            -> do_task_dead()  # call do_task_dead(), and hung up this core;
>
> Thanks for the trace. This describes a corrected error in your I-cache, that occurred
> while the kernel was executing a kernel thread. These shouldn't be fatal, because it was
> corrected ... but the kernel doesn't know that because it doesn't know how to parse those
> records.
>
> There are two things wrong here:
> 1. it locks up while trying to kill the thread.
> 2. it tried to kill the thread in the first place!
>
> For 1, does your l1l2_inject module take any spinlocks or tinker with the pre-empt counter?
>
> I suspect this is some rarely-tested path in do_task_dead() that sleeps, but can't from
> your l1l2_inject module because the pre-empt counter has been raised.
>
> CONFIG_DEBUG_ATOMIC_SLEEP may point at the function to blame.
>
> It may be accessing some thread data that kthreads don't have, taking a second exception
> and blocking on the die_lock. LOCKDEP should catch this one.
>
> We should fix this one first.
>

I analyzed the l1l2_inject module: it uses a kworker thread to inject the error.
do_sea() tries to kill that kworker thread, so the work items queued on it are blocked.
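To make the context concrete, the part of the module that matters here is just a work item
running on a kworker, roughly like the heavily simplified sketch below. The real
lsu_inj_ue() programs implementation-specific injection registers, and whether the work is
queued with schedule_work() or on a private workqueue doesn't change the point:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/workqueue.h>

/*
 * Heavily simplified sketch of the injection path -- only the structure
 * matters here, not the real implementation.
 */
static void lsu_inj_ue(void)
{
	/*
	 * Program the injection registers, then touch the poisoned line so
	 * the synchronous external abort is taken right here, i.e. in
	 * kernel (kworker) context.
	 */
}

static void error_inject(struct work_struct *work)
{
	lsu_inj_ue();
	/* not reached if do_sea() decides to kill this kworker */
}

static DECLARE_WORK(inject_work, error_inject);

static int __init l1l2_inject_init(void)
{
	schedule_work(&inject_work);	/* runs on a system kworker */
	return 0;
}

static void __exit l1l2_inject_exit(void)
{
	cancel_work_sync(&inject_work);
}

module_init(l1l2_inject_init);
module_exit(l1l2_inject_exit);
MODULE_LICENSE("GPL");

So the SEA is taken while error_inject() runs on the kworker, and if do_sea() ends up in
do_exit() for that kworker, anything still queued behind the work item never runs.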
The panic_on_oops option is usually set in production environments, so anyone who unsets it
for testing should accept that risk themselves. Is that right?

>
> 2 is a bit more complicated. Today, this is fatal because the arch code assumes this was
> probably a memory error, and if it returns to user-space it can't avoid getting stuck in a
> loop until the scheduled memory_failure() work runs. Instead it unconditionally signals
> the process.
>
> [0] fixes this up for memory errors. But in this case it will assume all the work has been
> done by APEI, (or will be before we get back to user-space). APEI ignored the processor
> error you fed it, because it doesn't know what they are, they are just printed out.
>
> This is fine for corrected errors, but we are reliant on your firmware describing
> uncorrected errors with a 'fatal' severity... which might be too heavy a hammer. (Ideally
> that would mean 'uncontained', and the kernel should handle, or detect it can't, any other
> error...)
>
> We can discuss the right thing to do here when support for parsing these 'arm processor
> errors' is posted.
> (If you think I need to do something different in [0] because of this, please shout!)

For errors that are recoverable or already corrected, do_sea() should not kill the process
or die() unconditionally. It would be better to detect that situation first (a rough sketch
of what I mean is at the end of this mail).

>
>
>> [ 387.740609] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
>> [ 387.748837] {1}[Hardware Error]: event severity: recoverable
>> [ 387.754470] {1}[Hardware Error]: Error 0, type: recoverable
>
>> [ 387.760103] {1}[Hardware Error]: section_type: ARM processor error
>
> et voila! Case 2. Linux doesn't handle these 'ARM processor error' things, because it
> doesn't know what they are. It just prints them out.
>
>
>> [ 387.766425] {1}[Hardware Error]: MIDR: 0x00000000481fd010
>> [ 387.771972] {1}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000081080000
>> [ 387.780628] {1}[Hardware Error]: error affinity level: 0
>> [ 387.786088] {1}[Hardware Error]: running state: 0x1
>> [ 387.791115] {1}[Hardware Error]: Power State Coordination Interface state: 0
>> [ 387.798301] {1}[Hardware Error]: Error info structure 0:
>> [ 387.803761] {1}[Hardware Error]: num errors: 1
>> [ 387.808356] {1}[Hardware Error]: error_type: 0, cache error
>> [ 387.814160] {1}[Hardware Error]: error_info: 0x0000000024400017
>> [ 387.820311] {1}[Hardware Error]: transaction type: Instruction
>> [ 387.826461] {1}[Hardware Error]: operation type: Generic error (type cannot be determined)
>> [ 387.835031] {1}[Hardware Error]: cache level: 1
>
>> [ 387.839878] {1}[Hardware Error]: the error has been corrected
>
> As this is corrected, I think the bug is a deadlock somewhere in do_task_dead().
>
>
>> [ 387.845942] {1}[Hardware Error]: physical fault address: 0x00000027caf50770
>
> (and your firmware gives you the physical address, excellent, the kernel can do something
> with this!)
>
>
>> [ 388.021037] Call trace:
>> [ 388.023475]  lsu_inj_ue+0x58/0x70 [l1l2_inject]
>> [ 388.029019]  error_inject+0x64/0xb0 [l1l2_inject]
>> [ 388.033707]  process_one_work+0x158/0x4b8
>> [ 388.037699]  worker_thread+0x50/0x498
>> [ 388.041348]  kthread+0xfc/0x128
>> [ 388.044480]  ret_from_fork+0x10/0x1c
>> [ 388.048042] Code: b2790000 d519f780 f9800020 d5033f9f (58001001)
>> [ 388.054109] ---[ end trace 39d51c21b0e42ba6 ]---
>>
>> core 0 hung up at here.
>
> DEBUG_ATOMIC_SLEEP and maybe LOCKDEP should help you pin down where the kernel is getting
> stuck. It looks like a bug in the core code.
>
>
> Thanks,
>
> James
>
> [0] https://lore.kernel.org/linux-acpi/20200228174817.74278-4-james.morse@arm.com/
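To illustrate what I mean by "detect that situation" above, the shape could be something
like the sketch below. apei_claim_sea() already exists; ghes_sea_was_corrected() does not,
it is made up here only to stand for "APEI parsed the CPER record and reported the error as
corrected (or already queued for recovery)":

/*
 * Illustrative sketch only, NOT a proposed patch. ghes_sea_was_corrected()
 * is invented for this mail; the FAR-not-valid handling and other details
 * of the real do_sea() are omitted.
 */
static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
{
	const struct fault_info *inf = esr_to_fault_info(esr);

	if (apei_claim_sea(regs) == 0 && ghes_sea_was_corrected()) {
		/*
		 * Firmware-first notification covered this fault and no
		 * further action is needed: don't kill the task and don't
		 * die(), just return to the interrupted context.
		 */
		return 0;
	}

	/* Unknown or unhandled error: keep today's behaviour. */
	arm64_notify_die(inf->name, regs, inf->sig, inf->code,
			 (void __user *)addr, esr);
	return 0;
}

The only point is that do_sea() needs a yes/no answer back from APEI before it chooses
between quietly returning and killing the context it interrupted.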