Received: by 2002:a05:7412:419a:b0:f3:1519:9f41 with SMTP id i26csp2373783rdh; Sun, 26 Nov 2023 04:26:25 -0800 (PST) X-Google-Smtp-Source: AGHT+IGV2IgsIUbUrlnqqY64TJzHcZZQL3ewlciL7VAgjODUn1SU138hZ+lnQ/s7yweykrpbpOKC X-Received: by 2002:a05:6a20:8403:b0:18a:d7a8:5e5a with SMTP id c3-20020a056a20840300b0018ad7a85e5amr14565140pzd.58.1701001585037; Sun, 26 Nov 2023 04:26:25 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701001585; cv=none; d=google.com; s=arc-20160816; b=kfqtAiikIMUClhZTk4fYUl4mjyH8Ek2hhZlWLq1nJ2+21Ipq5NKOgk5keQmaNc6uaK 0TbqDFS/nQ7eX90sx7YA4NZBvofBufgMq3wRJyiBRJyK9PkKhmBuM5XIoSHo+c+slRH3 8X9Ll36Ybnf96/v6KsQqwDlWTDqAzD3C5inilbQG8IL+7AjHmcPC9KDpll/Kup1mn1DG u3NSV8FkldhCYvDRK+h4iLLp6WKoVkaPfrIMIaiGPbiIqIdX//oNnwKS2yKocRoaeHO3 90hNAu/CQ1vOJiD+hZpQILDNTJThAk1x5AJCr0yS4fdZVV1eJzng/np4fzR5ReEkYw7g jeDg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id; bh=rGv1XIeTgXzrE82QL/lEVzlB5CR7aPW/Sf1F+1TgDNk=; fh=rikEj6uffZHTMk8314CTdpswIQOdhdUpb66g2zZL8Lk=; b=j473tTWhR1guqxXMhJo/0PCvtl9LNxipYT5zqhGcyMRFLFOOLbFtWvuxbzsYWtb2HJ c9TWuhiaqo5BYexIu10TzxDM0YwpM+2IdZTNnr/N69dsQLW3KhPm7ffGxuBy02KlIlkM dGi+eAr6DE/KjgpbOgj2q73oygUErQlJ0Jg2oELts1paFHYuNF76qNaGUJuyUni/UHjJ 1JAmGue15mkSDDNmtYAkaM4z+R0YuJM3jckaePXDjwCQe6AuZv17+Gbxj+QACHn2tR5h CEeYTfyraEVT5OnxAAhLkXCCpDBLLaj0lDB4C8F/JjQ2XK7liQZrzSnip5mvwT3uBXo3 iT/A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from pete.vger.email (pete.vger.email. [2620:137:e000::3:6]) by mx.google.com with ESMTPS id b8-20020a170902d50800b001cfcd4d8f3csi136754plg.357.2023.11.26.04.26.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 26 Nov 2023 04:26:25 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) client-ip=2620:137:e000::3:6; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:6 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id B1AC48025BEE; Sun, 26 Nov 2023 04:26:21 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229456AbjKZMZs (ORCPT + 99 others); Sun, 26 Nov 2023 07:25:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52530 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229798AbjKZMZo (ORCPT ); Sun, 26 Nov 2023 07:25:44 -0500 Received: from out30-113.freemail.mail.aliyun.com (out30-113.freemail.mail.aliyun.com [115.124.30.113]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0436DDE; Sun, 26 Nov 2023 04:25:48 -0800 (PST) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R181e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045192;MF=xueshuai@linux.alibaba.com;NM=1;PH=DS;RN=34;SR=0;TI=SMTPD_---0Vx6Gb4W_1701001542; Received: from 30.27.123.85(mailfrom:xueshuai@linux.alibaba.com fp:SMTPD_---0Vx6Gb4W_1701001542) by smtp.aliyun-inc.com; Sun, 26 Nov 2023 20:25:45 +0800 Message-ID: <1048123e-b608-4db1-8d5f-456dd113d06f@linux.alibaba.com> Date: Sun, 26 Nov 2023 20:25:38 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code Content-Language: en-US To: Borislav Petkov Cc: rafael@kernel.org, wangkefeng.wang@huawei.com, tanxiaofei@huawei.com, mawupeng1@huawei.com, tony.luck@intel.com, linmiaohe@huawei.com, naoya.horiguchi@nec.com, james.morse@arm.com, gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, linux-edac@vger.kernel.org, acpica-devel@lists.linuxfoundation.org, stable@vger.kernel.org, x86@kernel.org, justin.he@arm.com, ardb@kernel.org, ying.huang@intel.com, ashish.kalra@amd.com, baolin.wang@linux.alibaba.com, tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, lenb@kernel.org, hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com, xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20231007072818.58951-1-xueshuai@linux.alibaba.com> <20231123150710.GEZV9qnkWMBWrggGc1@fat_crate.local> <9e92e600-86a4-4456-9de4-b597854b107c@linux.alibaba.com> <20231125121059.GAZWHkU27odMLns7TZ@fat_crate.local> From: Shuai Xue In-Reply-To: <20231125121059.GAZWHkU27odMLns7TZ@fat_crate.local> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-0.8 required=5.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE, UNPARSEABLE_RELAY autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Sun, 26 Nov 2023 04:26:21 -0800 (PST) On 2023/11/25 20:10, Borislav Petkov wrote: Hi, Borislav, Thank you for your reply, and sorry for the confusion I made. Please see my rely inline. Best Regards, Shuai > On Sat, Nov 25, 2023 at 02:44:52PM +0800, Shuai Xue wrote: >> - an AR error consumed by current process is deferred to handle in a >> dedicated kernel thread, but memory_failure() assumes that it runs in the >> current context > > On x86? ARM? > > Pease point to the exact code flow. An AR error consumed by current process is deferred to handle in a dedicated kernel thread on ARM platform. The AR error is handled in bellow flow: ----------------------------------------------------------------------------- [usr space task einj_mem_uc consumd data poison, CPU 3] STEP 0 ----------------------------------------------------------------------------- [ghes_sdei_critical_callback: current einj_mem_uc, CPU 3] STEP 1 ghes_sdei_critical_callback => __ghes_sdei_callback => ghes_in_nmi_queue_one_entry // peak and read estatus => irq_work_queue(&ghes_proc_irq_work) <=> ghes_proc_in_irq // irq_work [ghes_sdei_critical_callback: return] ----------------------------------------------------------------------------- [ghes_proc_in_irq: current einj_mem_uc, CPU 3] STEP 2 => ghes_do_proc => ghes_handle_memory_failure => ghes_do_memory_failure => memory_failure_queue // put work task on current CPU => if (kfifo_put(&mf_cpu->fifo, entry)) schedule_work_on(smp_processor_id(), &mf_cpu->work); => task_work_add(current, &estatus_node->task_work, TWA_RESUME); [ghes_proc_in_irq: return] ----------------------------------------------------------------------------- // kworker preempts einj_mem_uc on CPU 3 due to RESCHED flag STEP 3 [memory_failure_work_func: current kworker, CPU 3] => memory_failure_work_func(&mf_cpu->work) => while kfifo_get(&mf_cpu->fifo, &entry); // until get no work => memory_failure(entry.pfn, entry.flags); ----------------------------------------------------------------------------- [ghes_kick_task_work: current einj_mem_uc, other cpu] STEP 4 => memory_failure_queue_kick => cancel_work_sync - waiting memory_failure_work_func finish => memory_failure_work_func(&mf_cpu->work) => kfifo_get(&mf_cpu->fifo, &entry); // no work ----------------------------------------------------------------------------- [einj_mem_uc resume at the same PC, trigger a page fault STEP 5 STEP 0: A user space task, named einj_mem_uc consume a poison. The firmware notifies hardware error to kernel through is SDEI (ACPI_HEST_NOTIFY_SOFTWARE_DELEGATED). STEP 1: The swapper running on CPU 3 is interrupted. irq_work_queue() rasie a irq_work to handle hardware errors in IRQ context STEP2: In IRQ context, ghes_proc_in_irq() queues memory failure work on current CPU in workqueue and add task work to sync with the workqueue. STEP3: The kworker preempts the current running thread and get CPU 3. Then memory_failure() is processed in kworker. STEP4: ghes_kick_task_work() is called as task_work to ensure any queued workqueue has been done before returning to user-space. STEP5: Upon returning to user-space, the task einj_mem_uc resumes at the current instruction, because the poison page is unmapped by memory_failure() in step 3, so a page fault will be triggered. memory_failure() assumes that it runs in the current context on both x86 and ARM platform. for example: memory_failure() in mm/memory-failure.c: if (flags & MF_ACTION_REQUIRED) { folio = page_folio(p); res = kill_accessing_process(current, folio_pfn(folio), flags); } > >> - another page fault is not unnecessary, we can send sigbus to current >> process in the first Synchronous External Abort SEA on arm64 (analogy >> Machine Check Exception on x86) > > I have no clue what that means. What page fault? I mean page fault in step 5. We can simplify the above flow by queuing memory_failure() as a task work for AR errors in step 3 directly. > >> I just give an example that the user space process *really* relys on the >> si_code of signal to handle hardware errors > > No, don't give examples. > > Explain what the exact problem is you're seeing, in your use case, point > to the code and then state how you think it should be fixed and why. > > Right now your text is "all over the place" and I have no clue what you > even want. Ok, got it. Thank you. > >> The SIGBUS si_codes defined in include/uapi/asm-generic/siginfo.h says: >> >> /* hardware memory error consumed on a machine check: action required */ >> #define BUS_MCEERR_AR 4 >> /* hardware memory error detected in process but not consumed: action optional*/ >> #define BUS_MCEERR_AO 5 >> >> When a synchronous error is consumed by Guest, the kernel should send a >> signal with BUS_MCEERR_AR instead of BUS_MCEERR_AO. > > Can you drop this "synchronous" bla and concentrate on the error > *severity*? > > I think you want to say that there are some types of errors for which > error handling needs to happen immediately and for some reason that > doesn't happen. > > Which errors are those? Types? > > Why do you need them to be handled immediately? Well, the severity defined on x86 and ARM platform is quite different. I guess you mean taxonomy of producer error types. - X86: Software recoverable action required (SRAR) A UCR error that *requires* system software to take a recovery action on this processor *before scheduling another stream of execution on this processor*. (15.6.3 UCR Error Classification in Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3) - ARM: Recoverable state (UER) The PE determines that software *must* take action to locate and repair the error to successfully recover execution. This might be because the exception was taken before the error was architecturally consumed by the PE, at the point when the PE was not be able to make correct progress without either consuming the error or *otherwise making the state of the PE unrecoverable*. (2.3.2 PE error state classification in Arm RAS Supplement https://documentation-service.arm.com/static/63185614f72fad1903828eda) I think above two types of error need to be handled immediately. > >> Exactly. > > No, not exactly. Why is it ok to do that? What are the implications of > this? > > Is immediate killing the right decision? > > Is this ok for *every* possible kernel running out there - not only for > your use case? > > And so on and so on... > I don't have a clear answer here. I guess the poison data only effects the user space task which triggers exception. A panic is not necessary. On x86 platform, the current error handling of memory_failure() in kill_me_maybe() is just send a sigbus forcely. kill_me_maybe(): ret = memory_failure(pfn, flags); if (ret == -EHWPOISON || ret == -EOPNOTSUPP) return; pr_err("Memory error not recovered"); kill_me_now(cb); Do you have any comments or suggestion about this? I don't change x86 behavior. For arm64 platform, step 3 in above flow, memory_failure_work_func(), the call site of memory_failure(), does not handle the return code of memory_failure(). I just add the same behavior.