Received: by 2002:a05:6602:18e:0:0:0:0 with SMTP id m14csp8590669ioo; Sat, 4 Jun 2022 12:32:48 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxMqIG+AQkKFlrbO5pwUPuTpUcoN2ueNlmHOA2LOOXhvW+GF6SOwvttcLtiG81Afn1q/oyQ X-Received: by 2002:a63:2319:0:b0:3fb:fe88:5831 with SMTP id j25-20020a632319000000b003fbfe885831mr14245270pgj.85.1654371168441; Sat, 04 Jun 2022 12:32:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1654371168; cv=none; d=google.com; s=arc-20160816; b=H2Y6KqrzHPsFASnChqDN4A7L5Cx2bD7igkFdf4p5j78RkDsBPM6OYIw/A/qn2CQPin ddX8FAGknj4Rc2QIXJSNN7+FNfuH5oxoW3iFVGQg5n8pMAFhnqprZK0qEz/I4U6W+6VL U5KEwpZLJYCGARoJkmeQXsDRHm3mHqILQ2xW4Tj64k9VfjQkz5QL46BWdmpCr91LnmAY 1lCh6BW6gkGifqG5AnFkDxs0ha6WRGh9ByHwIWYLgaijfkTBLbIVNS1JLn6bFtQGOwEP G4Za61skuno4pbFNfjP8OjJHRyHnEDJsnyfQRQfrreW3p84jXDYJXkimMtVOGbpbdpMb pNEg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id; bh=AYAWon+En0hy5c6rBACZyv+Zy/FGhhMMXLvGQvadSGw=; b=plfo739Ui8VBIKeGH5AQpcNPDyJdo15oEV0QMokm8LimlK70hXSxnAfqy5zC1Vppkn IpvK3HFZ6Zga7N2zBjUtidFZh88p2R30ws8Plcm76ZMLH5cLexsAjysVt0nXzd70Bz5I UqIBxeJoxn7aAlDOVCxTMdNtq8hGeCTWbHyVWDZpjYqPlcyl+12X7BvkWoDpmzyCDV7W ZOTIA/hy9b0cswbNu9pXD7zQ3qr6REguPzkg32k6KkVl/OM+GHW5x+iJivDFAql/IsEO 33eVGQi/DMJ90ofMXF0yzDybHXhbPpI9i7+/FI84Mw9SjnAwDbfCX5jyzO3TudeM4Z4U BJtA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id u11-20020aa7838b000000b0051036547a4fsi12877492pfm.115.2022.06.04.12.32.36; Sat, 04 Jun 2022 12:32:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245468AbiFCPYJ (ORCPT + 99 others); Fri, 3 Jun 2022 11:24:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32954 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229944AbiFCPYH (ORCPT ); Fri, 3 Jun 2022 11:24:07 -0400 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 8A46038BC2 for ; Fri, 3 Jun 2022 08:24:06 -0700 (PDT) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EA63B1063; Fri, 3 Jun 2022 08:24:05 -0700 (PDT) Received: from [192.168.1.11] (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 23B4B3F766; Fri, 3 Jun 2022 08:24:03 -0700 (PDT) Message-ID: <99a207dc-93cd-1bea-2ffc-404a9f6587bf@arm.com> Date: Fri, 3 Jun 2022 17:23:33 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.8.1 Subject: Re: [aarch64] INFO: rcu_sched detected expedited stalls on CPUs/tasks Content-Language: en-US To: Bruno Goncalves , linux-arm-kernel@lists.infradead.org, LKML Cc: CKI Project , rric@kernel.org, Ionela Voinescu , Dietmar Eggemann References: From: Pierre Gondois In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-8.2 required=5.0 tests=BAYES_00,NICE_REPLY_A, RCVD_IN_DNSWL_HI,SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Bruno, This looks like something we noticed on the PCC channel of the Tx2. Here was the original message: ''' It seems there is synchronization issue on the PCC channels of the ThunderX2. Some abbreviations first. References are always to ACPI 6.4: Command Complete bit (CCb): 1 means the OS owns the PCC channel, 0 means the firmware owns the channel, cf s14.2.2 "Generic Communications Channel Status Field" Doorbell Write bits (DWb): Write a mask (just one bit in our case) to the doorbell register to notify the firmware of a message waiting in the channel, cf s14.1.4 "HW-Reduced Communications Subspace Structure (type 1)" Minimum Request Turnaround Time (MRTT): PCC channels have a 'Minimum Request Turnaround Time', being 'The minimum amount of time that OSPM must wait after the completion of a command before issuing the next command'. cf s14.1.4 "HW-Reduced Communications Subspace Structure (type 1)" The scenario that seems to cause trouble is: 1. The OS places a payload and clears the CCb bit 2. The OS rings at the doorbell (sets the DWb) 3. The firmware processes the message and then sets the CCb (the DWb seems to be still set) 4. The OS continues (the DWb seems to be still set) 5. The OS wants to send another command. The MRTT has elapsed. So the OS does 1. again. (the DWb seems to be still set) 6. The OS does 2. again, but the DWb are still set so the OS overwrites the DWb 7. The firmware finally clears the DWb. From 7.: - The OS indefinitely waits for an answer, thinking the firmware needs to answer. The timeout of this request elapses, but the channel is still assumed to belong to the firmware, so the OS never rings the doorbell again. - The firmware waits for the doorbell to ring (the DWb to be set), but the OS never rings again. This can be reproduced by running a big load (e.g. 60 tasks running at 5% of the maximum CPU capacity). PCCT tables must have been published by selecting the right option in UEFI. Doubling the MRTT (going from 5ms to 10ms) makes the synchronization issue disappears, but it means decreasing the speed of all PCC channels. ''' If you get messages such as: "PCC check channel failed for ss: XX. ret=X" then this should be the same issue. What might be happening for you is that a stall is detected while the sugov_work thread is trying to set a frequency. check_pcc_chan() waits for 500 * 3000 us (the PCC channel nominal latency for the Tx2) = 1.5s, which is quite long. Cf. the end of the original message, could you try increasing the mrtt value ? (here it is doubled) pcc_data[pcc_ss_idx]->pcc_mrtt = 2 * pcc_chan->min_turnaround_time; https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/drivers/acpi/cppc_acpi.c#L547 (for info, where the cppc dirvers waits for the mrtt to elapse) https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/drivers/acpi/cppc_acpi.c#L263 On 6/3/22 11:44, Bruno Goncalves wrote: > Hello, > > We recently started to hit this problem on some of our aarch64 > machines. The stalls can happen even during boot. > > [ 1086.949484] rcu: INFO: rcu_sched detected expedited stalls on > CPUs/tasks: { 23-... } 3 jiffies s: 3441 root: 0x2/. > [ 1086.949510] rcu: blocking rcu_node structures (internal RCU debug): > l=1:16-31:0x80/. > [ 1086.949524] Task dump for CPU 23: > [ 1086.949528] task:sugov:23 state:R running task stack: > 0 pid: 2914 ppid: 2 flags:0x0000000a > [ 1086.949543] Call trace: > [ 1086.949546] __switch_to+0x104/0x19c > [ 1086.949568] __schedule+0x410/0x67c > [ 1086.949576] schedule+0x70/0xa8 > [ 1086.949583] schedule_hrtimeout_range_clock+0x144/0x1d8 > [ 1086.949592] schedule_hrtimeout_range+0x20/0x2c > [ 1086.949598] usleep_range_state+0x5c/0x80 > [ 1086.949603] check_pcc_chan+0x7c/0xf4 > [ 1086.949615] send_pcc_cmd+0x130/0x2a8 > [ 1086.949619] cppc_set_perf+0x12c/0x22c > [ 1086.949624] cppc_cpufreq_set_target+0xf8/0x15c [cppc_cpufreq] > [ 1086.949645] __cpufreq_driver_target+0x94/0xfc > [ 1086.949658] sugov_work+0x98/0xe0 > [ 1086.949675] kthread_worker_fn+0x124/0x2b8 > [ 1086.949683] kthread+0xd4/0x558 > [ 1086.949689] ret_from_fork+0x10/0x20 > > More logs: > https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2022/06/02/553734635/redhat:553734635_aarch64/tests/Storage_block_filesystem_fio_test/12073991_aarch64_1_dmesg.log > > https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2022/06/02/553734635/redhat:553734635_aarch64/tests/Boot_test/12073991_aarch64_1_test_console.log > > CKI issue tracker: https://datawarehouse.cki-project.org/issue/1259 > > Thanks, > Bruno Goncalves >