From: Bruno Goncalves
Date: Fri, 3 Jun 2022 17:32:06 +0200
Subject: Re: [aarch64] INFO: rcu_sched detected expedited stalls on CPUs/tasks
To: Pierre Gondois
Cc: linux-arm-kernel@lists.infradead.org, LKML, CKI Project, rric@kernel.org, Ionela Voinescu, Dietmar Eggemann
In-Reply-To: <99a207dc-93cd-1bea-2ffc-404a9f6587bf@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 3 Jun 2022 at 17:24, Pierre Gondois wrote:
>
> Hello Bruno,
> This looks like something we noticed on the PCC channel of the Tx2. Here was
> the original message:
> '''
> It seems there is a synchronization issue on the PCC channels of the ThunderX2.
>
> Some abbreviations first. References are always to ACPI 6.4:
> Command Complete bit (CCb):
>   1 means the OS owns the PCC channel, 0 means the firmware owns the channel,
>   cf s14.2.2 "Generic Communications Channel Status Field"
>
> Doorbell Write bits (DWb):
>   Write a mask (just one bit in our case) to the doorbell register to notify the
>   firmware of a message waiting in the channel,
>   cf s14.1.4 "HW-Reduced Communications Subspace Structure (type 1)"
>
> Minimum Request Turnaround Time (MRTT):
>   PCC channels have a 'Minimum Request Turnaround Time', being 'The minimum
>   amount of time that OSPM must wait after the completion of a command before
>   issuing the next command'.
>   cf s14.1.4 "HW-Reduced Communications Subspace Structure (type 1)"
>
> The scenario that seems to cause trouble is:
> 1. The OS places a payload and clears the CCb bit
> 2. The OS rings at the doorbell (sets the DWb)
> 3. The firmware processes the message and then sets the CCb (the DWb seems to
>    be still set)
> 4. The OS continues (the DWb seems to be still set)
> 5. The OS wants to send another command. The MRTT has elapsed. So the OS does
>    1. again. (the DWb seems to be still set)
> 6. The OS does 2. again, but the DWb are still set so the OS overwrites the DWb
> 7. The firmware finally clears the DWb.
>
> From 7.:
> - The OS indefinitely waits for an answer, thinking the firmware needs
>   to answer. The timeout of this request elapses, but the channel is still
>   assumed to belong to the firmware, so the OS never rings the doorbell again.
> - The firmware waits for the doorbell to ring (the DWb to be set), but the
>   OS never rings again.
>
> This can be reproduced by running a big load (e.g. 60 tasks running at 5%
> of the maximum CPU capacity). PCCT tables must have been published by
> selecting the right option in UEFI.
>
> Doubling the MRTT (going from 5ms to 10ms) makes the synchronization issue
> disappear, but it means decreasing the speed of all PCC channels.
> '''
>
> If you get messages such as:
> "PCC check channel failed for ss: XX. ret=X"
> then this should be the same issue.

Thanks for your reply. In console.log we don't see the message above.

Bruno

>
> What might be happening for you is that a stall is detected while the
> sugov_work thread is trying to set a frequency. check_pcc_chan() waits for
> 500 * 3000 us (the PCC channel nominal latency for the Tx2) = 1.5s, which
> is quite long.
>
> Cf. the end of the original message, could you try increasing the mrtt value?
> (here it is doubled)
> pcc_data[pcc_ss_idx]->pcc_mrtt = 2 * pcc_chan->min_turnaround_time;
> https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/drivers/acpi/cppc_acpi.c#L547
> (for info, where the cppc driver waits for the mrtt to elapse)
> https://github.com/torvalds/linux/blob/50fd82b3a9a9335df5d50c7ddcb81c81d358c4fc/drivers/acpi/cppc_acpi.c#L263
>
> On 6/3/22 11:44, Bruno Goncalves wrote:
> > Hello,
> >
> > We recently started to hit this problem on some of our aarch64
> > machines. The stalls can happen even during boot.
> >
> > [ 1086.949484] rcu: INFO: rcu_sched detected expedited stalls on
> > CPUs/tasks: { 23-... } 3 jiffies s: 3441 root: 0x2/.
> > [ 1086.949510] rcu: blocking rcu_node structures (internal RCU debug):
> > l=1:16-31:0x80/.
> > [ 1086.949524] Task dump for CPU 23:
> > [ 1086.949528] task:sugov:23 state:R running task stack:
> > 0 pid: 2914 ppid: 2 flags:0x0000000a
> > [ 1086.949543] Call trace:
> > [ 1086.949546] __switch_to+0x104/0x19c
> > [ 1086.949568] __schedule+0x410/0x67c
> > [ 1086.949576] schedule+0x70/0xa8
> > [ 1086.949583] schedule_hrtimeout_range_clock+0x144/0x1d8
> > [ 1086.949592] schedule_hrtimeout_range+0x20/0x2c
> > [ 1086.949598] usleep_range_state+0x5c/0x80
> > [ 1086.949603] check_pcc_chan+0x7c/0xf4
> > [ 1086.949615] send_pcc_cmd+0x130/0x2a8
> > [ 1086.949619] cppc_set_perf+0x12c/0x22c
> > [ 1086.949624] cppc_cpufreq_set_target+0xf8/0x15c [cppc_cpufreq]
> > [ 1086.949645] __cpufreq_driver_target+0x94/0xfc
> > [ 1086.949658] sugov_work+0x98/0xe0
> > [ 1086.949675] kthread_worker_fn+0x124/0x2b8
> > [ 1086.949683] kthread+0xd4/0x558
> > [ 1086.949689] ret_from_fork+0x10/0x20
> >
> > More logs:
> > https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2022/06/02/553734635/redhat:553734635_aarch64/tests/Storage_block_filesystem_fio_test/12073991_aarch64_1_dmesg.log
> >
> > https://s3.us-east-1.amazonaws.com/arr-cki-prod-datawarehouse-public/datawarehouse-public/2022/06/02/553734635/redhat:553734635_aarch64/tests/Boot_test/12073991_aarch64_1_test_console.log
> >
> > CKI issue tracker: https://datawarehouse.cki-project.org/issue/1259
> >
> > Thanks,
> > Bruno Goncalves
> >
>
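
For readers following the quoted seven-step scenario, the lost doorbell ring can be replayed with a small toy model. This is only a sketch under assumptions: the class and function names (PccChannel, os_send, fw_complete, fw_clear_doorbell) are hypothetical, it is not kernel or firmware code, and the firmware behaviour (setting the CCb first and clearing the DWb later) is inferred from the description in the thread, not confirmed.

```python
from dataclasses import dataclass


@dataclass
class PccChannel:
    """Toy model of the two bits discussed in the thread (hypothetical names)."""
    ccb: int = 1  # Command Complete bit: 1 = OS owns the channel, 0 = firmware owns it
    dwb: int = 0  # Doorbell bit: 1 = a ring is pending for the firmware


def os_send(ch: PccChannel) -> None:
    """Steps 1-2: the OS places a payload, clears the CCb and rings the doorbell."""
    ch.ccb = 0
    ch.dwb = 1  # setting a bit that is already set leaves no trace of a second ring


def fw_complete(ch: PccChannel) -> None:
    """Step 3: the firmware finishes the command and hands the channel back."""
    ch.ccb = 1


def fw_clear_doorbell(ch: PccChannel) -> None:
    """Step 7: the firmware acknowledges (clears) the doorbell."""
    ch.dwb = 0


def run(events) -> PccChannel:
    """Apply the events in order to a fresh channel and return its final state."""
    ch = PccChannel()
    for event in events:
        event(ch)
    return ch


def is_deadlocked(ch: PccChannel) -> bool:
    """The OS waits for CCb == 1 while the firmware waits for DWb == 1."""
    return ch.ccb == 0 and ch.dwb == 0


# Assumed well-behaved order: the doorbell is cleared before the next command.
ORDERED = [os_send, fw_complete, fw_clear_doorbell, os_send]
# Order from the quoted scenario: the second ring lands before the late clear.
RACY = [os_send, fw_complete, os_send, fw_clear_doorbell]

if __name__ == "__main__":
    print("ordered deadlocks:", is_deadlocked(run(ORDERED)))  # False
    print("racy deadlocks:", is_deadlocked(run(RACY)))        # True
```

In this model, a longer MRTT only helps in the sense that it gives the firmware more time to reach fw_clear_doorbell before the next os_send, which matches the observation that doubling the MRTT hides the problem without removing the race.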