From: Ronny Meeus
Date: Mon, 25 Oct 2021 11:43:52 +0200
Subject: Unbounded priority inversion while assigning tasks into cgroups
To: linux-kernel@vger.kernel.org
Cc: linux-rt-users@vger.kernel.org

Hello,

An unbounded priority inversion is observed when moving tasks into cgroups. In my case I am using the cpu and cpuacct cgroups, but the issue is independent of the particular controllers.

Kernel version: 4.9.79
CPU: dual-core Cavium Octeon (MIPS)
Kernel configured with CONFIG_PREEMPT=y

I have a small application running at RT priority 92. Its job is to move high-CPU-consuming applications into a cgroup when the system is under high load.
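For illustration, moving a task into a cgroup from userspace boils down to writing its PID into the group's cgroup.procs file, which is what ends up in __cgroup_procs_write() in the kernel. Below is a minimal sketch of that step; the mount point /sys/fs/cgroup/cpu and the group name "throttled" are only examples, not the actual configuration used here.

/*
 * Sketch: move a PID into an existing cgroup v1 group by writing it to
 * that group's cgroup.procs file. Paths and group name are examples.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>

static int move_pid_to_cgroup(pid_t pid)
{
	const char *procs = "/sys/fs/cgroup/cpu/throttled/cgroup.procs";
	FILE *f = fopen(procs, "w");

	if (!f) {
		fprintf(stderr, "open %s: %s\n", procs, strerror(errno));
		return -1;
	}
	fprintf(f, "%d\n", (int)pid);
	if (fclose(f) != 0) {
		/* the buffered write is flushed here, so errors show up on close */
		fprintf(stderr, "write %s: %s\n", procs, strerror(errno));
		return -1;
	}
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	return move_pid_to_cgroup((pid_t)atoi(argv[1])) ? 1 : 0;
}

Each such write takes the write side of cgroup_threadgroup_rwsem, which is where the blocking described below occurs.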
Under extreme load (heavy script processing, i.e. frequent process clone/exec/exit, combined with a high application load), the application sometimes hangs for a long time: usually a couple of seconds, but hangs of up to 2 minutes have already been observed.

Instrumenting the kernel with traces (see below) showed that the root cause of the blocking is the global rwsem "cgroup_threadgroup_rwsem". When a task is added to a cgroup (__cgroup_procs_write), the write lock is taken, which has to wait until all other writers and readers have left their critical sections. This can take very long, especially since many of those tasks run at a much lower priority, and there are also medium-priority applications running under very high load.

As an initial attempt I applied the RT patch, but this does not resolve the issue. The second attempt was to replace cgroup_threadgroup_rwsem with an rt_mutex (which offers priority inheritance); see the sketch at the end of this mail. After this change the issue seems to be resolved. A disadvantage of this approach is that all accesses to the critical section are serialized across all cores: writes to assign tasks to cgroups as well as reads in the process clone/exec/exit paths.

For the moment I do not see any other alternative to resolve this problem. Any advice on the right way forward would be appreciated.

Best regards,
Ronny

Relevant part of the instrumented code in __cgroup_procs_write():

	trace_cgroup_lock(1000);
	percpu_down_write(&cgroup_threadgroup_rwsem);
	trace_cgroup_lock(2000);
	rcu_read_lock();

A normal trace looks like:

resource_monito-18855 [001] .... 2685.097016: cgroup_lock: idx=2
resource_monito-18855 [001] .... 2685.097017: cgroup_lock: idx=1000
resource_monito-18855 [001] .... 2685.097018: cgroup_lock: idx=2000
resource_monito-18855 [001] .... 2685.097018: cgroup_lock: idx=101

A trace of a blocked application looks like:

resource_monito-18855 [001] .... 2689.736364: cgroup_lock: idx=2
resource_monito-18855 [001] .... 2689.736365: cgroup_lock: idx=1000
resource_monito-18855 [001] .... 2693.780339: cgroup_lock: idx=2000
resource_monito-18855 [001] .... 2693.780339: cgroup_lock: idx=101

In the problematic case above, the resource_monitor application was blocked for about 4 seconds (2689.7 -> 2693.7) waiting for the write lock on cgroup_threadgroup_rwsem.
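For completeness, the rt_mutex experiment mentioned above looks roughly like the following. This is a sketch reconstructed for this mail, not the actual diff: the cgroup_procs_write_lock()/unlock() helper names are made up for illustration; in the real change the rt_mutex calls simply replace the percpu_down_read()/percpu_up_read() and percpu_down_write()/percpu_up_write() calls on cgroup_threadgroup_rwsem.

/* Sketch: replace the global percpu rwsem by a PI-aware rt_mutex. */
#include <linux/rtmutex.h>

struct task_struct;

static DEFINE_RT_MUTEX(cgroup_threadgroup_lock);

/* former "reader" side, used in the clone/exec/exit paths */
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	rt_mutex_lock(&cgroup_threadgroup_lock);
}

static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
{
	rt_mutex_unlock(&cgroup_threadgroup_lock);
}

/* former "writer" side, used from __cgroup_procs_write() */
static void cgroup_procs_write_lock(void)
{
	rt_mutex_lock(&cgroup_threadgroup_lock);
}

static void cgroup_procs_write_unlock(void)
{
	rt_mutex_unlock(&cgroup_threadgroup_lock);
}

Since an rt_mutex has a single owner, this also turns the former reader side (every clone/exec/exit) into a fully serialized path, which is exactly the serialization disadvantage mentioned above; the benefit is that the lock owner now inherits the priority of a blocked RT writer such as resource_monitor.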