From: Waiman Long
To: Tejun Heo, Zefan Li, Johannes Weiner
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, Joe Mario,
    Sebastian Jug, Yosry Ahmed, Waiman Long
Subject: [PATCH v3 1/3] cgroup/rstat: Reduce cpu_lock hold time in cgroup_rstat_flush_locked()
Date: Fri, 3 Nov 2023 23:13:01 -0400
Message-Id: <20231104031303.592879-2-longman@redhat.com>
In-Reply-To: <20231104031303.592879-1-longman@redhat.com>
References: <20231104031303.592879-1-longman@redhat.com>

When cgroup_rstat_updated() isn't being called concurrently with
cgroup_rstat_flush_locked(), its run time is pretty short. When both
are called concurrently, the cgroup_rstat_updated() run time can spike
to a pretty high value due to high cpu_lock hold time in
cgroup_rstat_flush_locked(). This can be problematic if the task calling
cgroup_rstat_updated() is a realtime task running on an isolated CPU
with a strict latency requirement. The cgroup_rstat_updated() call can
happen when there is a page fault even though the task is running in
user space most of the time.

The percpu cpu_lock is used to protect the update tree - updated_next
and updated_children. This protection is only needed when
cgroup_rstat_cpu_pop_updated() is being called. The subsequent flushing
operation, which can take a much longer time, does not need that
protection as it is already protected by cgroup_rstat_lock.

To reduce the cpu_lock hold time, we need to perform all the
cgroup_rstat_cpu_pop_updated() calls up front and release the lock
before doing any flushing. This patch adds a new
cgroup_rstat_updated_list() function to return a singly linked list of
cgroups to be flushed.

Some instrumentation code was added to measure the cpu_lock hold time,
from right after lock acquisition to right after lock release. A
parallel kernel build on a 2-socket x86-64 server was used as the
benchmarking tool for measuring the lock hold time.

The maximum cpu_lock hold times before and after the patch are 100us
and 29us respectively. So the worst case time is reduced to about 30%
of the original. However, there may be some OS or hardware noise like
NMI or SMI in the test system that can worsen the worst case value.
Such noise is usually tuned out in a real production environment to
get a better result.

OTOH, the lock hold time frequency distribution should give a better
idea of the performance benefit of the patch.
Below are the frequency distributions before and after the patch:

     Hold time        Before patch       After patch
     ---------        ------------       -----------
       0-01 us           804,139         13,738,708
      01-05 us         9,772,767          1,177,194
      05-10 us         4,595,028              4,984
      10-15 us           303,481              3,562
      15-20 us            78,971              1,314
      20-25 us            24,583                 18
      25-30 us             6,908                 12
      30-40 us             8,015
      40-50 us             2,192
      50-60 us               316
      60-70 us                43
      70-80 us                 7
      80-90 us                 2
        >90 us                 3

Signed-off-by: Waiman Long
Reviewed-by: Yosry Ahmed
---
 include/linux/cgroup-defs.h |  7 ++++++
 kernel/cgroup/rstat.c       | 43 ++++++++++++++++++++++++-------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 265da00a1a8b..ff4b4c590f32 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -491,6 +491,13 @@ struct cgroup {
 	struct cgroup_rstat_cpu __percpu *rstat_cpu;
 	struct list_head rstat_css_list;
 
+	/*
+	 * A singly-linked list of cgroup structures to be rstat flushed.
+	 * This is a scratch field to be used exclusively by
+	 * cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
+	 */
+	struct cgroup *rstat_flush_next;
+
 	/* cgroup basic resource statistics */
 	struct cgroup_base_stat last_bstat;
 	struct cgroup_base_stat bstat;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d80d7a608141..1f300bf4dc40 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -145,6 +145,32 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
 	return pos;
 }
 
+/* Return a list of updated cgroups to be flushed */
+static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+	struct cgroup *head, *tail, *next;
+	unsigned long flags;
+
+	/*
+	 * The _irqsave() is needed because cgroup_rstat_lock is
+	 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+	 * this lock with the _irq() suffix only disables interrupts on
+	 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+	 * interrupts on both configurations. The _irqsave() ensures
+	 * that interrupts are always disabled and later restored.
+	 */
+	raw_spin_lock_irqsave(cpu_lock, flags);
+	head = tail = cgroup_rstat_cpu_pop_updated(NULL, root, cpu);
+	while (tail) {
+		next = cgroup_rstat_cpu_pop_updated(tail, root, cpu);
+		tail->rstat_flush_next = next;
+		tail = next;
+	}
+	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	return head;
+}
+
 /*
  * A hook for bpf stat collectors to attach to and flush their stats.
  * Together with providing bpf kfuncs for cgroup_rstat_updated() and
@@ -179,21 +205,9 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	lockdep_assert_held(&cgroup_rstat_lock);
 
 	for_each_possible_cpu(cpu) {
-		raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock,
-						       cpu);
-		struct cgroup *pos = NULL;
-		unsigned long flags;
+		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);
 
-		/*
-		 * The _irqsave() is needed because cgroup_rstat_lock is
-		 * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-		 * this lock with the _irq() suffix only disables interrupts on
-		 * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-		 * interrupts on both configurations. The _irqsave() ensures
-		 * that interrupts are always disabled and later restored.
-		 */
-		raw_spin_lock_irqsave(cpu_lock, flags);
-		while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
+		for (; pos; pos = pos->rstat_flush_next) {
 			struct cgroup_subsys_state *css;
 
 			cgroup_base_stat_flush(pos, cpu);
@@ -205,7 +219,6 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 				css->ss->css_rstat_flush(css, cpu);
 			rcu_read_unlock();
 		}
-		raw_spin_unlock_irqrestore(cpu_lock, flags);
 
 		/* play nice and yield if necessary */
 		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
--
2.39.3
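
The locking pattern in this patch (detach the whole update list while
holding the short-lived lock, then do the slow flush with that lock
dropped) can be tried outside the kernel with a small userspace sketch.
This is only an illustration under stated assumptions, not the kernel
code: struct node, struct updated_set, mark_updated(), updated_list()
and flush_updated() are hypothetical names, a C11 atomic_flag spinlock
stands in for the per-CPU raw spinlock, a flat list stands in for the
update tree, and a single flusher is assumed (the role played by
cgroup_rstat_lock in the patch).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup; only what the sketch needs. */
struct node {
	atomic_int pending;        /* deltas not yet flushed */
	int total;                 /* flushed sum, touched only by the flusher */
	int on_list;               /* protected by updated_set.lock */
	struct node *updated_next; /* protected by updated_set.lock */
	struct node *flush_next;   /* scratch link, touched only by the flusher */
};

/* Hypothetical stand-in for the per-CPU lock plus its update tree. */
struct updated_set {
	atomic_flag lock;
	struct node *first;
};

static void set_lock(struct updated_set *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void set_unlock(struct updated_set *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Updater side: record a delta and queue the node if not already queued. */
void mark_updated(struct updated_set *s, struct node *n, int delta)
{
	assert(s && n);
	atomic_fetch_add(&n->pending, delta);
	set_lock(s);
	if (!n->on_list) {
		n->on_list = 1;
		n->updated_next = s->first;
		s->first = n;
	}
	set_unlock(s);
}

/* Remove and return one queued node, or NULL. Caller must hold s->lock. */
static struct node *pop_updated(struct updated_set *s)
{
	struct node *n = s->first;

	if (n) {
		s->first = n->updated_next;
		n->updated_next = NULL;
		n->on_list = 0;
	}
	return n;
}

/* Phase 1: do all the pops under the lock and chain them via flush_next. */
static struct node *updated_list(struct updated_set *s)
{
	struct node *head, *tail, *next;

	set_lock(s);
	head = tail = pop_updated(s);
	while (tail) {
		next = pop_updated(s);
		tail->flush_next = next;
		tail = next;
	}
	set_unlock(s);
	return head;
}

/* Phase 2: walk the private list with the lock already released. */
int flush_updated(struct updated_set *s)
{
	struct node *pos;
	int flushed = 0;

	for (pos = updated_list(s); pos; pos = pos->flush_next) {
		pos->total += atomic_exchange(&pos->pending, 0);
		flushed++;
	}
	return flushed;	/* number of nodes flushed */
}
```

A caller would initialise the set with { ATOMIC_FLAG_INIT, NULL }, call
mark_updated() from the updaters and flush_updated() from the single
flusher. With this split, an updater only ever waits for the short
list-detach step and never for the flush loop itself.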