Message-ID: <943d44f7-1fa9-8545-dc1d-890e4dd20854@arm.com>
Date: Tue, 28 Mar 2023 14:09:39 +0200
Subject: Re: [RFC PATCH] sched/fair: Make tg->load_avg per node
To: Aaron Lu, Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot
Cc: Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, Tim Chen, Nitin Tekchandani, Waiman Long, Yu Chen,
    linux-kernel@vger.kernel.org
References: <20230327053955.GA570404@ziqianlu-desk2>
From: Dietmar Eggemann
In-Reply-To: <20230327053955.GA570404@ziqianlu-desk2>

On 27/03/2023 07:39, Aaron Lu wrote:
> When using sysbench to benchmark Postgres in a single docker instance
> with sysbench's nr_threads set to nr_cpu, it is observed that at times
> update_cfs_group() and update_load_avg() show noticeable overhead on
> the cpus of one node of a 2-socket/112-core/224-cpu Intel Sapphire Rapids:
>
>     10.01%     9.86%  [kernel.vmlinux]    [k] update_cfs_group
>      7.84%     7.43%  [kernel.vmlinux]    [k] update_load_avg
>
> While the cpus of the other node normally see a lower cycle percentage:
>
>      4.46%     4.36%  [kernel.vmlinux]    [k] update_cfs_group
>      4.02%     3.40%  [kernel.vmlinux]    [k] update_load_avg
>
> Annotate shows the cycles are mostly spent on accessing tg->load_avg,
> with update_load_avg() being the write side and update_cfs_group() being
> the read side.
>
> The reason why only the cpus of one node have bigger overhead is that
> task_group is allocated on demand from a slab, and whichever cpu happens
> to do the allocation, the allocated tg will be located on that node.
> Accessing tg->load_avg therefore has a lower cost for cpus on the same
> node and a higher cost for cpus on the remote node.
>
> Tim Chen told me that PeterZ once mentioned a way to solve a similar
> problem by making a counter per node, so do the same for tg->load_avg.
> After this change, the worst numbers I saw during a 5-minute run from
> both nodes are:
>
>      2.77%     2.11%  [kernel.vmlinux]    [k] update_load_avg
>      2.72%     2.59%  [kernel.vmlinux]    [k] update_cfs_group
>
> Another observation of this workload is that it has a lot of wakeup-time
> task migrations, and that is why update_load_avg() and update_cfs_group()
> show noticeable cost. When running this workload in an N-instance setup
> (N >= 2) with sysbench's nr_threads set to nr_cpu/N, wakeup-time task
> migrations are greatly reduced and the overhead from the two functions
> above also drops a lot. It's not yet clear to me why running multiple
> instances reduces task migrations on the wakeup path.
>
> Reported-by: Nitin Tekchandani
> Signed-off-by: Aaron Lu

I'm so far not seeing this issue on my Arm64 server.

$ numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 2 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 cpus: 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node distances:
node   0   1   2   3
  0:  10  12  20  22
  1:  12  10  22  24
  2:  20  22  10  12
  3:  22  24  12  10

sysbench --table-size=100000 --tables=24 --threads=96 ...
  /usr/share/sysbench/oltp_read_write.lua run

perf report | grep kernel | head

  9.12%  sysbench  [kernel.vmlinux]  [k] _raw_spin_unlock_irqrestore
  5.26%  sysbench  [kernel.vmlinux]  [k] finish_task_switch
  1.56%  sysbench  [kernel.vmlinux]  [k] __do_softirq
  1.22%  sysbench  [kernel.vmlinux]  [k] arch_local_irq_restore
  1.12%  sysbench  [kernel.vmlinux]  [k] __arch_copy_to_user
  1.12%  sysbench  [kernel.vmlinux]  [k] el0_svc_common.constprop.1
  0.95%  sysbench  [kernel.vmlinux]  [k] __fget_light
  0.94%  sysbench  [kernel.vmlinux]  [k] rwsem_spin_on_owner
  0.85%  sysbench  [kernel.vmlinux]  [k] tcp_ack
  0.56%  sysbench  [kernel.vmlinux]  [k] do_sys_poll

Is your postgres/sysbench running in a cgroup with the cpu controller
attached? Mine isn't. Maybe I'm doing something else differently?
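
For readers who don't have the patch in front of them (it isn't quoted in
this thread), the per-node counter scheme the changelog describes could be
sketched roughly as below. This is only my reading of the changelog, written
with userspace C11 atomics instead of the kernel's atomic_long_t; every name
in it (task_group_sketch, tg_node_load, tg_load_avg_add(),
tg_load_avg_read(), NR_NODES) is made up for illustration and is not taken
from the actual patch.

/*
 * Sketch of a per-node tg->load_avg, based only on the changelog above.
 * Userspace C11 atomics stand in for the kernel's atomic_long_t; names
 * and layout are illustrative, not the actual patch.
 */
#include <stdatomic.h>

#define NR_NODES 4	/* assumption: fixed node count for the sketch */

/*
 * One counter per NUMA node. The 64-byte alignment keeps each node's
 * counter on its own cacheline so writers on different nodes don't
 * false-share; the real patch would presumably also want the storage
 * itself to be node-local.
 */
struct tg_node_load {
	_Alignas(64) _Atomic long load_avg;
};

struct task_group_sketch {
	struct tg_node_load node_load[NR_NODES];
};

/*
 * Write side (the tg->load_avg update done via update_load_avg()): only
 * the counter of the caller's own node is touched, so the dirtied
 * cacheline stays within that node.
 */
void tg_load_avg_add(struct task_group_sketch *tg, int node, long delta)
{
	atomic_fetch_add_explicit(&tg->node_load[node].load_avg, delta,
				  memory_order_relaxed);
}

/*
 * Read side (what update_cfs_group() needs): the group-wide value is
 * reconstructed by summing the per-node counters.
 */
long tg_load_avg_read(const struct task_group_sketch *tg)
{
	long sum = 0;

	for (int node = 0; node < NR_NODES; node++)
		sum += atomic_load_explicit(&tg->node_load[node].load_avg,
					    memory_order_relaxed);
	return sum;
}

The trade-off is visible in the sketch: the write side stays node-local and
O(1), while the read side becomes O(number of nodes), which the numbers in
the changelog suggest is a win because the reader no longer stalls on a
single cacheline that remote nodes keep dirtying.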