Message-ID: <0e27425e-1fb6-bc7c-9845-71dc805897c3@bytedance.com>
Date: Wed, 15 Jun 2022 18:13:12 +0800
Subject: Re: Re: [PATCH 0/5 v1] mm, oom: Introduce per numa node oom for CONSTRAINT_MEMORY_POLICY
To: Michal Hocko
Cc: akpm@linux-foundation.org, songmuchun@bytedance.com, hca@linux.ibm.com,
    gor@linux.ibm.com, agordeev@linux.ibm.com, borntraeger@linux.ibm.com,
    svens@linux.ibm.com, ebiederm@xmission.com, keescook@chromium.org,
    viro@zeniv.linux.org.uk, rostedt@goodmis.org, mingo@redhat.com,
    peterz@infradead.org, acme@kernel.org, mark.rutland@arm.com,
    alexander.shishkin@linux.intel.com, jolsa@kernel.org, namhyung@kernel.org,
    david@redhat.com, imbrenda@linux.ibm.com,
    apopple@nvidia.com, adobriyan@gmail.com, stephen.s.brennan@oracle.com,
    ohoono.kwon@samsung.com, haolee.swjtu@gmail.com, kaleshsingh@google.com,
    zhengqi.arch@bytedance.com, peterx@redhat.com, shy828301@gmail.com,
    surenb@google.com, ccross@google.com, vincent.whitchurch@axis.com,
    tglx@linutronix.de, bigeasy@linutronix.de, fenghua.yu@intel.com,
    linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-perf-users@vger.kernel.org
References: <20220512044634.63586-1-ligang.bdlg@bytedance.com>
From: Gang Li
X-Mailing-List: linux-kernel@vger.kernel.org

Hi, I've done some benchmarking in the last few days.

On 2022/5/17 00:44, Michal Hocko wrote:
> Sorry, I have only now found this email thread. The limitation of the
> NUMA constrained oom is well known and long standing. Basically the
> whole thing is a best effort as we are lacking per numa node memory
> stats. I can see that you are trying to fill up that gap but this is
> not really free. Have you measured the runtime overhead? Accounting is
> done in a very performance sensitive paths and it would be rather
> unfortunate to make everybody pay the overhead while binding to a
> specific node or sets of nodes is not the most common usecase.

## CPU consumption

According to the UnixBench results, there is less than one percent
performance loss in most cases. In the tables below, overhead is
(off - on) / on, so a negative value means the run with numastat on
scored higher.

On a 40c512g machine.

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... | Pipe     | Fork    | syscall | total   |
+----------+----------+-----+----------+---------+---------+---------+
| off      | 2920.24  | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on       | 2919.15  | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04%    | ... | -0.39%   | 2.12%   | -3.95%  | 0.28%   |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... | Pipe    | Fork   | syscall | total   |
+----------+----------+-----+---------+--------+---------+---------+
| off      | 1515.37  | ... | 1473.97 | 546.88 | 1152.37 | 1671.2  |
| on       | 1508.09  | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48%    | ... | 0.01%   | 2.68%  | 0.31%   | 0.51%   |
+----------+----------+-----+---------+--------+---------+---------+

## MEM consumption

per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 = 16 bytes

per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 = 24 bytes

zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 = 16 bytes

> Also have you tried to have a look at cpusets? Those should be easier to
> make a proper selection as it should be possible to iterate over tasks
> belonging to a specific cpuset much more easier - essentialy something
> similar to memcg oom killer. We do not do that right now and by a very
> brief look at the CONSTRAINT_CPUSET it seems that this code is not
> really doing much these days. Maybe that would be a more appropriate way
> to deal with more precise node aware oom killing?

It looks like both CONSTRAINT_MEMORY_POLICY and CONSTRAINT_CPUSET can be
used for node-aware oom killing.

I think we can calculate badness in this way:
If constraint=CONSTRAINT_MEMORY_POLICY, get badness by `nodemask`.
If constraint=CONSTRAINT_CPUSET, get badness by `mems_allowed`.

Example code:

```
long oom_badness(struct task_struct *p, struct oom_control *oc)
{
	long points = 0;
	int nid;
	...
	if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY)) {
		/* Count only pages resident on the mempolicy nodes. */
		for_each_node_mask(nid, *oc->nodemask)
			points += get_mm_counter(p->mm, -1, nid);
	} else if (unlikely(oc->constraint == CONSTRAINT_CPUSET)) {
		/* Count only pages resident on the cpuset's allowed nodes. */
		for_each_node_mask(nid, cpuset_current_mems_allowed)
			points += get_mm_counter(p->mm, -1, nid);
	} else {
		points = get_mm_rss(p->mm);
	}
	/* Swap entries and page tables are not tied to a node. */
	points += get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE)
		+ mm_pgtables_bytes(p->mm) / PAGE_SIZE;
	...
}
```

>
> [...]
>> 21 files changed, 317 insertions(+), 111 deletions(-)
>
> The code footprint is not free either. And more importantnly does this
> even work much more reliably? I can see quite some NUMA_NO_NODE
> accounting (e.g. copy_pte_range!).Is this somehow fixable?
> Also how do those numbers add up. Let's say you increase the counter as
> NUMA_NO_NODE but later on during the clean up you decrease based on the
> page node?
> Last but not least I am really not following MM_NO_TYPE concept. I can
> only see add_mm_counter users without any decrements. What is going on
> there?

NUMA_NO_NODE is used in two scenarios in this patch.

1. As a placeholder when pages are swapped in and out of the swapfile.
```
// mem to swapfile
dec_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
inc_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);

// swapfile to mem
inc_mm_counter(vma->vm_mm, MM_ANONPAGES, page_to_nid(page));
dec_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE);
```

In *_mm_counter(vma->vm_mm, MM_SWAPENTS, NUMA_NO_NODE), NUMA_NO_NODE is
a placeholder. It means the page no longer resides on any node, so the
increment and decrement of MM_SWAPENTS are both done with NUMA_NO_NODE
and stay balanced.

2. As a placeholder in `add_mm_rss_vec` and `sync_mm_rss`, for per-process
mm counter synchronization when SPLIT_RSS_COUNTING is enabled.

MM_NO_TYPE is also a placeholder in `*_mm_counter`, `add_mm_rss_vec` and
`sync_mm_rss`.

These placeholders are very strange.
Maybe I should introduce a helper function for mm->rss_stat.numa_count
accounting instead of using placeholders.
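
To make the helper idea concrete, here is a minimal userspace sketch of what separating the two kinds of accounting could look like. It is a model only, not code from the patch: every `model_*` / `MODEL_*` name is hypothetical, the node count of 2 is an assumption, and plain `long` stands in for the kernel's atomic counters. The point it illustrates is that with a dedicated per-node helper, neither call needs a NUMA_NO_NODE or MM_NO_TYPE placeholder argument.

```c
#include <assert.h>

/* Userspace model only: all names here are hypothetical, not from the patch. */
#define MODEL_MAX_NODES 2 /* assumed number of possible nodes */

enum model_counter {
	MODEL_FILEPAGES,
	MODEL_ANONPAGES,
	MODEL_SWAPENTS,
	MODEL_NR_COUNTERS
};

struct model_rss_stat {
	long count[MODEL_NR_COUNTERS];     /* per-type totals */
	long numa_count[MODEL_MAX_NODES];  /* resident pages per node */
};

/* Per-type accounting: the caller never passes a node id. */
static void model_add_counter(struct model_rss_stat *s,
			      enum model_counter member, long value)
{
	s->count[member] += value;
}

/* Per-node accounting: the caller never passes a counter type. */
static void model_add_numa_count(struct model_rss_stat *s, int nid, long value)
{
	assert(nid >= 0 && nid < MODEL_MAX_NODES);
	s->numa_count[nid] += value;
}

/* Swap out one anonymous page that was resident on node nid. */
static void model_swap_out(struct model_rss_stat *s, int nid)
{
	model_add_counter(s, MODEL_ANONPAGES, -1);
	model_add_numa_count(s, nid, -1);
	model_add_counter(s, MODEL_SWAPENTS, 1); /* swap entries have no node */
}

/* Swap one page back in; it may land on a different node. */
static void model_swap_in(struct model_rss_stat *s, int nid)
{
	model_add_counter(s, MODEL_ANONPAGES, 1);
	model_add_numa_count(s, nid, 1);
	model_add_counter(s, MODEL_SWAPENTS, -1);
}
```

In this shape the swap paths touch the per-node array only for pages that actually reside on a node, so the increments and decrements pair up without a "no node" value ever reaching the per-node counters.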