Message-ID: <0f86b2d7-d197-574b-ded9-4af1c074a658@shopee.com>
Date: Fri, 25 Nov 2022 10:14:19 +0800
Subject: Re: [PATCH] cgroup/cpuset: Optimize update_tasks_nodemask()
To: Waiman Long
Cc: lizefan.x@bytedance.com, tj@kernel.org, hannes@cmpxchg.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
References: <20221123082157.71326-1-haifeng.xu@shopee.com>
 <2ac6f207-e08a-2a7f-01ae-dfaf15eefaf6@redhat.com>
 <4de8821b-e0c0-bf63-4d76-b0ce208cce3b@shopee.com>
 <21e73dad-c6d0-21ea-dcdf-355b71c8537b@shopee.com>
 <1a997ea7-bb63-1710-14d6-c3b88a22bdb3@redhat.com>
From: Haifeng Xu
In-Reply-To: <1a997ea7-bb63-1710-14d6-c3b88a22bdb3@redhat.com>

On 2022/11/25 07:00, Waiman Long wrote:
> On 11/24/22 02:49, Haifeng Xu wrote:
>>
>> On 2022/11/24 12:24, Waiman Long wrote:
>>> On 11/23/22 22:33, Haifeng Xu wrote:
>>>> On 2022/11/24 04:23, Waiman Long wrote:
>>>>> On 11/23/22 03:21, haifeng.xu wrote:
>>>>>> When changing 'cpuset.mems' under some cgroup, the system can hang
>>>>>> for a long time. From the dmesg, many processes or threads are
>>>>>> stuck in fork/exit. The reason is shown as follows.
>>>>>>
>>>>>> thread A:
>>>>>> cpuset_write_resmask /* takes cpuset_rwsem */
>>>>>>   ...
>>>>>>     update_tasks_nodemask
>>>>>>       mpol_rebind_mm /* waits for mmap_lock */
>>>>>>
>>>>>> thread B:
>>>>>> worker_thread
>>>>>>   ...
>>>>>>     cpuset_migrate_mm_workfn
>>>>>>       do_migrate_pages /* takes mmap_lock */
>>>>>>
>>>>>> thread C:
>>>>>> cgroup_procs_write /* takes cgroup_mutex and cgroup_threadgroup_rwsem */
>>>>>>   ...
>>>>>>     cpuset_can_attach
>>>>>>       percpu_down_write /* waits for cpuset_rwsem */
>>>>>>
>>>>>> After the nodemasks of the cpuset are updated, thread A wakes up
>>>>>> thread B to migrate the mm. But while thread A iterates through all
>>>>>> tasks, including child threads and the group leader, it has to wait
>>>>>> for the mmap_lock, which has been taken by thread B. Unfortunately,
>>>>>> thread C wants to migrate tasks into the cgroup at this moment, so
>>>>>> it must wait for thread A to release cpuset_rwsem. If thread B
>>>>>> spends a long time migrating the mm, the fork/exit paths that
>>>>>> acquire cgroup_threadgroup_rwsem also have to wait for a long time.
>>>>>>
>>>>>> There is no need to migrate the mm of child threads, since it is
>>>>>> shared with the group leader. Just iterate through the group
>>>>>> leader only.
>>>>>>
>>>>>> Signed-off-by: haifeng.xu
>>>>>> ---
>>>>>>  kernel/cgroup/cpuset.c | 3 +++
>>>>>>  1 file changed, 3 insertions(+)
>>>>>>
>>>>>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>>>>>> index 589827ccda8b..43cbd09546d0 100644
>>>>>> --- a/kernel/cgroup/cpuset.c
>>>>>> +++ b/kernel/cgroup/cpuset.c
>>>>>> @@ -1968,6 +1968,9 @@ static void update_tasks_nodemask(struct cpuset *cs)
>>>>>>
>>>>>>  		cpuset_change_task_nodemask(task, &newmems);
>>>>>>
>>>>>> +		if (!thread_group_leader(task))
>>>>>> +			continue;
>>>>>> +
>>>>>>  		mm = get_task_mm(task);
>>>>>>  		if (!mm)
>>>>>>  			continue;
>>>>> Could you try the attached test patch to see if it can fix your
>>>>> problem? Something along the line of this patch will be more
>>>>> acceptable.
>>>>>
>>>>> Thanks,
>>>>> Longman
>>>>>
>>>> Hi, Longman.
>>>> Thanks for your patch, but there are still some problems.
>>>>
>>>> 1)
>>>>    (group leader, node: 0,1)
>>>>           cgroup0
>>>>           /     \
>>>>          /       \
>>>>      cgroup1   cgroup2
>>>>     (threads)  (threads)
>>>>
>>>> If we set node 0 in cgroup1 and node 1 in cgroup2, both of them will
>>>> update the mm. The nodemask of the mm then depends on whoever set
>>>> its node last.
>>> Yes, that is the existing behavior. It was not that well defined in the
>>> past and so it is somewhat ambiguous as to what we need to do about it.
>>>
>> The test patch works if the child threads are in the same cpuset as the
>> group leader, which is the same logic as my patch. But if they are in
>> different cpusets, the test patch will fail because the contention on
>> mmap_lock still exists, much as in the original logic.
>
> That is true. I am thinking about adding a nodemask to mm_struct so that
> we can figure out if we need to propagate the changes down to all the
> VMAs and do the migration. That will enable us to avoid doing wasteful
> work.
>
> Current node mask handling isn't that efficient, especially for distros
> that have a relatively large NODES_SHIFT value. Some work may also be
> needed in this area.
>
>>> BTW, cgroup v1 has a memory_migrate flag which will force page
>>> migration if set. I guess you may have it set in your case, as it will
>>> introduce a lot more delay because page migration takes time. That is
>>> probably the reason why you are seeing a long delay. So one possible
>>> solution is to turn this flag off. Cgroup v2 doesn't have this flag.
>>>
>> Do you mean 'CS_MEMORY_MIGRATE'? This flag can be turned off in cgroup
>> v1, but it is set by default in cgroup v2 (cpuset_css_alloc) and
>> can't be changed.
> You are right. Cgroup v2 has CS_MEMORY_MIGRATE enabled by default and
> it can't be turned off.
>>
>>>> 2)
>>>>     (process, node: 0,1)
>>>>           cgroup0
>>>>           /     \
>>>>          /       \
>>>>      cgroup1   cgroup2
>>>>     (node: 0)  (node: 1)
>>>>
>>>> If we migrate a thread from cgroup0 to cgroup1 or cgroup2,
>>>> cpuset_attach won't update the mm. So the nodemask of the thread,
>>>> including mems_allowed and the mempolicy (updated in
>>>> cpuset_change_task_nodemask), differs from the vm_policy in the vma
>>>> (updated in mpol_rebind_mm).
>>> Yes, that can be the case.
>>>
>>>>
>>>> In a word, if threads are in different cpusets with different
>>>> nodemasks, it will cause inconsistent memory behavior.
>>> So do you have a suggestion of what we need to do going forward?
>> Should we prevent a thread from migrating to a cgroup whose nodemask
>> differs from that of the cgroup containing the group leader?
>>
>> In addition, the group leader and child threads should be in the same
>> cgroup tree, and the cgroup containing the group leader must be at a
>> higher level than the cgroups containing the child threads, so that
>> update_tasks_nodemask() will work.
>>
>> Or should we just disable thread migration in cpuset? That's easy to
>> achieve but will affect cpu binding.
>
> As said above, my current inclination is to add a nodemask to mm_struct
> and revise the way the nodemask is being handled. That will take some
> time.
>
> Cheers,
> Longman
>
OK, thanks.
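
For concreteness, a minimal sketch of the direction described above:
cache a nodemask in mm_struct so that the expensive VMA rebind and page
migration can be skipped when an mm has already been moved to the
requested nodes. The mm_struct field and the helper below are
hypothetical illustrations, not from any posted patch, and the locking
needed to serialize updates to the cached mask is glossed over.

/* Sketch only: assumes a new nodemask_t field in mm_struct. */
struct mm_struct {
	/* ... existing fields ... */
	nodemask_t mems_allowed;	/* hypothetical: nodes the VMAs were last rebound to */
};

static void cpuset_rebind_mm_if_needed(struct mm_struct *mm,
				       nodemask_t *newmems)
{
	/*
	 * Already rebound to the same nodes: skip mpol_rebind_mm() and
	 * the subsequent migration work entirely.
	 */
	if (nodes_equal(mm->mems_allowed, *newmems))
		return;

	mm->mems_allowed = *newmems;
	mpol_rebind_mm(mm, newmems);	/* takes mmap_lock */
}

With something like this, update_tasks_nodemask() could still visit every
task, but only the first task sharing a given mm would pay for the rebind
and migration, regardless of which cpusets the threads sit in.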
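
For reference, the cgroup v1 flag discussed above is exposed as the
per-cpuset file 'cpuset.memory_migrate'; on a typical v1 mount it can be
cleared with something like
'echo 0 > /sys/fs/cgroup/cpuset/<group>/cpuset.memory_migrate' (path
assuming the conventional mount point), so that changing 'cpuset.mems'
no longer forces page migration. As noted above, cgroup v2 has no such
knob: migration on mems changes is always enabled there.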