Date: Wed, 2 Sep 2020 11:53:11 +0530
From: Bharata B Rao <bharata@linux.ibm.com>
To: Pavel Tatashin
Cc: Roman Gushchin, linux-mm@kvack.org, Andrew Morton, Michal Hocko,
    Johannes Weiner, Shakeel Butt, Vladimir Davydov, linux-kernel@vger.kernel.org,
    Kernel Team, Yafang Shao, stable, Linus Torvalds, Sasha Levin,
    Greg Kroah-Hartman
Subject: Re: [PATCH v2 00/28] The new cgroup slab memory controller
Message-ID: <20200902062311.GB52094@in.ibm.com>
Reply-To: bharata@linux.ibm.com
References: <20200127173453.2089565-1-guro@fb.com>
 <20200130020626.GA21973@in.ibm.com>
 <20200130024135.GA14994@xps.DHCP.thefacebook.com>
 <20200813000416.GA1592467@carbon.dhcp.thefacebook.com>
 <20200901052819.GA52094@in.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 01, 2020 at 08:52:05AM -0400, Pavel Tatashin wrote:
> On Tue, Sep 1, 2020 at 1:28 AM Bharata B Rao wrote:
> >
> > On Fri, Aug 28, 2020 at 12:47:03PM -0400, Pavel Tatashin wrote:
> > > There appears to be another problem that is related to the
> > > cgroup_mutex -> mem_hotplug_lock deadlock described above.
> > >
> > > In the original deadlock that I described, the workaround is to
> > > switch the crash dump from piping to the traditional save-to-files
> > > method. However, after trying this workaround, I still observed
> > > hardware watchdog resets during machine shutdown.
> > >
> > > The new problem occurs for the following reason: upon shutdown, systemd
> > > calls a service that hot-removes memory, and if hot-removing fails for
> > > some reason, systemd kills that service after a timeout. However, systemd
> > > is never able to kill the service, and we get a hardware reset caused by
> > > the watchdog, or a hang during shutdown:
> > >
> > > Thread #1: memory hot-remove systemd service
> > > Loops indefinitely, because if there is something still to be migrated,
> > > this loop never terminates. However, this loop can be terminated via a
> > > signal from systemd after a timeout.
> > > __offline_pages()
> > >   do {
> > >       pfn = scan_movable_pages(pfn, end_pfn);
> > >               # Returns 0, meaning there is nothing available to
> > >               # migrate, no page is PageLRU(page)
> > >       ...
> > >       ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn,
> > >                                   NULL, check_pages_isolated_cb);
> > >               # Returns -EBUSY, meaning there is at least one PFN that
> > >               # still has to be migrated.
> > >   } while (ret);
> > >
> > > Thread #2: css killer kthread
> > > css_killed_work_fn
> > >   cgroup_mutex                <- grabs this mutex
> > >   mem_cgroup_css_offline
> > >     memcg_offline_kmem.part
> > >       memcg_deactivate_kmem_caches
> > >         get_online_mems
> > >           mem_hotplug_lock    <- waits for Thread #1 to get read access
> > >
> > > Thread #3: systemd
> > > ksys_read
> > >   vfs_read
> > >     __vfs_read
> > >       seq_read
> > >         proc_single_show
> > >           proc_cgroup_show
> > >             mutex_lock        <- waits for cgroup_mutex owned by Thread #2
> > >
> > > Thus, thread #3 (systemd) is stuck and unable to deliver the timeout
> > > signal to thread #1.
> > >
> > > The proper fix for both problems is to avoid the cgroup_mutex ->
> > > mem_hotplug_lock ordering; this was recently fixed in mainline but is
> > > still present in all stable branches. Unfortunately, I do not see a
> > > simple way to remove mem_hotplug_lock from memcg_deactivate_kmem_caches
> > > without using Roman's series, which is too big for stable.
> >
> > We too are seeing this on Power systems when stress-testing memory
> > hotplug, but with the following call trace (from the hung task timer)
> > instead of Thread #2 above:
> >
> > __switch_to
> > __schedule
> > schedule
> > percpu_rwsem_wait
> > __percpu_down_read
> > get_online_mems
> > memcg_create_kmem_cache
> > memcg_kmem_cache_create_func
> > process_one_work
> > worker_thread
> > kthread
> > ret_from_kernel_thread
> >
> > While I understand that Roman's new slab controller patchset will fix
> > this, I also wonder whether infinitely looping in the memory unplug path
> > with mem_hotplug_lock held is the right thing to do. Earlier we had a
> > few other exit possibilities in this path (like max retries etc.), but
> > those were removed by commits:
> >
> > 72b39cfc4d75: mm, memory_hotplug: do not fail offlining too early
> > ecde0f3e7f9e: mm, memory_hotplug: remove timeout from __offline_memory
> >
> > Or is the user-space test expected to back off (via a signal) when the
> > unplug doesn't complete within a reasonable amount of time?
>
> Hi Bharata,
>
> Thank you for your input, it looks like you are experiencing the same
> problems that I observed.
>
> What I found is that our machines did not complete hot-remove within the
> given time because of this bug:
> https://lore.kernel.org/linux-mm/20200901124615.137200-1-pasha.tatashin@soleen.com
>
> Could you please try it and see if that helps for your case?

I am on an old codebase that already has the fix that you are proposing,
so I might be seeing some other issue, which I will debug further.

So it looks like the loop in __offline_pages() had a call to
drain_all_pages() before it was removed by commit c52e75935f8d ("mm:
remove extra drain pages on pcp list").

Regards,
Bharata.
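
For reference, below is a minimal user-space sketch of the dependency cycle
described in the thread above. It uses pthread locks as stand-ins for the
kernel's mem_hotplug_lock (a rwsem) and cgroup_mutex; the thread names, the
got_timeout_signal flag, and the overall structure are illustrative
assumptions, not the kernel's actual code. It compiles with
"gcc deadlock.c -lpthread" and, as intended, never finishes on its own.

    /*
     * Illustrative only: user-space model of the reported hang.
     *   hotplug_lock plays the role of mem_hotplug_lock (rwsem),
     *   cgroup_lock  plays the role of cgroup_mutex.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_rwlock_t hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;
    static pthread_mutex_t  cgroup_lock  = PTHREAD_MUTEX_INITIALIZER;
    static volatile int got_timeout_signal;  /* only "systemd" would set this */

    /* Thread #1: memory hot-remove service, the __offline_pages() loop. */
    static void *offline_pages_thread(void *arg)
    {
            (void)arg;
            pthread_rwlock_wrlock(&hotplug_lock);  /* holds mem_hotplug_lock */
            while (!got_timeout_signal)            /* loops until signalled ... */
                    usleep(1000);                  /* ... but no one can signal it */
            pthread_rwlock_unlock(&hotplug_lock);
            return NULL;
    }

    /* Thread #2: css killer, the memcg_deactivate_kmem_caches() path. */
    static void *css_killer_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&cgroup_lock);      /* grabs cgroup_mutex */
            pthread_rwlock_rdlock(&hotplug_lock);  /* blocks behind Thread #1 */
            pthread_rwlock_unlock(&hotplug_lock);
            pthread_mutex_unlock(&cgroup_lock);
            return NULL;
    }

    /* Thread #3: systemd reading /proc/<pid>/cgroup (proc_cgroup_show). */
    static void *systemd_thread(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&cgroup_lock);      /* blocks behind Thread #2 */
            got_timeout_signal = 1;                /* never reached */
            pthread_mutex_unlock(&cgroup_lock);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2, t3;

            pthread_create(&t1, NULL, offline_pages_thread, NULL);
            sleep(1);   /* let Thread #1 take hotplug_lock for write */
            pthread_create(&t2, NULL, css_killer_thread, NULL);
            sleep(1);   /* let Thread #2 take cgroup_lock */
            pthread_create(&t3, NULL, systemd_thread, NULL);

            sleep(3);
            printf("still hung: Thread #1 waits for a flag only Thread #3 can set,\n"
                   "Thread #3 waits on cgroup_lock held by Thread #2,\n"
                   "Thread #2 waits on hotplug_lock held by Thread #1\n");
            _exit(0);   /* the worker threads never terminate on their own */
    }

The point the sketch makes is the same as the report: the loop in Thread #1 is
only breakable from the outside, and the only party that could break it is
itself blocked behind the lock Thread #1 holds.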