Date: Sun, 6 Feb 2022 14:08:47 -0800 (PST)
From: David Rientjes
To: Mel Gorman
cc: Andrew Morton, Hugh Dickins, Michal Hocko, Vlastimil Babka, Rik van Riel, Linux-MM, LKML
Subject: Re: [PATCH] mm: vmscan: remove deadlock due to throttling failing to make progress
In-Reply-To: <20220203100326.GD3301@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 3 Feb 2022, Mel Gorman wrote:

> A soft lockup bug in kcompactd was reported in a private bugzilla with
> the following visible in dmesg;
>
>   [15980.045209][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 26s! [kcompactd0:479]
>   [16008.044989][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 52s! [kcompactd0:479]
>   [16036.044768][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 78s! [kcompactd0:479]
>   [16064.044548][ C33] watchdog: BUG: soft lockup - CPU#33 stuck for 104s! [kcompactd0:479]
>
> The machine had 256G of RAM with no swap and an earlier failed allocation
> indicated that node 0 where kcompactd was run was potentially
> unreclaimable;
>
>   Node 0 active_anon:29355112kB inactive_anon:2913528kB active_file:0kB
>     inactive_file:0kB unevictable:64kB isolated(anon):0kB isolated(file):0kB
>     mapped:8kB dirty:0kB writeback:0kB shmem:26780kB shmem_thp: 0kB
>     shmem_pmdmapped: 0kB anon_thp: 23480320kB writeback_tmp:0kB
>     kernel_stack:2272kB pagetables:24500kB all_unreclaimable? yes
>
> Vlastimil Babka investigated a crash dump and found that a task migrating pages
> was trying to drain PCP lists;
>
>   PID: 52922  TASK: ffff969f820e5000  CPU: 19  COMMAND: "kworker/u128:3"
>    #0 [ffffaf4e4f4c3848] __schedule at ffffffffb840116d
>    #1 [ffffaf4e4f4c3908] schedule at ffffffffb8401e81
>    #2 [ffffaf4e4f4c3918] schedule_timeout at ffffffffb84066e8
>    #3 [ffffaf4e4f4c3990] wait_for_completion at ffffffffb8403072
>    #4 [ffffaf4e4f4c39d0] __flush_work at ffffffffb7ac3e4d
>    #5 [ffffaf4e4f4c3a48] __drain_all_pages at ffffffffb7cb707c
>    #6 [ffffaf4e4f4c3a80] __alloc_pages_slowpath.constprop.114 at ffffffffb7cbd9dd
>    #7 [ffffaf4e4f4c3b60] __alloc_pages at ffffffffb7cbe4f5
>    #8 [ffffaf4e4f4c3bc0] alloc_migration_target at ffffffffb7cf329c
>    #9 [ffffaf4e4f4c3bf0] migrate_pages at ffffffffb7cf6d15
>   #10 [ffffaf4e4f4c3cb0] migrate_to_node at ffffffffb7cdb5aa
>   #11 [ffffaf4e4f4c3da8] do_migrate_pages at ffffffffb7cdcf26
>   #12 [ffffaf4e4f4c3e88] cpuset_migrate_mm_workfn at ffffffffb7b859d2
>   #13 [ffffaf4e4f4c3e98] process_one_work at ffffffffb7ac45f3
>   #14 [ffffaf4e4f4c3ed8] worker_thread at ffffffffb7ac47fd
>   #15 [ffffaf4e4f4c3f10] kthread at ffffffffb7acbdc6
>   #16 [ffffaf4e4f4c3f50] ret_from_fork at ffffffffb7a047e2
>
> The root of the problem is that kcompactd0 is not rescheduling on a CPU
> while a task that has isolated a large number of the pages from the
> LRU is waiting on kcompactd0 to reschedule so the pages can be released.
> While shrink_inactive_list() only loops once around too_many_isolated,
> reclaim can continue without rescheduling if sc->skipped_deactivate ==
> 1 which could happen if there was no file LRU and the inactive anon list
> was not low.
>
> Debugged-by: Vlastimil Babka
> Signed-off-by: Mel Gorman

Acked-by: David Rientjes
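
For readers following the quoted root-cause paragraph, here is a minimal userspace sketch of the stall, assuming a toy model of one CPU shared by two cooperative tasks. The names cpu_model, waiter_run and compactor_run are hypothetical and are not kernel symbols; the yield_when_throttled flag only stands in for "the looping task reschedules", and the sketch is not the patch under discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not kernel code: one CPU shared by two cooperative tasks. */
struct cpu_model {
    int isolated;   /* pages the waiting task has taken off the LRU */
    int limit;      /* the looping task stalls while isolated > limit */
};

/* Waiting task: puts its isolated pages back once it gets the CPU. */
static void waiter_run(struct cpu_model *m)
{
    m->isolated = 0;
}

/*
 * Looping task (plays the role of kcompactd0 in the report): retries
 * while too many pages are isolated. When yield_when_throttled is false
 * it never gives up the CPU, so the waiter cannot run and the isolated
 * count never drops.
 *
 * Returns the iteration on which it could proceed, or -1 if max_iters
 * was exhausted without progress (the modelled soft lockup).
 */
static int compactor_run(struct cpu_model *m, bool yield_when_throttled,
                         int max_iters)
{
    assert(max_iters > 0);

    for (int i = 1; i <= max_iters; i++) {
        if (m->isolated <= m->limit)
            return i;           /* isolation is back under the limit */
        if (yield_when_throttled)
            waiter_run(m);      /* stands in for rescheduling */
    }
    return -1;
}
```

With isolated = 1000 and limit = 100, compactor_run(&m, false, n) exhausts its budget and returns -1 with the isolated count unchanged, whereas compactor_run(&m, true, n) returns on its second iteration because the waiter got to run once.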