Received: by 2002:a05:6a10:6744:0:0:0:0 with SMTP id w4csp541470pxu; Wed, 7 Oct 2020 09:26:42 -0700 (PDT) X-Google-Smtp-Source: ABdhPJz1cn7VKngWiJvP2gh8Q2Y640F1Eb6GOtOADYh0EswlQu56wKGCTsQHH+jSbaXSUvSCCeiD X-Received: by 2002:a17:906:37c6:: with SMTP id o6mr4341081ejc.404.1602088002123; Wed, 07 Oct 2020 09:26:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1602088002; cv=none; d=google.com; s=arc-20160816; b=ed5bVI1OzJJT2vnVYZBronyl9zbVRU2vuytLV4WG4xB0NIBpIdQT+Ccd/JsOBkaaui BrU4W2/gPYX0RnifFtRVTtQkA3rGu3h37uH/Kd55eC7CxQnP+bwUpG8fQG95XMIyKQCL ypD8gu+LGi9+FOwxQIZGVWSfsQvk20lVAe+q1v+LGhy0j86T1mCfVboqTTayALVfqKCm nol0JoNkp4NGpJCtYK0onIPNBVzlS7WRje+pA9xda224ZfqV+DaeG99INDCJvegUWDZ9 5UvwC9FH9ZTLC2tSxxM04mTDa3wPKfmB7LobM83JFtbLOsP/bVhXDYZ5v9GZCCB42Anj N9kA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:message-id:in-reply-to:references:date:from:cc :to:subject:ironport-sdr:ironport-sdr; bh=VqJwIQTGZt//+ASbTxB6CXXqaPW2H8CCHJSc2TzD0IQ=; b=Sq41+sFHbKBtCIwwKBGI4jehvLUVAamizyVFd+8OjvhVeae0ACUJVhyjFLUY0xD5kV krdvuIwFjr8nI3VJhx8ny0m+UAUbGaO7rOH07l1Bf8duD1ridlLrXhRBx1ImlnCBuJt8 jM6TDSOCGzhgHDzBHvrRylVgUpKUIppYYz+BlyS5F6RuQS6jePnVyJWHZlKQHqbVWzjQ Xpi3zmwcEtlTbuar2cnqTv3uAA+4YbH4hwgyS/BTk2WONJwvX03+k26PtiOSwGcIoMzZ NodzFTtIGsWrVIZwTDySi3qwbl6xbsFg7sRaB4YJagTUDdu7Z8mP8D1G3OCcTnCqpsj2 2O3g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id p60si1343720edd.331.2020.10.07.09.26.18; Wed, 07 Oct 2020 09:26:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729206AbgJGQSA (ORCPT + 99 others); Wed, 7 Oct 2020 12:18:00 -0400 Received: from mga04.intel.com ([192.55.52.120]:4759 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729192AbgJGQRx (ORCPT ); Wed, 7 Oct 2020 12:17:53 -0400 IronPort-SDR: 2MyCp23HOrFGhH1CCwcew3PMuHpVmyonkNIJvuuTGchuUUte5tgOpbsHeWV8hSudtkrBF4sK93 qu1kAhugb+zw== X-IronPort-AV: E=McAfee;i="6000,8403,9767"; a="162388001" X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="162388001" X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 07 Oct 2020 09:17:52 -0700 IronPort-SDR: WAxQbRwmxaISY1f4Vhk148Twpcj9BCIoZ3gJLAYz62zznowkSaEzwtMmME3f/6PlE7LFIqZrf7 YyGTz0DxUKBQ== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.77,347,1596524400"; d="scan'208";a="517869735" Received: from viggo.jf.intel.com (HELO localhost.localdomain) ([10.54.77.144]) by fmsmga006.fm.intel.com with ESMTP; 07 Oct 2020 09:17:52 -0700 Subject: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim To: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org, Dave Hansen , yang.shi@linux.alibaba.com, rientjes@google.com, ying.huang@intel.com, dan.j.williams@intel.com From: Dave Hansen Date: Wed, 07 Oct 2020 09:17:45 -0700 References: <20201007161736.ACC6E387@viggo.jf.intel.com> In-Reply-To: <20201007161736.ACC6E387@viggo.jf.intel.com> Message-Id: <20201007161745.26B1D789@viggo.jf.intel.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Dave Hansen This is mostly derived from a patch from Yang Shi: https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/ Add code to the reclaim path (shrink_page_list()) to "demote" data to another NUMA node instead of discarding the data. This always avoids the cost of I/O needed to read the page back in and sometimes avoids the writeout cost when the pagee is dirty. A second pass through shrink_page_list() will be made if any demotions fail. This essentally falls back to normal reclaim behavior in the case that demotions fail. Previous versions of this patch may have simply failed to reclaim pages which were eligible for demotion but were unable to be demoted in practice. Note: This just adds the start of infratructure for migration. It is actually disabled next to the FIXME in migrate_demote_page_ok(). Signed-off-by: Dave Hansen Cc: Yang Shi Cc: David Rientjes Cc: Huang Ying Cc: Dan Williams -- changes from 20200730: * Add another pass through shrink_page_list() when demotion fails. --- b/include/linux/migrate.h | 2 b/mm/vmscan.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+) diff -puN include/linux/migrate.h~demote-with-migrate_pages include/linux/migrate.h --- a/include/linux/migrate.h~demote-with-migrate_pages 2020-10-07 09:15:31.028642442 -0700 +++ b/include/linux/migrate.h 2020-10-07 09:15:31.034642442 -0700 @@ -27,6 +27,7 @@ enum migrate_reason { MR_MEMPOLICY_MBIND, MR_NUMA_MISPLACED, MR_CONTIG_RANGE, + MR_DEMOTION, MR_TYPES }; @@ -196,6 +197,7 @@ struct migrate_vma { int migrate_vma_setup(struct migrate_vma *args); void migrate_vma_pages(struct migrate_vma *migrate); void migrate_vma_finalize(struct migrate_vma *migrate); +int next_demotion_node(int node); #endif /* CONFIG_MIGRATION */ diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c --- a/mm/vmscan.c~demote-with-migrate_pages 2020-10-07 09:15:31.030642442 -0700 +++ b/mm/vmscan.c 2020-10-07 09:15:31.037642442 -0700 @@ -43,6 +43,7 @@ #include #include #include +#include #include #include #include @@ -1034,6 +1035,24 @@ static enum page_references page_check_r return PAGEREF_RECLAIM; } +bool migrate_demote_page_ok(struct page *page, struct scan_control *sc) +{ + int next_nid = next_demotion_node(page_to_nid(page)); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + VM_BUG_ON_PAGE(PageHuge(page), page); + VM_BUG_ON_PAGE(PageLRU(page), page); + + if (next_nid == NUMA_NO_NODE) + return false; + if (PageTransHuge(page) && !thp_migration_supported()) + return false; + + // FIXME: actually enable this later in the series + return false; +} + + /* Check if a page is dirty or under writeback */ static void page_check_dirty_writeback(struct page *page, bool *dirty, bool *writeback) @@ -1064,6 +1083,60 @@ static void page_check_dirty_writeback(s mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } +static struct page *alloc_demote_page(struct page *page, unsigned long node) +{ + /* + * Try to fail quickly if memory on the target node is not + * available. Leaving out __GFP_IO and __GFP_FS helps with + * this. If the desintation node is full, we want kswapd to + * run there so that its pages will get reclaimed and future + * migration attempts may succeed. + */ + gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY | + __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE | + __GFP_KSWAPD_RECLAIM); + /* HugeTLB pages should not be on the LRU */ + WARN_ON_ONCE(PageHuge(page)); + + if (PageTransHuge(page)) { + struct page *thp; + + flags |= __GFP_COMP; + + thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER); + if (!thp) + return NULL; + prep_transhuge_page(thp); + return thp; + } + + return __alloc_pages_node(node, flags, 0); +} + +/* + * Take pages on @demote_list and attempt to demote them to + * another node. Pages which are not demoted are left on + * @demote_pages. + */ +static unsigned int demote_page_list(struct list_head *demote_pages, + struct pglist_data *pgdat, + struct scan_control *sc) +{ + int target_nid = next_demotion_node(pgdat->node_id); + unsigned int nr_succeeded = 0; + int err; + + if (list_empty(demote_pages)) + return 0; + + /* Demotion ignores all cpuset and mempolicy settings */ + err = migrate_pages(demote_pages, alloc_demote_page, NULL, + target_nid, MIGRATE_ASYNC, MR_DEMOTION, + &nr_succeeded); + + return nr_succeeded; +} + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -1076,12 +1149,15 @@ static unsigned int shrink_page_list(str { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); + LIST_HEAD(demote_pages); unsigned int nr_reclaimed = 0; unsigned int pgactivate = 0; + bool do_demote_pass = true; memset(stat, 0, sizeof(*stat)); cond_resched(); +retry: while (!list_empty(page_list)) { struct address_space *mapping; struct page *page; @@ -1231,6 +1307,16 @@ static unsigned int shrink_page_list(str } /* + * Before reclaiming the page, try to relocate + * its contents to another node. + */ + if (do_demote_pass && migrate_demote_page_ok(page, sc)) { + list_add(&page->lru, &demote_pages); + unlock_page(page); + continue; + } + + /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. * Lazyfree page could be freed directly @@ -1477,6 +1563,17 @@ keep: list_add(&page->lru, &ret_pages); VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page); } + /* 'page_list' is always empty here */ + + /* Migrate pages selected for demotion */ + nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc); + /* Pages that could not be demoted are still in @demote_pages */ + if (!list_empty(&demote_pages)) { + /* Pages which failed to demoted go back on on @page_list for retry: */ + list_splice_init(&demote_pages, page_list); + do_demote_pass = false; + goto retry; + } pgactivate = stat->nr_activate[0] + stat->nr_activate[1]; _