Subject: Re: [v2 PATCH -mm] mm: account deferred split THPs into MemAvailable
From: Vlastimil Babka
Date: Thu, 22 Aug 2019 14:56:56 +0200
To: Michal Hocko, kirill.shutemov@linux.intel.com, Yang Shi
Cc: hannes@cmpxchg.org, rientjes@google.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <1566410125-66011-1-git-send-email-yang.shi@linux.alibaba.com> <20190822080434.GF12785@dhcp22.suse.cz>
In-Reply-To: <20190822080434.GF12785@dhcp22.suse.cz>

On 8/22/19 10:04 AM, Michal Hocko wrote:
> On Thu 22-08-19 01:55:25, Yang Shi wrote:
>> Available memory is one of the most
>> important metrics for memory pressure.
>
> I would disagree with this statement. It is a rough estimate that tells
> how much memory you can allocate before going into a more expensive
> reclaim (mostly swapping). Allocating that amount still might result in
> direct reclaim induced stalls. I do realize that this is a simple metric
> that is attractive to use and works in many cases though.
>
>> Currently, the deferred split THPs are not accounted into
>> available memory, but they are actually reclaimable, like reclaimable
>> slabs.
>>
>> And they seem very common with common workloads when THP is
>> enabled. A simple run of the MariaDB test from mmtests with THP set to
>> "always" shows it can generate over fifteen thousand deferred split THPs
>> (accumulating around 30G over a one-hour run, 75% of the 40G memory of
>> my VM). It looks worth accounting them in MemAvailable.
>
> OK, this makes sense. But your above numbers are really worrying.
> Accumulating such a large amount of pages that are likely not going to
> be used is really bad. They are essentially blocking any higher order
> allocations and also push the system towards more memory pressure.
>
> IIUC deferred splitting is mostly a workaround for nasty locking issues
> during splitting, right? This is not really an optimization to cache
> THPs for reuse or something like that. What is the reason this is not
> done from a worker context? At least THPs which would be freed
> completely sound like a good candidate for kworker tear down, no?

Agreed that it's a good question. For Kirill :) Maybe with the kworker
approach we also wouldn't need the cgroup awareness?

>> Record the number of freeable normal pages of deferred split THPs into
>> the second tail page, and account it into KReclaimable. Although THP
>> allocations are not exactly "kernel allocations", once they are unmapped,
>> they are in fact kernel-only. KReclaimable has been accounted into
>> MemAvailable.
>
> This sounds reasonable to me.
>
>> When the deferred split THPs get split due to memory pressure or freed,
>> just decrease the counter by the recorded number.
>>
>> With this change, when running a program which populates a 1G address
>> space and then calls madvise(MADV_DONTNEED) on 511 pages of every THP,
>> /proc/meminfo shows the deferred split THPs accounted properly.
>>
>> Populated, before calling madvise(MADV_DONTNEED):
>> MemAvailable:   43531960 kB
>> AnonPages:       1096660 kB
>> KReclaimable:      26156 kB
>> AnonHugePages:   1056768 kB
>>
>> After calling madvise(MADV_DONTNEED):
>> MemAvailable:   44411164 kB
>> AnonPages:         50140 kB
>> KReclaimable:    1070640 kB
>> AnonHugePages:     10240 kB
>>
>> Suggested-by: Vlastimil Babka
>> Cc: Michal Hocko
>> Cc: Kirill A. Shutemov
>> Cc: Johannes Weiner
>> Cc: David Rientjes
>> Signed-off-by: Yang Shi

Thanks, looks like it wasn't too difficult with the 2nd tail page use :)

...

>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -524,6 +524,7 @@ void prep_transhuge_page(struct page *page)
>>
>>  	INIT_LIST_HEAD(page_deferred_list(page));
>>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> +	page[2].nr_freeable = 0;
>>  }
>>
>>  static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
>> @@ -2766,6 +2767,10 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>  	if (!list_empty(page_deferred_list(head))) {
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(head));
>> +		__mod_node_page_state(page_pgdat(page),
>> +				      NR_KERNEL_MISC_RECLAIMABLE,
>> +				      -head[2].nr_freeable);
>> +		head[2].nr_freeable = 0;
>>  	}
>>  	if (mapping)
>>  		__dec_node_page_state(page, NR_SHMEM_THPS);
>> @@ -2816,11 +2821,14 @@ void free_transhuge_page(struct page *page)
>>  		ds_queue->split_queue_len--;
>>  		list_del(page_deferred_list(page));
>>  	}
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      -page[2].nr_freeable);
>> +	page[2].nr_freeable = 0;

Wouldn't it be safer to fully tie the nr_freeable use to adding the page to the
deferred list? So here the code would go inside the if (!list_empty()) { }
block above.

>>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>  	free_compound_page(page);
>>  }
>>
>> -void deferred_split_huge_page(struct page *page)
>> +void deferred_split_huge_page(struct page *page, unsigned int nr)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>  #ifdef CONFIG_MEMCG
>> @@ -2844,6 +2852,9 @@ void deferred_split_huge_page(struct page *page)
>>  		return;
>>
>>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +	page[2].nr_freeable += nr;
>> +	__mod_node_page_state(page_pgdat(page), NR_KERNEL_MISC_RECLAIMABLE,
>> +			      nr);

Same here, only do this when adding to the list, below? Otherwise we might
account the same base pages multiple times.

>>  	if (list_empty(page_deferred_list(page))) {
>>  		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
>>  		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index e5dfe2a..6008fab 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
>>
>>  	if (nr) {
>>  		__mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
>> -		deferred_split_huge_page(page);
>> +		deferred_split_huge_page(page, nr);
>>  	}
>>  }
>>
>> @@ -1320,7 +1320,7 @@ void page_remove_rmap(struct page *page, bool compound)
>>  			clear_page_mlock(page);
>>
>>  	if (PageTransCompound(page))
>> -		deferred_split_huge_page(compound_head(page));
>> +		deferred_split_huge_page(compound_head(page), 1);
>>
>>  	/*
>>  	 * It would be tidy to reset the PageAnon mapping here,
>> --
>> 1.8.3.1
>