Date: Mon, 01 Jun 2020 14:48:43 -0700
From: Greg Thelen
To: Kirill Tkhai, Hugh Dickins, Andrew Morton, "Kirill A. Shutemov"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org
Subject: Re: [PATCH] shmem, memcg: enable memcg aware shrinker
References: <20200601032204.124624-1-gthelen@google.com>

Kirill Tkhai wrote:
> Hi, Greg,
>
> good finding. See comments below.
>
> On 01.06.2020 06:22, Greg Thelen wrote:
>> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
>> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
>> called when the per-memcg per-node shrinker_map indicates that the
>> shrinker may have objects to release to the memcg and node.
>>
>> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
>> shrinker which advertises per memcg and numa awareness. The shmem
>> shrinker releases memory by splitting hugepages that extend beyond
>> i_size.
>>
>> Shmem does not currently set bits in shrinker_map. So, starting with
>> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
>> pressure. This leads to undeserved memcg OOM kills.
>> Example that reliably sees memcg OOM kill in unpatched kernel:
>>   FS=/tmp/fs
>>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>>   mkdir -p $FS
>>   mount -t tmpfs -o huge=always nodev $FS
>>   # Create 1000 MB container, which shouldn't suffer OOM.
>>   mkdir $CONTAINER
>>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>>   echo $BASHPID >> $CONTAINER/cgroup.procs
>>   # Create 4000 files. Ideally each file uses 4k data page + a little
>>   # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
>>   # fit within container's 1000 MB. But if data pages use 2MB
>>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>>   # which hits memcg 1000 MB limit.
>>   for i in {1..4000}; do
>>     echo . > $FS/$i
>>   done
>>
>> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
>> aware") maintains the per-node per-memcg shrinker bitmap for THP
>> shrinker. But there's no such logic in shmem. Make shmem set the
>> per-memcg per-node shrinker bits when it modifies inodes to have
>> shrinkable pages.
>>
>> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
>> Cc: <stable@vger.kernel.org> # 4.19+
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>>  1 file changed, 35 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index bd8840082c94..e11090f78cb5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>  	return 0;
>>  }
>>
>> +/*
>> + * Expose inode and optional page to shrinker as having a possibly splittable
>> + * hugepage that reaches beyond i_size.
>> + */
>> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
>> +			       struct inode *inode, struct page *page)
>> +{
>> +	struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +	spin_lock(&sbinfo->shrinklist_lock);
>> +	/*
>> +	 * _careful to defend against unlocked access to ->shrink_list in
>> +	 * shmem_unused_huge_shrink()
>> +	 */
>> +	if (list_empty_careful(&info->shrinklist)) {
>> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
>> +		sbinfo->shrinklist_len++;
>> +	}
>> +	spin_unlock(&sbinfo->shrinklist_lock);
>> +
>> +#ifdef CONFIG_MEMCG
>> +	if (page && PageTransHuge(page))
>> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
>> +				       inode->i_sb->s_shrink.id);
>> +#endif
>> +}
>> +
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>  	struct inode *inode = d_inode(dentry);
>> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  			 * to shrink under memory pressure.
>>  			 */
>>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> -				spin_lock(&sbinfo->shrinklist_lock);
>> -				/*
>> -				 * _careful to defend against unlocked access to
>> -				 * ->shrink_list in shmem_unused_huge_shrink()
>> -				 */
>> -				if (list_empty_careful(&info->shrinklist)) {
>> -					list_add_tail(&info->shrinklist,
>> -							&sbinfo->shrinklist);
>> -					sbinfo->shrinklist_len++;
>> -				}
>> -				spin_unlock(&sbinfo->shrinklist_lock);
>> +				struct page *page;
>> +
>> +				page = find_get_page(inode->i_mapping,
>> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
>> +				shmem_shrinker_add(sbinfo, inode, page);
>> +				if (page)
>> +					put_page(page);
>
> 1) I'd move the PageTransHuge() check from shmem_shrinker_add() to here. In the case where
> the page is not trans huge, it looks strange and completely useless to add the inode to the
> shrinklist and then skip memcg_set_shrinker_bit(). Nothing should be added to the shrinklist
> in this case.

Ack, I'll take a look at this. A rough sketch of the caller-side check is at the end of this
mail.

> 2) In general I think this "last inode page splitter" does not fit the SHRINKER_MEMCG_AWARE
> conception, and shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware
> shrinker instead of the .nr_cached_objects callback of the generic fs shrinker.
>
> CC: Kirill Shutemov
>
> Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs
> shrinker? Are there any no-gos for converting this to a separate !memcg-aware shrinker?

Making the shmem shrinker !memcg-aware seems like it would require that tail pages beyond
i_size not be charged to memcg. Otherwise memcg pressure (which only calls memcg aware
shrinkers) won't uncharge them. Currently the entire compound page is charged.
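
For reference, here is a rough, untested sketch of what I understand the conversion in (2) to
mean: a standalone per-superblock shrinker registered without SHRINKER_MEMCG_AWARE, reusing
the existing shrinklist machinery. The sbinfo->huge_shrinker field is hypothetical (there is
no such member in struct shmem_sb_info today), and the registration placement is only
illustrative:

    /* hypothetical: add "struct shrinker huge_shrinker;" to struct shmem_sb_info */

    static unsigned long shmem_huge_shrink_count(struct shrinker *shrink,
						 struct shrink_control *sc)
    {
	    struct shmem_sb_info *sbinfo =
		    container_of(shrink, struct shmem_sb_info, huge_shrinker);

	    /* inodes with a huge page straddling i_size */
	    return READ_ONCE(sbinfo->shrinklist_len);
    }

    static unsigned long shmem_huge_shrink_scan(struct shrinker *shrink,
						struct shrink_control *sc)
    {
	    struct shmem_sb_info *sbinfo =
		    container_of(shrink, struct shmem_sb_info, huge_shrinker);

	    /* split huge pages extending beyond i_size, as today */
	    return shmem_unused_huge_shrink(sbinfo, sc, sc->nr_to_scan);
    }

    /* in shmem_fill_super(), after sbinfo is set up */
    sbinfo->huge_shrinker.count_objects = shmem_huge_shrink_count;
    sbinfo->huge_shrinker.scan_objects = shmem_huge_shrink_scan;
    sbinfo->huge_shrinker.seeks = DEFAULT_SEEKS;
    sbinfo->huge_shrinker.flags = 0;	/* i.e. not SHRINKER_MEMCG_AWARE */
    if (register_shrinker(&sbinfo->huge_shrinker))
	    goto failed;
    /* a matching unregister_shrinker() would be needed in shmem_put_super() */

That sidesteps shrinker_map entirely, but, per the above, memcg pressure would then never call
it, which only works if the tail pages beyond i_size aren't charged to the memcg in the first
place.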
>> } >> } >> } >> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, >> if (PageTransHuge(page) && >> DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) < >> hindex + HPAGE_PMD_NR - 1) { >> - /* >> - * Part of the huge page is beyond i_size: subject >> - * to shrink under memory pressure. >> - */ >> - spin_lock(&sbinfo->shrinklist_lock); >> - /* >> - * _careful to defend against unlocked access to >> - * ->shrink_list in shmem_unused_huge_shrink() >> - */ >> - if (list_empty_careful(&info->shrinklist)) { >> - list_add_tail(&info->shrinklist, >> - &sbinfo->shrinklist); >> - sbinfo->shrinklist_len++; >> - } >> - spin_unlock(&sbinfo->shrinklist_lock); >> + shmem_shrinker_add(sbinfo, inode, page); >> } >> >> /* > > Thanks, > Kirill