From: Yang Shi
Date: Mon, 1 Jun 2020 15:59:56 -0700
Subject: Re: [PATCH] shmem, memcg: enable memcg aware shrinker
To: Greg Thelen
Cc: Hugh Dickins, Andrew Morton, Kirill Tkhai, Linux MM,
    Linux Kernel Mailing List, stable@vger.kernel.org
In-Reply-To: <20200601032204.124624-1-gthelen@google.com>

On Sun, May 31, 2020 at 8:22 PM Greg Thelen wrote:
>
> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
>
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness. The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
>
> Shmem does not currently set bits in shrinker_map. So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure. This leads to undeserved memcg OOM kills.
>
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files. Ideally each file uses 4k data page + a little
>   # metadata. Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB. But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done

It looks like all the inodes which have a tail THP beyond i_size are on
one single list, and the shrinker just splits the first nr_to_scan
inodes. But since the list is not memcg aware, it seems the shrinker
may split THPs which are not charged to the victim memcg, while the
victim memcg may still suffer a premature OOM, right?

>
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker. But there's no such logic in shmem. Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
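
(For context on why a clear bit means the shmem shrinker is never
consulted: below is a condensed sketch of shrink_slab_memcg() from
mm/vmscan.c of that era. It is simplified from memory -- locking, error
paths, and the deferred-count handling are trimmed -- so read it as an
illustration, not the verbatim kernel code.)

  static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
                                         struct mem_cgroup *memcg,
                                         int priority)
  {
          struct memcg_shrinker_map *map;
          unsigned long freed = 0;
          int i;

          /* Per-memcg per-node bitmap of shrinkers worth calling. */
          map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
          if (!map)
                  return 0;

          /* Shrinkers whose bit is clear are skipped entirely. */
          for_each_set_bit(i, map->map, shrinker_nr_max) {
                  struct shrink_control sc = {
                          .gfp_mask = gfp_mask,
                          .nid = nid,
                          .memcg = memcg,
                  };
                  struct shrinker *shrinker = idr_find(&shrinker_idr, i);
                  unsigned long ret;

                  if (!shrinker)
                          continue;

                  ret = do_shrink_slab(&sc, shrinker, priority);
                  if (ret == SHRINK_EMPTY)
                          /* Nothing found: clear the bit until someone
                           * re-arms it via memcg_set_shrinker_bit(). */
                          clear_bit(i, map->map);
                  else
                          freed += ret;
          }
          return freed;
  }

Since shmem never called memcg_set_shrinker_bit(), its bit stayed clear
and the loop above never reached the shmem shrinker -- which is exactly
what the patch below addresses.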
>
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: stable@vger.kernel.org # 4.19+
> Signed-off-by: Greg Thelen
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>         return 0;
>  }
>
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +                              struct inode *inode, struct page *page)
> +{
> +       struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +       spin_lock(&sbinfo->shrinklist_lock);
> +       /*
> +        * _careful to defend against unlocked access to ->shrink_list in
> +        * shmem_unused_huge_shrink()
> +        */
> +       if (list_empty_careful(&info->shrinklist)) {
> +               list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +               sbinfo->shrinklist_len++;
> +       }
> +       spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +       if (page && PageTransHuge(page))
> +               memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +                                      inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>         struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>                          * to shrink under memory pressure.
>                          */
>                         if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -                               spin_lock(&sbinfo->shrinklist_lock);
> -                               /*
> -                                * _careful to defend against unlocked access to
> -                                * ->shrink_list in shmem_unused_huge_shrink()
> -                                */
> -                               if (list_empty_careful(&info->shrinklist)) {
> -                                       list_add_tail(&info->shrinklist,
> -                                                       &sbinfo->shrinklist);
> -                                       sbinfo->shrinklist_len++;
> -                               }
> -                               spin_unlock(&sbinfo->shrinklist_lock);
> +                               struct page *page;
> +
> +                               page = find_get_page(inode->i_mapping,
> +                                       (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +                               shmem_shrinker_add(sbinfo, inode, page);
> +                               if (page)
> +                                       put_page(page);
>                         }
>                 }
>         }
> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>                 if (PageTransHuge(page) &&
>                     DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>                                 hindex + HPAGE_PMD_NR - 1) {
> -                       /*
> -                        * Part of the huge page is beyond i_size: subject
> -                        * to shrink under memory pressure.
> -                        */
> -                       spin_lock(&sbinfo->shrinklist_lock);
> -                       /*
> -                        * _careful to defend against unlocked access to
> -                        * ->shrink_list in shmem_unused_huge_shrink()
> -                        */
> -                       if (list_empty_careful(&info->shrinklist)) {
> -                               list_add_tail(&info->shrinklist,
> -                                               &sbinfo->shrinklist);
> -                               sbinfo->shrinklist_len++;
> -                       }
> -                       spin_unlock(&sbinfo->shrinklist_lock);
> +                       shmem_shrinker_add(sbinfo, inode, page);
>                 }
>
>                 /*
> --
> 2.27.0.rc0.183.gde8f92d652-goog
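
(A side note on the index math in the shmem_setattr() hunk, since it is
easy to misread. A hypothetical worked example, assuming 4k base pages
(PAGE_SHIFT == 12) and 2M PMD-sized THP; the concrete numbers are
illustrative only, not from the patch:)

  /* HPAGE_PMD_MASK == ~(HPAGE_PMD_SIZE - 1), i.e. ~0x1fffff for 2M THP. */
  loff_t newsize = (5UL << 20) + 12288;   /* new i_size lands mid-hugepage */
  pgoff_t index = (newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT;

  /*
   * newsize & HPAGE_PMD_MASK == 0x400000 (4M): rounds down to the start
   * of the THP that the new i_size cuts through. Shifting by PAGE_SHIFT
   * gives pgoff 1024, the THP's head page. find_get_page() on that index
   * returns the head page (or NULL), which is what shmem_shrinker_add()
   * needs in order to set the shrinker bit for page->mem_cgroup on node
   * page_to_nid(page).
   */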