From: Shakeel Butt
Date: Tue, 30 Mar 2021 11:34:11 -0700
Subject: Re: [RFC PATCH 00/15] Use obj_cgroup APIs to charge the LRU pages
To: Muchun Song, Greg Thelen
Cc: Roman Gushchin, Johannes Weiner, Michal Hocko, Andrew Morton, Vladimir Davydov, LKML, Linux MM, Xiongchun Duan
In-Reply-To: <20210330101531.82752-1-songmuchun@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Mar 30, 2021 at 3:20 AM Muchun Song wrote:
>
> Since the following patchsets were applied, all kernel memory is charged
> with the new obj_cgroup APIs:
>
>   [v17,00/19] The new cgroup slab memory controller
>   [v5,0/7]    Use obj_cgroup APIs to charge kmem pages
>
> But user memory allocations (LRU pages) still pin memcgs for a long time.
> This exists at a larger scale and is causing recurring problems in the
> real world: page cache doesn't get reclaimed for a long time, or is used
> by the second, third, fourth, ... instance of the same job that was
> restarted into a new cgroup every time. Unreclaimable dying cgroups pile
> up, waste memory, and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem; the LRU pages will then no longer pin
> the memcgs.
>
> This patchset makes the LRU pages drop their reference to the memory
> cgroup by using the obj_cgroup APIs. With it applied, the number of
> dying cgroups no longer grows when running the following test script:
>
> ```bash
> #!/bin/bash
>
> cat /proc/cgroups | grep memory
>
> cd /sys/fs/cgroup/memory
>
> for i in {1..500}
> do
>     mkdir test
>     echo $$ > test/cgroup.procs
>     sleep 60 &
>     echo $$ > cgroup.procs
>     echo `cat test/cgroup.procs` > cgroup.procs
>     rmdir test
> done
>
> cat /proc/cgroups | grep memory
> ```
>
> Patch 1 fixes page charging during page replacement.
> Patches 2-5 are code cleanups and simplifications.
> Patches 6-15 convert the LRU page pins to the objcg direction.

The main concern I have with *just* reparenting LRU pages is that, for
long-running systems, the root memcg will become a dumping ground. In
addition, a job running multiple times on a machine will see inconsistent
memory usage if it re-accesses the file pages which were reparented to the
root memcg.

Please note that I do agree with the mentioned problem and we do see this
issue in our fleet.
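To make the dumping-ground concern a bit more concrete, here is a toy
userspace model of the obj_cgroup indirection as I understand it from this
series. All type and function names below are illustrative, not the actual
kernel symbols; the point is only that reparenting is cheap (the pages are
never touched) and that the surviving pages end up accounted to the parent,
ultimately the root:

```c
/*
 * Toy model of the obj_cgroup indirection -- NOT kernel code, all names
 * are made up.  An LRU "page" pins an objcg, and the objcg is the only
 * pin on the memcg.  Offlining a memcg only redirects the objcg.
 */
#include <stdio.h>

struct memcg { const char *name; };
struct objcg { struct memcg *memcg; };   /* the only reference to the memcg */
struct page  { struct objcg *objcg; };   /* what the LRU page actually pins */

static struct memcg *page_memcg(struct page *p)
{
	/* one extra hop compared to a direct page -> memcg pointer */
	return p->objcg->memcg;
}

static void reparent(struct objcg *o, struct memcg *parent)
{
	/* pages keep pointing at the same objcg; nothing per-page to walk */
	o->memcg = parent;
}

int main(void)
{
	struct memcg root = { "root" }, job = { "job-1" };
	struct objcg job_objcg = { &job };
	struct page file_page = { &job_objcg };

	printf("before offline: %s\n", page_memcg(&file_page)->name);
	reparent(&job_objcg, &root);          /* job-1's memcg goes away */
	printf("after offline:  %s\n", page_memcg(&file_page)->name);
	return 0;
}
```

In a flat hierarchy where jobs sit directly under the root, every offline
therefore shifts the job's remaining page cache onto the root.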
Internally we have a "memcg mount option" feature which couples a
filesystem with a memcg: all file pages allocated on that filesystem are
charged to that memcg. Multiple instances (concurrent or subsequent) of the
job can then use that filesystem (with a dedicated memcg) without leaving
zombies behind. I am not pushing for this solution as it comes with its own
intricacies (e.g. if the memcg coupled with a filesystem has a limit, the
oom behavior would be awkward, so internally we don't put a limit on such
memcgs), but I do want it to be part of the discussion.

I think the underlying reasons behind this issue are:

1) A filesystem shared by disjoint jobs.
2) For job-dedicated filesystems, the lifetime of the filesystem differs
   from the lifetime of the job.

For now, we have two potential solutions to the
zombies-due-to-offlined-LRU-pages problem: (1) reparenting and (2) pairing
a memcg with a filesystem. For reparenting, the cons are inconsistent
memory usage and the root memcg potentially becoming a dumping ground. For
pairing, the oom behavior is awkward, which is the same for any type of
remote charging.

I am wondering how we can resolve the cons for each. To resolve the
root-memcg-dump issue with reparenting, maybe we uncharge the page when it
reaches the root and the next accessor gets charged (a toy sketch of that
idea is at the end of this mail). For the inconsistent memory usage, either
users accept the inconsistency or we force reclaim before offline, which
reduces the benefit of sharing the filesystem with subsequent instances of
the job. For the pairing, maybe we can punt to the user/admin to not set a
limit on such memcgs to avoid awkward oom situations.

Thoughts?
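To sketch the uncharge-at-root idea: this is purely illustrative, no such
kernel interface exists today and the helper names are made up. Instead of
dumping the charge on the root when the owning memcg dies, the charge is
dropped, and whoever touches the page next pays for it:

```c
/*
 * Toy model of "uncharge when it reaches root, recharge on next access".
 * NOT an existing kernel interface; names are illustrative.
 */
#include <stddef.h>
#include <stdio.h>

struct memcg { const char *name; };
struct page  { struct memcg *memcg; };   /* NULL == currently uncharged */

static struct memcg root = { "root" };

/* Called when reparenting would otherwise land the page on the root. */
static void offline_uncharge(struct page *p)
{
	if (p->memcg == &root)
		p->memcg = NULL;          /* don't dump the charge on root */
}

/* Called on fault / page-cache access by a task in @accessor. */
static void recharge_on_access(struct page *p, struct memcg *accessor)
{
	if (!p->memcg)
		p->memcg = accessor;      /* the next user pays for the page */
}

int main(void)
{
	struct memcg job2 = { "job-2" };
	struct page cached = { &root };   /* pretend reparenting hit root */

	offline_uncharge(&cached);
	recharge_on_access(&cached, &job2);
	printf("now charged to: %s\n", cached.memcg->name);
	return 0;
}
```

The obvious costs are a window where the page is accounted to nobody and
extra work in the access path to detect and recharge such pages.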