From: Yosry Ahmed
Date: Thu, 20 Jul 2023 15:23:59 -0700
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
To: Tejun Heo
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
    Shakeel Butt, Muchun Song, Matthew Wilcox (Oracle), Zefan Li,
    Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, T.J. Mercier,
    Greg Thelen, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org
Mercier" , Greg Thelen , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_BLOCKED,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jul 20, 2023 at 3:12=E2=80=AFPM Tejun Heo wrote: > > Hello, > > On Thu, Jul 20, 2023 at 02:34:16PM -0700, Yosry Ahmed wrote: > > > Or just create a nesting layer so that there's a cgroup which represe= nts the > > > persistent resources and a nested cgroup instance inside representing= the > > > current instance. > > > > In practice it is not easy to know exactly which resources are shared > > and used by which cgroups, especially in a large dynamic environment. > > Yeah, that only covers when resource persistence is confined in a known > scope. That said, I have a hard time seeing how recharding once after cgr= oup > destruction can be a solution for the situations you describe. What if A > touches it once first, B constantly uses it but C only very occasionally = and > after A dies C ends up owning it due to timing. This is very much possibl= e > in a large dynamic environment but neither the initial or final situation= is > satisfactory. That is indeed possible, but it would be more likely that the charge is moved to B. As I said, it's not perfect, but it is an improvement over what we have today. Even if C ends up owning it, it's better than staying with the dead A. > > To solve the problems you're describing, you actually would have to > guarantee that memory pages are charged to the current majority user (or > maybe even spread across current active users). Maybe it can be argued th= at > this is a step towards that but it's a very partial step and at least wou= ld > need a technically viable direction that this development can follow. Right, that would be a much larger effort (arguably memcg v3 ;) ). This proposal is focused on the painful artifact of the sharing/sticky resources problem: zombie memcgs. We can extend the automatic charge movement semantics later to cover more cases or be smarter, or ditch the existing charging semantics completely and start over with sharing/stickiness in mind. Either way, that would be a long-term effort. There is a problem that exists today though that ideally can be fixed/improved by this proposal. > > On its own, AFAICS, I'm not sure the scope of problems it can actually so= lve > is justifiably greater than what can be achieved with simple nesting. In our use case nesting is not a viable option. As I said, in a large fleet where a lot of different workloads are dynamically being scheduled on different machines, and where there is no way of knowing what resources are being shared among what workloads, and even if we do, it wouldn't be constant, it's very difficult to construct the hierarchy with nesting to keep the resources confined. Keep in mind that the environment is dynamic, workloads are constantly coming and going. Even if find the perfect nesting to appropriately scope resources, some rescheduling may render the hierarchy obsolete and require us to start over. > > Thanks. > > -- > tejun