Subject: Re: [PATCH v4 2/5] zswap: make shrinking memcg-aware
From: Muchun Song
Date: Thu, 2 Nov 2023 10:17:03 +0800
To: Nhat Pham
Cc: Andrew Morton, Johannes Weiner, cerasuolodomenico@gmail.com,
 Yosry Ahmed, sjenning@redhat.com, ddstreet@ieee.org,
 vitaly.wool@konsulko.com, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Chris Li, Linux-MM, kernel-team@meta.com, LKML
References: <20231024203302.1920362-3-nphamcs@gmail.com>
 <20231101012614.186996-1-nphamcs@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

> On Nov 2, 2023, at 01:44, Nhat Pham wrote:
>
> On Tue, Oct 31, 2023 at 8:07 PM Muchun Song wrote:
>>
>>
>>
>>> On Nov 1, 2023, at 09:26, Nhat Pham wrote:
>>>
>>> cc-ing Johannes, Roman, Shakeel, Muchun since you all know much more
>>> about memory controller + list_lru reparenting logic than me.
>>>
>>> There seems to be a race between memcg offlining and zswap's
>>> cgroup-aware LRU implementation:
>>>
>>> CPU0                            CPU1
>>> zswap_lru_add()                 mem_cgroup_css_offline()
>>>   get_mem_cgroup_from_objcg()
>>>                                   memcg_offline_kmem()
>>>                                     memcg_reparent_objcgs()
>>>                                     memcg_reparent_list_lrus()
>>>                                       memcg_reparent_list_lru()
>>>                                         memcg_reparent_list_lru_node()
>>>   list_lru_add()
>>>                                       memcg_list_lru_free()
>>>
>>>
>>> Essentially: on CPU0, zswap gets the memcg from the entry's objcg
>>> (before the objcgs are reparented). Then it performs list_lru_add()
>>> after the list_lru entries reparenting (memcg_reparent_list_lru_node())
>>> step. If the list_lru of the memcg being offlined has not been freed
>>> (i.e., before the memcg_list_lru_free() call), then the list_lru_add()
>>> call would succeed - but the list will be freed soon after. The new
>>
>> No worries. list_lru_add() will add the object to the lru list of
>> the parent of the memcg being offlined, because the ->kmemcg_id of the
>> memcg being offlined will be changed to its parent's ->kmemcg_id
>> before memcg_reparent_list_lru().
>>
>
> Ohhh that is subtle. Thanks for pointing this out, Muchun!
>
> In that case, I think Yosry is right after all! We don't even need to get
> a reference to the memcg:
>
> rcu_read_lock();
> memcg = obj_cgroup_memcg(objcg);
> list_lru_add();
> rcu_read_unlock();
>
> As long as we're inside this rcu section, we're guaranteed to get
> an un-freed memcg. Now it could be offlined etc., but as Muchun has

Right.

Thanks.

> pointed out, the list_lru_add() call will still do the right thing - it
> will either add the new entry to the parent list if this happens after the
> kmemcg_id update, or the child list before the list_lru reparenting
> action. Both of these scenarios are fine.
>
>> Muchun,
>> Thanks
>>
>>> zswap entry as a result will not be subjected to future reclaim
>>> attempt. IOW, this list_lru_add() call is effectively swallowed.
>>> And worse, there might be a crash when we invalidate the zswap_entry in
>>> the future (which will perform a list_lru removal).
>>>
>>> Within get_mem_cgroup_from_objcg(), none of the following seems
>>> sufficient to prevent this race:
>>>
>>> 1. Performing the objcg-to-memcg lookup inside an rcu_read_lock()
>>>    section.
>>> 2. Checking if the memcg is freed yet (with css_tryget()) (what
>>>    we're currently doing in this patch series).
>>> 3. Checking if the memcg is still online (with css_tryget_online()).
>>>    The memcg can still be offlined down the line.
>>>
>>>
>>> I've discussed this privately with Johannes, and it seems like the
>>> cleanest solution here is to move the reparenting logic down to the
>>> release stage. That way, when get_mem_cgroup_from_objcg() returns,
>>> zswap_lru_add() is given a memcg that is reparenting-safe (until we
>>> drop the obtained reference).
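
For readers following the thread, the kmemcg_id point above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: every toy_* name is hypothetical, each per-memcg list is reduced to an item counter, and the locking and RCU that the real list_lru code relies on are left out entirely. It only shows why an add lands in a list that survives reparenting, whether it happens before or after the kmemcg_id update.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 8

/* Hypothetical stand-in for a memcg: only the fields this sketch needs. */
struct toy_memcg {
	int kmemcg_id;              /* selects which per-memcg list is used */
	struct toy_memcg *parent;
};

/* Hypothetical stand-in for a memcg-aware list_lru: one counter per id. */
struct toy_list_lru {
	long nr_items[TOY_MAX_IDS];
};

/* Add one item to the list selected by the memcg's current kmemcg_id. */
static void toy_list_lru_add(struct toy_list_lru *lru,
			     const struct toy_memcg *memcg)
{
	lru->nr_items[memcg->kmemcg_id]++;
}

/* Offline step 1: point the child's id at the parent's list. */
static void toy_redirect_to_parent(struct toy_memcg *memcg)
{
	memcg->kmemcg_id = memcg->parent->kmemcg_id;
}

/* Offline step 2: move everything from the child's old list to the parent. */
static void toy_reparent_list(struct toy_list_lru *lru, int src, int dst)
{
	lru->nr_items[dst] += lru->nr_items[src];
	lru->nr_items[src] = 0;
}

/*
 * Run one add against an offlining child that already holds 3 items.
 * Returns the parent's item count, or -1 if the child's old list is
 * not empty at the end (i.e. an item was left behind).
 */
long toy_demo(int add_before_redirect)
{
	struct toy_memcg parent = { .kmemcg_id = 0, .parent = NULL };
	struct toy_memcg child = { .kmemcg_id = 1, .parent = &parent };
	struct toy_list_lru lru = { .nr_items = { 0 } };
	int src = child.kmemcg_id;  /* remember the old id before it changes */

	lru.nr_items[src] = 3;

	if (add_before_redirect)
		toy_list_lru_add(&lru, &child);   /* lands in the child list */

	toy_redirect_to_parent(&child);

	if (!add_before_redirect)
		toy_list_lru_add(&lru, &child);   /* lands in the parent list */

	toy_reparent_list(&lru, src, parent.kmemcg_id);

	return lru.nr_items[src] == 0 ? lru.nr_items[parent.kmemcg_id] : -1;
}
```

In this model both orderings end with every item on the parent's list. The one ordering the model does not cover is an add that reads the old id and then runs after the splice; the sketch assumes that cannot happen, and ruling it out is the job of the locking omitted here.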