From: Yosry Ahmed
Date: Mon, 25 Mar 2024 11:35:06 -0700
Subject: Re: [PATCH] mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
To: Chengming Zhou
References: <20240324210447.956973-1-hannes@cmpxchg.org> <1e7ce417-b9dd-4d62-9f54-0adf1ccdae35@linux.dev> <339639f3-4334-44a8-a811-d20b3c578f74@linux.dev>
In-Reply-To: <339639f3-4334-44a8-a811-d20b3c578f74@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org
Cc: Barry Song 
<21cnbao@gmail.com>, Johannes Weiner, Andrew Morton, Zhongkun He, Chengming Zhou, Chris Li, Nhat Pham, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Kairui Song, Minchan Kim
Content-Type: text/plain; charset="UTF-8"

On Mon, Mar 25, 2024 at 2:46 AM Chengming Zhou wrote:
>
> On 2024/3/25 17:40, Yosry Ahmed wrote:
> > On Mon, Mar 25, 2024 at 2:22 AM Chengming Zhou wrote:
> >>
> >> On 2024/3/25 16:38, Yosry Ahmed wrote:
> >>> On Mon, Mar 25, 2024 at 12:33 AM Chengming Zhou wrote:
> >>>>
> >>>> On 2024/3/25 15:06, Yosry Ahmed wrote:
> >>>>> On Sun, Mar 24, 2024 at 9:54 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> On Mon, Mar 25, 2024 at 10:23 AM Yosry Ahmed wrote:
> >>>>>>>
> >>>>>>> On Sun, Mar 24, 2024 at 2:04 PM Johannes Weiner wrote:
> >>>>>>>>
> >>>>>>>> Zhongkun He reports data corruption when combining zswap with zram.
> >>>>>>>>
> >>>>>>>> The issue is the exclusive loads we're doing in zswap. They assume
> >>>>>>>> that all reads are going into the swapcache, which can assume
> >>>>>>>> authoritative ownership of the data and so the zswap copy can go.
> >>>>>>>>
> >>>>>>>> However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try
> >>>>>>>> to bypass the swapcache. This results in an optimistic read of the
> >>>>>>>> swap data into a page that will be dismissed if the fault fails due to
> >>>>>>>> races. In this case, zswap mustn't drop its authoritative copy.
> >>>>>>>>
> >>>>>>>> Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
> >>>>>>>> Reported-by: Zhongkun He
> >>>>>>>> Fixes: b9c91c43412f ("mm: zswap: support exclusive loads")
> >>>>>>>> Cc: stable@vger.kernel.org [6.5+]
> >>>>>>>> Signed-off-by: Johannes Weiner
> >>>>>>>> Tested-by: Zhongkun He
> >>>>>>
> >>>>>> Acked-by: Barry Song
> >>>>>>
> >>>>>>>
> >>>>>>> Do we also want to mention somewhere (commit log or comment) that
> >>>>>>> keeping the entry in the tree is fine because we are still protected
> >>>>>>> from concurrent loads/invalidations/writeback by swapcache_prepare()
> >>>>>>> setting SWAP_HAS_CACHE or so?
> >>>>>>
> >>>>>> It seems that Kairui's patch comprehensively addresses the issue at hand.
> >>>>>> Johannes's solution, on the other hand, appears to align zswap behavior
> >>>>>> more closely with that of a traditional swap device, only releasing an entry
> >>>>>> when the corresponding swap slot is freed, particularly in the sync-io case.
> >>>>>
> >>>>> It actually worked out quite well that Kairui's fix landed shortly
> >>>>> before this bug was reported, as this fix wouldn't have been possible
> >>>>> without it as far as I can tell.
> >>>>>
> >>>>>>
> >>>>>> Johannes' patch has inspired me to consider whether zRAM could achieve
> >>>>>> a comparable outcome by immediately releasing objects in swap cache
> >>>>>> scenarios. When I have the opportunity, I plan to experiment with zRAM.
> >>>>>
> >>>>> That would be interesting. I am curious if it would be as
> >>>>> straightforward in zram to just mark the folio as dirty in this case
> >>>>> like zswap does, given its implementation as a block device.
> >>>>>
> >>>>
> >>>> This makes me wonder who is responsible for marking the folio dirty in this
> >>>> swapcache bypass case? Should we call folio_mark_dirty() after the
> >>>> swap_read_folio()?
> >>>
> >>> In shrink_folio_list(), we try to add anonymous folios to the
> >>> swapcache if they are not there before checking if they are dirty.
> >>> add_to_swap() calls folio_mark_dirty(), so this should take care of
> >>
> >> Right, thanks for your clarification, so there should be no problem here.
> >> Although it was a fix just for the MADV_FREE case.
> >>
> >>> it. There is an interesting comment there though. It says that the PTE
> >>> should be dirty, so unmapping the folio should have already marked it
> >>> as dirty by the time we are adding it to the swapcache, except for the
> >>> MADV_FREE case.
> >>
> >> It seems to say the folio will be dirtied when unmapped later, supposing the
> >> PTE is dirty.
> >
> > Oh yeah, it could mean that the folio will be dirtied later.
> >
> >>
> >>>
> >>> However, I think we actually unmap the folio after we add it to the
> >>> swapcache in shrink_folio_list(). Also, I don't immediately see why
> >>> the PTE would be dirty. In do_swap_page(), making the PTE dirty seems
> >>
> >> If all anon pages on the LRU list are faulted by write, it should be true.
> >> We could just use the zero page if faulted by read, right?
> >
> > This applies for the initial fault that creates the folio, but this is
> > a swap fault. It could be a read fault, and in that case we still need
> > to make the folio dirty because it's not in the swapcache and we need
> > to write it out if it's reclaimed, right?
>
> Yes, IMHO it should be marked as dirty here.
>
> But it should be no problem with that unconditional folio_mark_dirty()
> in add_to_swap(). Not sure if there are other issues.

I don't believe there are any issues now. Dirtying the folio in
add_to_swap() was introduced before SWP_SYNCHRONOUS_IO, so I guess
things have always worked.

I think we should update the comment there though to mention that
dirtying the folio is also needed for this case (not just MADV_FREE),
or dirty the PTE during the fault.
Otherwise, if someone makes MADV_FREE changes, they could end up
breaking SWP_SYNCHRONOUS_IO faults.

Adding Minchan here in case he can confirm that we do in fact rely on
add_to_swap()->folio_mark_dirty() for SWP_SYNCHRONOUS_IO to work as
intended.