Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp2585170rwd; Mon, 22 May 2023 00:46:11 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6MGRajSwA24RSbzxUMTx6GzRPMqKSAP5ixv5EWaeGAHkO+PCnC6uBT4cWBIkKQYcX7KM3o X-Received: by 2002:a05:6a21:9998:b0:10a:a0e1:96fa with SMTP id ve24-20020a056a21999800b0010aa0e196famr8252525pzb.22.1684741570667; Mon, 22 May 2023 00:46:10 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1684741570; cv=none; d=google.com; s=arc-20160816; b=vooBvugSHxrjN9N92dR6+bo2nQALzK/n68kM7HDXh3uJysK3gOHnMxBvikfyW2BG+5 L0/iGiN4kQDa7btqeVcsA0oPKHI9VKHMgojSYmnR/UAEV8xjZcH2kfarfjpwekWG2Ztp P2HZid1XHmwbbIqJ9SwP74yYp4TGJntNGwnfkKEJxU9CphTcZJqP/vlTL53mo7rWo16b URctTsa+PfBC5MkekOT0aUp+CJSMFNW2EA2fQLPxxdqyGjuL78Pldnl16YI/cbNdSeLD 2UDLMAj17BUTwTz9sax9gLR16FwTRXEiZZ+P8Z0/rxt7ZWv9srrLLNwXuZPF9k2cVJH9 qFPg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from; bh=1x2M+HHQVJOvgxrO4CAWRDGnEnukUJ3dss+O3+USrq4=; b=TLFgHdcks9SoytO2dIoAGf+ONspmgm/+nA3+mP6G3cgpXsCyHt2m6BbwJRHLVKDv9w 5NCoN4MBrTO7o380K4H0+NPUYDpq4Iwe/IzkJoA9p+tVKjrqyq3AJJRC4u3ZTQ2rmrA+ j7uzfBbj5EI152jIHaiI8C2YVuPtglmL6qa9hzizgnZIvc0ypNGYy8QHaSlH0Gc9M4n5 4k9kKcHSP9T1iL0nlsKe1uyQgXe37+KboEeuVlJMGthhFxuhKISRYrnw3gQqPCRvwwGn mlmDQ8dNHEyNhALNRZceThiof9z2wjNkok8dAfe1UfyY8P87h9bhusb4JnkxTHMeYLlh 2cXQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id k190-20020a6384c7000000b0051b1542d6efsi4235903pgd.213.2023.05.22.00.45.58; Mon, 22 May 2023 00:46:10 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229627AbjEVHWv (ORCPT + 99 others); Mon, 22 May 2023 03:22:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51342 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232526AbjEVHWf (ORCPT ); Mon, 22 May 2023 03:22:35 -0400 Received: from out30-110.freemail.mail.aliyun.com (out30-110.freemail.mail.aliyun.com [115.124.30.110]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 31D081FF7 for ; Mon, 22 May 2023 00:20:34 -0700 (PDT) X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R171e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018046050;MF=hsiangkao@linux.alibaba.com;NM=1;PH=DS;RN=3;SR=0;TI=SMTPD_---0Vj9pHVZ_1684739999; Received: from e18g06460.et15sqa.tbsite.net(mailfrom:hsiangkao@linux.alibaba.com fp:SMTPD_---0Vj9pHVZ_1684739999) by smtp.aliyun-inc.com; Mon, 22 May 2023 15:20:05 +0800 From: Gao Xiang To: linux-erofs@lists.ozlabs.org Cc: LKML , Gao Xiang Subject: [PATCH v2] erofs: kill hooked chains to avoid loops on deduplicated compressed images Date: Mon, 22 May 2023 15:19:58 +0800 Message-Id: <20230522071958.110074-1-hsiangkao@linux.alibaba.com> X-Mailer: git-send-email 2.24.4 In-Reply-To: <20230519071219.41757-1-hsiangkao@linux.alibaba.com> References: <20230519071219.41757-1-hsiangkao@linux.alibaba.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-9.9 required=5.0 tests=BAYES_00, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,UNPARSEABLE_RELAY,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org After heavily stressing EROFS with several images which include a hand-crafted image of repeated patterns for more than 46 days, I found two chains could be linked with each other almost simultaneously and form a loop so that the entire loop won't be submitted. As a consequence, the corresponding file pages will remain locked forever. It can be _only_ observed on data-deduplicated compressed images. For example, consider two chains with five pclusters in total: Chain 1: 2->3->4->5 -- The tail pcluster is 5; Chain 2: 5->1->2 -- The tail pcluster is 2. Chain 2 could link to Chain 1 with pcluster 5; and Chain 1 could link to Chain 2 at the same time with pcluster 2. Since hooked chains are all linked locklessly now, I have no idea how to simply avoid the race. Instead, let's avoid hooked chains entirely until a proper way could be worked out to fix this and end users finally tell us that it's needed to add it back. Actually, this optimization can be found with multi-threaded workloads (especially even more often on deduplicated compressed images), yet I'm not sure about the overall system impacts of not having this compared with implementation complexity. Fixes: 267f2492c8f7 ("erofs: introduce multi-reference pclusters (fully-referenced)") Signed-off-by: Gao Xiang --- changes since v1: - previous fix is still racy, so get rid of hooked chains entirely. fs/erofs/zdata.c | 69 +++++++----------------------------------------- 1 file changed, 10 insertions(+), 59 deletions(-) diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c index 45f21db2303a..92f3a01262cf 100644 --- a/fs/erofs/zdata.c +++ b/fs/erofs/zdata.c @@ -96,8 +96,6 @@ struct z_erofs_pcluster { /* the chained workgroup has't submitted io (still open) */ #define Z_EROFS_PCLUSTER_TAIL ((void *)0x5F0ECAFE) -/* the chained workgroup has already submitted io */ -#define Z_EROFS_PCLUSTER_TAIL_CLOSED ((void *)0x5F0EDEAD) #define Z_EROFS_PCLUSTER_NIL (NULL) @@ -501,20 +499,6 @@ int __init z_erofs_init_zip_subsystem(void) enum z_erofs_pclustermode { Z_EROFS_PCLUSTER_INFLIGHT, - /* - * The current pclusters was the tail of an exist chain, in addition - * that the previous processed chained pclusters are all decided to - * be hooked up to it. - * A new chain will be created for the remaining pclusters which are - * not processed yet, so different from Z_EROFS_PCLUSTER_FOLLOWED, - * the next pcluster cannot reuse the whole page safely for inplace I/O - * in the following scenario: - * ________________________________________________________________ - * | tail (partial) page | head (partial) page | - * | (belongs to the next pcl) | (belongs to the current pcl) | - * |_______PCLUSTER_FOLLOWED______|________PCLUSTER_HOOKED__________| - */ - Z_EROFS_PCLUSTER_HOOKED, /* * a weak form of Z_EROFS_PCLUSTER_FOLLOWED, the difference is that it * could be dispatched into bypass queue later due to uptodated managed @@ -532,8 +516,8 @@ enum z_erofs_pclustermode { * ________________________________________________________________ * | tail (partial) page | head (partial) page | * | (of the current cl) | (of the previous collection) | - * | PCLUSTER_FOLLOWED or | | - * |_____PCLUSTER_HOOKED__|___________PCLUSTER_FOLLOWED____________| + * | | | + * |__PCLUSTER_FOLLOWED___|___________PCLUSTER_FOLLOWED____________| * * [ (*) the above page can be used as inplace I/O. ] */ @@ -546,7 +530,7 @@ struct z_erofs_decompress_frontend { struct z_erofs_bvec_iter biter; struct page *candidate_bvpage; - struct z_erofs_pcluster *pcl, *tailpcl; + struct z_erofs_pcluster *pcl; z_erofs_next_pcluster_t owned_head; enum z_erofs_pclustermode mode; @@ -752,19 +736,7 @@ static void z_erofs_try_to_claim_pcluster(struct z_erofs_decompress_frontend *f) return; } - /* - * type 2, link to the end of an existing open chain, be careful - * that its submission is controlled by the original attached chain. - */ - if (*owned_head != &pcl->next && pcl != f->tailpcl && - cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL, - *owned_head) == Z_EROFS_PCLUSTER_TAIL) { - *owned_head = Z_EROFS_PCLUSTER_TAIL; - f->mode = Z_EROFS_PCLUSTER_HOOKED; - f->tailpcl = NULL; - return; - } - /* type 3, it belongs to a chain, but it isn't the end of the chain */ + /* type 2, it belongs to an ongoing chain */ f->mode = Z_EROFS_PCLUSTER_INFLIGHT; } @@ -825,9 +797,6 @@ static int z_erofs_register_pcluster(struct z_erofs_decompress_frontend *fe) goto err_out; } } - /* used to check tail merging loop due to corrupted images */ - if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL) - fe->tailpcl = pcl; fe->owned_head = &pcl->next; fe->pcl = pcl; return 0; @@ -848,7 +817,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe) /* must be Z_EROFS_PCLUSTER_TAIL or pointed to previous pcluster */ DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_NIL); - DBG_BUGON(fe->owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED); if (!(map->m_flags & EROFS_MAP_META)) { grp = erofs_find_workgroup(fe->inode->i_sb, @@ -867,10 +835,6 @@ static int z_erofs_collector_begin(struct z_erofs_decompress_frontend *fe) if (ret == -EEXIST) { mutex_lock(&fe->pcl->lock); - /* used to check tail merging loop due to corrupted images */ - if (fe->owned_head == Z_EROFS_PCLUSTER_TAIL) - fe->tailpcl = fe->pcl; - z_erofs_try_to_claim_pcluster(fe); } else if (ret) { return ret; @@ -1027,8 +991,7 @@ static int z_erofs_do_read_page(struct z_erofs_decompress_frontend *fe, * those chains are handled asynchronously thus the page cannot be used * for inplace I/O or bvpage (should be processed in a strict order.) */ - tight &= (fe->mode >= Z_EROFS_PCLUSTER_HOOKED && - fe->mode != Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE); + tight &= (fe->mode > Z_EROFS_PCLUSTER_FOLLOWED_NOINPLACE); cur = end - min_t(unsigned int, offset + end - map->m_la, end); if (!(map->m_flags & EROFS_MAP_MAPPED)) { @@ -1406,10 +1369,7 @@ static void z_erofs_decompress_queue(const struct z_erofs_decompressqueue *io, }; z_erofs_next_pcluster_t owned = io->head; - while (owned != Z_EROFS_PCLUSTER_TAIL_CLOSED) { - /* impossible that 'owned' equals Z_EROFS_WORK_TPTR_TAIL */ - DBG_BUGON(owned == Z_EROFS_PCLUSTER_TAIL); - /* impossible that 'owned' equals Z_EROFS_PCLUSTER_NIL */ + while (owned != Z_EROFS_PCLUSTER_TAIL) { DBG_BUGON(owned == Z_EROFS_PCLUSTER_NIL); be.pcl = container_of(owned, struct z_erofs_pcluster, next); @@ -1426,7 +1386,7 @@ static void z_erofs_decompressqueue_work(struct work_struct *work) container_of(work, struct z_erofs_decompressqueue, u.work); struct page *pagepool = NULL; - DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL_CLOSED); + DBG_BUGON(bgq->head == Z_EROFS_PCLUSTER_TAIL); z_erofs_decompress_queue(bgq, &pagepool); erofs_release_pages(&pagepool); kvfree(bgq); @@ -1614,7 +1574,7 @@ static struct z_erofs_decompressqueue *jobqueue_init(struct super_block *sb, q->sync = true; } q->sb = sb; - q->head = Z_EROFS_PCLUSTER_TAIL_CLOSED; + q->head = Z_EROFS_PCLUSTER_TAIL; return q; } @@ -1632,11 +1592,7 @@ static void move_to_bypass_jobqueue(struct z_erofs_pcluster *pcl, z_erofs_next_pcluster_t *const submit_qtail = qtail[JQ_SUBMIT]; z_erofs_next_pcluster_t *const bypass_qtail = qtail[JQ_BYPASS]; - DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED); - if (owned_head == Z_EROFS_PCLUSTER_TAIL) - owned_head = Z_EROFS_PCLUSTER_TAIL_CLOSED; - - WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL_CLOSED); + WRITE_ONCE(pcl->next, Z_EROFS_PCLUSTER_TAIL); WRITE_ONCE(*submit_qtail, owned_head); WRITE_ONCE(*bypass_qtail, &pcl->next); @@ -1707,15 +1663,10 @@ static void z_erofs_submit_queue(struct z_erofs_decompress_frontend *f, unsigned int i = 0; bool bypass = true; - /* no possible 'owned_head' equals the following */ - DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_TAIL_CLOSED); DBG_BUGON(owned_head == Z_EROFS_PCLUSTER_NIL); - pcl = container_of(owned_head, struct z_erofs_pcluster, next); + owned_head = READ_ONCE(pcl->next); - /* close the main owned chain at first */ - owned_head = cmpxchg(&pcl->next, Z_EROFS_PCLUSTER_TAIL, - Z_EROFS_PCLUSTER_TAIL_CLOSED); if (z_erofs_is_inline_pcluster(pcl)) { move_to_bypass_jobqueue(pcl, qtail, owned_head); continue; -- 2.24.4