Date: Mon, 4 Oct 2021 12:12:46 +0200
From: Sean Nyekjaer
To: Boris Brezillon
Cc: Miquel Raynal, Richard Weinberger, Vignesh Raghavendra, Boris Brezillon, linux-mtd@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mtd: rawnand: use mutex to protect access while in suspend
Message-ID: <20211004101246.kagtezizympxupat@skn-laptop>
References:
 <20211004065608.3190348-1-sean@geanix.com> <20211004104147.579f3b01@collabora.com> <20211004085509.iikxtdvxpt6bri5c@skn-laptop> <20211004115817.18739936@collabora.com>
In-Reply-To: <20211004115817.18739936@collabora.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 04, 2021 at 11:58:17AM +0200, Boris Brezillon wrote:
> On Mon, 4 Oct 2021 10:55:09 +0200
> Sean Nyekjaer wrote:
>
> > On Mon, Oct 04, 2021 at 10:41:47AM +0200, Boris Brezillon wrote:
> > > On Mon, 4 Oct 2021 08:56:09 +0200
> > > Sean Nyekjaer wrote:
> > >
> > > > This will prevent nand_get_device() from returning -EBUSY.
> > > > It will force mtd_write()/mtd_read() to wait for nand_resume() to unlock
> > > > access to the mtd device.
> > > >
> > > > This way we avoid -EBUSY being returned to ubifs via mtd_write()/mtd_read(),
> > > > which in turn hard-errors on every error returned.
> > > > We have seen ubifs try to call mtd_write() before the mtd device
> > > > is resumed.
> > >
> > > I think the problem is here. Why would UBIFS/UBI try to write something
> > > to a device that's not resumed yet (or has been suspended already, if
> > > you hit this in the suspend path)?
> > >
> > > >
> > > > Exec_op[0] speeds things up, so we see this race when the device is
> > > > resuming. But it's actually "mtd: rawnand: Simplify the locking" that
> > > > allows it to return -EBUSY; before that commit it would have waited for
> > > > the mtd device to resume.
> > >
> > > Uh, wait. If nand_resume() was called before any writes/reads this
> > > wouldn't happen. IMHO, the problem is not that we return -EBUSY without
> > > blocking, the problem is that someone issues a write/read before calling
> > > mtd_resume().
> > >
> >
> > The commit msg from "mtd: rawnand: Simplify the locking" states this clearly.
> >
> > """
> > Last important change to mention: we now return -EBUSY when someone
> > tries to access a device that as been suspended, and propagate this
> > error to the upper layer.
> > """
> >
> > IMHO "mtd: rawnand: Simplify the locking" should never have been merged
> > before the upper layers were fixed to handle -EBUSY. ;)
> > Which they still are not...
>
> That's not really the problem here. Upper layers should never get
> -EBUSY in the first place if the MTD device was resumed before the UBI
> device. Looks like we have a missing UBI -> MTD parenting link, which
> would explain why things don't get resumed in the right order. Can you
> try with the following diff applied?
>
> ---
> diff --git a/drivers/mtd/ubi/build.c b/drivers/mtd/ubi/build.c
> index f399edc82191..1981ce8f3a26 100644
> --- a/drivers/mtd/ubi/build.c
> +++ b/drivers/mtd/ubi/build.c
> @@ -905,6 +905,7 @@ int ubi_attach_mtd_dev(struct mtd_info *mtd, int ubi_num,
>         ubi->dev.release = dev_release;
>         ubi->dev.class = &ubi_class;
>         ubi->dev.groups = ubi_dev_groups;
> +       ubi->dev.parent = &mtd->dev;
>
>         ubi->mtd = mtd;
>         ubi->ubi_num = ubi_num;
>

No change:

[   71.739193] Filesystems sync: 34.212 seconds
[   71.755044] Freezing user space processes ... (elapsed 0.004 seconds) done.
[   71.767289] OOM killer disabled.
[   71.770552] Freezing remaining freezable tasks ... (elapsed 0.004 seconds) done.
[   71.782182] printk: Suspending console(s) (use no_console_suspend to debug)
[   71.824391] nand_suspend
[   71.825177] gpmi_pm_suspend
[   71.825676] PM: suspend devices took 0.040 seconds
[   71.825971] nand_write_oob - nand_get_device() returned -EBUSY
[   71.825985] ubi0 error: ubi_io_write: error -16 while writing 4096 bytes to PEB 986:65536, written 0 bytes
[   71.826029] CPU: 0 PID: 7 Comm: kworker/u2:0 Not tainted 5.15.0-rc3-dirty #43
[   71.826043] Hardware name: Freescale i.MX6 Ultralite (Device Tree)
[   71.826054] Workqueue: writeback wb_workfn (flush-ubifs_0_8)
[   71.826094] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[   71.826122] [] (show_stack) from [] (dump_stack_lvl+0x40/0x4c)
[   71.826151] [] (dump_stack_lvl) from [] (ubi_io_write+0x510/0x6b0)
[   71.826178] [] (ubi_io_write) from [] (ubi_eba_write_leb+0xd0/0x968)
[   71.826204] [] (ubi_eba_write_leb) from [] (ubi_leb_write+0xd0/0xe8)
[   71.826232] [] (ubi_leb_write) from [] (ubifs_leb_write+0x68/0x104)
[   71.826263] [] (ubifs_leb_write) from [] (ubifs_wbuf_write_nolock+0x28c/0x74c)
[   71.826291] [] (ubifs_wbuf_write_nolock) from [] (ubifs_jnl_write_data+0x1b8/0x2b4)
[   71.826319] [] (ubifs_jnl_write_data) from [] (do_writepage+0x190/0x284)
[   71.826342] [] (do_writepage) from [] (__writepage+0x14/0x68)
[   71.826367] [] (__writepage) from [] (write_cache_pages+0x1c8/0x3f0)
[   71.826390] [] (write_cache_pages) from [] (do_writepages+0xcc/0x1f4)
[   71.826413] [] (do_writepages) from [] (__writeback_single_inode+0x2c/0x1b4)
[   71.826440] [] (__writeback_single_inode) from [] (writeback_sb_inodes+0x200/0x470)
[   71.826466] [] (writeback_sb_inodes) from [] (__writeback_inodes_wb+0x3c/0xf4)
[   71.826493] [] (__writeback_inodes_wb) from [] (wb_writeback+0x190/0x1f0)
[   71.826520] [] (wb_writeback) from [] (wb_workfn+0x2c0/0x3d4)
[   71.826545] [] (wb_workfn) from [] (process_one_work+0x1e0/0x440)
[   71.826574] [] (process_one_work) from [] (worker_thread+0x48/0x594)
[   71.826600] [] (worker_thread) from [] (kthread+0x134/0x15c)
[   71.826625] [] (kthread) from [] (ret_from_fork+0x14/0x24)
[...]
[   71.921673] gpmi_pm_resume
[   71.923319] nand_resume
[   71.936120] PM: resume devices took 0.100 seconds
[   72.314551] ci_hdrc ci_hdrc.0: freeing queued request
[   72.521656] IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready
[   75.006404] OOM killer enabled.
[   75.009562] Restarting tasks ...
[   75.074123] done.
[   75.095540] PM: suspend exit

With the RFC PATCH:

[ 3702.682122] Filesystems sync: 33.416 seconds
[ 3702.695350] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 3702.704218] OOM killer disabled.
[ 3702.707559] Freezing remaining freezable tasks ... (elapsed 0.003 seconds) done.
[ 3702.718696] printk: Suspending console(s) (use no_console_suspend to debug)
[ 3702.757660] nand_suspend
[ 3702.758577] gpmi_pm_suspend
[ 3702.759072] PM: suspend devices took 0.040 seconds
[ 3702.761618] Disabling non-boot CPUs ...
[ 3702.854985] gpmi_pm_resume
[ 3702.856623] nand_resume
[ 3702.867796] PM: resume devices took 0.110 seconds
[ 3702.895019] OOM killer enabled.
[ 3702.898291] Restarting tasks ... done.
[ 3702.950723] PM: suspend exit
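
To make the difference between the two logs easier to reason about, here is a rough userspace sketch of the two behaviours discussed in this thread: failing fast with -EBUSY while suspended, versus sleeping until resume. It is not the rawnand code. struct fake_chip, get_device_nowait(), get_device_wait(), release_device(), fake_suspend(), fake_resume() and demo() are all hypothetical names made up for illustration, and a pthread mutex plus condition variable stand in for whatever locking the kernel side would actually use.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for the chip state; not the real struct nand_chip. */
struct fake_chip {
	pthread_mutex_t lock;
	pthread_cond_t resumed;
	int suspended;
};

/* Fail-fast variant: a caller arriving while suspended gets -EBUSY. */
static int get_device_nowait(struct fake_chip *c)
{
	pthread_mutex_lock(&c->lock);
	if (c->suspended) {
		pthread_mutex_unlock(&c->lock);
		return -EBUSY;
	}
	return 0;	/* lock stays held until release_device() */
}

/* Blocking variant: a caller arriving while suspended sleeps until resume. */
static int get_device_wait(struct fake_chip *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->suspended)
		pthread_cond_wait(&c->resumed, &c->lock);
	return 0;	/* lock stays held until release_device() */
}

static void release_device(struct fake_chip *c)
{
	pthread_mutex_unlock(&c->lock);
}

static void fake_suspend(struct fake_chip *c)
{
	pthread_mutex_lock(&c->lock);
	c->suspended = 1;
	pthread_mutex_unlock(&c->lock);
}

static void fake_resume(struct fake_chip *c)
{
	pthread_mutex_lock(&c->lock);
	c->suspended = 0;
	pthread_cond_broadcast(&c->resumed);	/* wake any blocked callers */
	pthread_mutex_unlock(&c->lock);
}

/* Helper thread: resume the fake chip after roughly 100 ms. */
static void *resume_later(void *arg)
{
	usleep(100 * 1000);
	fake_resume(arg);
	return NULL;
}

/* Returns 0 if both variants behave as described above, -1 otherwise. */
static int demo(void)
{
	struct fake_chip chip = {
		PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0
	};
	pthread_t t;
	int early, late;

	fake_suspend(&chip);

	/* A "write" issued while suspended fails immediately. */
	early = get_device_nowait(&chip);
	printf("nowait while suspended: %d\n", early);

	/* The same "write" with the blocking variant waits for resume. */
	if (pthread_create(&t, NULL, resume_later, &chip))
		return -1;
	late = get_device_wait(&chip);
	release_device(&chip);
	pthread_join(t, NULL);
	printf("wait across resume: %d\n", late);

	return (early == -EBUSY && late == 0) ? 0 : -1;
}
```

Calling demo() from a small main() and building with something like `cc -pthread` should be enough to try it. The sketch only illustrates the waiting behaviour; it says nothing about whether blocking in the NAND layer or fixing the resume ordering is the right fix.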