Date: Sun, 10 Mar 2024 11:27:22 -0400
From: Mike Snitzer
To: Ming Lei
Cc: Patrick Plenefisch, Goffredo Baroncelli, linux-kernel@vger.kernel.org,
 Alasdair Kergon, Mikulas Patocka, Chris Mason, Josef Bacik, David Sterba,
 regressions@lists.linux.dev, dm-devel@lists.linux.dev, linux-btrfs@vger.kernel.org
Subject: Re: LVM-on-LVM: error while submitting device barriers
References: <672e88f2-8ac3-45fe-a2e9-730800017f53@libero.it>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Mar 10 2024 at 7:34P -0400, Ming Lei wrote:

> On Sat, Mar 09, 2024 at 03:39:02PM -0500, Patrick Plenefisch wrote:
> > On Wed, Mar 6, 2024 at 11:00 AM Ming Lei wrote:
> > >
> > > On Tue, Mar 05, 2024 at 12:45:13PM -0500, Mike Snitzer wrote:
> > > > On Thu, Feb 29 2024 at 5:05P -0500, Goffredo Baroncelli wrote:
> > > > >
> > > > > On 29/02/2024 21.22, Patrick Plenefisch wrote:
> > > > > > On Thu, Feb 29, 2024 at 2:56 PM Goffredo Baroncelli wrote:
> > > > > > >
> > > > > > > > Your understanding is correct. The only thing that comes to my mind to
> > > > > > > > cause the problem is asymmetry of the SATA devices. I have one 8TB
> > > > > > > > device, plus 1.5TB, 3TB, and 3TB drives.
> > > > > > > > Doing math on the actual extents, lowerVG/single spans
> > > > > > > > (3TB+3TB), and lowerVG/lvmPool/lvm/brokenDisk spans
> > > > > > > > (3TB+1.5TB). Both obviously have the other leg of raid1 on
> > > > > > > > the 8TB drive, but my thought was that the jump across the
> > > > > > > > 1.5+3TB drive gap was at least "interesting"
> > > > > > >
> > > > > > > what about lowerVG/works ?
> > > > > >
> > > > > > That one is only on two disks, it doesn't span any gaps
> > > > >
> > > > > Sorry, but re-reading the original email I found something that I missed before:
> > > > >
> > > > > > BTRFS error (device dm-75): bdev /dev/mapper/lvm-brokenDisk errs: wr
> > > > > > 0, rd 0, flush 1, corrupt 0, gen 0
> > > > > > BTRFS warning (device dm-75): chunk 13631488 missing 1 devices, max
> > > > >                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > > > > tolerance is 0 for writable mount
> > > > > > BTRFS: error (device dm-75) in write_all_supers:4379: errno=-5 IO
> > > > > > failure (errors while submitting device barriers.)
> > > > >
> > > > > Looking at the code, it seems that if a FLUSH command fails, btrfs
> > > > > considers the disk missing. Then it cannot mount the device RW.
> > > > >
> > > > > I would investigate with the LVM developers whether the flush/barrier
> > > > > command is properly passed through all the layers when we have an
> > > > > lvm over lvm (raid1). The fact that the lvm is a raid1 is important,
> > > > > because for a flush command to be honored it has to be honored by all
> > > > > the devices involved.
> > >
> > > Hello Patrick & Goffredo,
> > >
> > > I can trigger this kind of btrfs complaint by simulating one FLUSH failure.
> > >
> > > If you can reproduce this issue easily, please collect a log with the
> > > following bpftrace script, which may show where the flush failure is,
> > > and maybe it can help to narrow down the issue in the whole stack.
> > >
> > > #!/usr/bin/bpftrace
> > >
> > > #ifndef BPFTRACE_HAVE_BTF
> > > #include
> > > #endif
> > >
> > > kprobe:submit_bio_noacct,
> > > kprobe:submit_bio
> > > / (((struct bio *)arg0)->bi_opf & (1 << __REQ_PREFLUSH)) != 0 /
> > > {
> > > 	$bio = (struct bio *)arg0;
> > > 	@submit_stack[arg0] = kstack;
> > > 	@tracked[arg0] = 1;
> > > }
> > >
> > > kprobe:bio_endio
> > > /@tracked[arg0] != 0/
> > > {
> > > 	$bio = (struct bio *)arg0;
> > >
> > > 	if (($bio->bi_flags & (1 << BIO_CHAIN)) && $bio->__bi_remaining.counter > 1) {
> > > 		return;
> > > 	}
> > >
> > > 	if ($bio->bi_status != 0) {
> > > 		printf("dev %s bio failed %d, submitter %s completion %s\n",
> > > 			$bio->bi_bdev->bd_disk->disk_name,
> > > 			$bio->bi_status, @submit_stack[arg0], kstack);
> > > 	}
> > > 	delete(@submit_stack[arg0]);
> > > 	delete(@tracked[arg0]);
> > > }
> > >
> > > END {
> > > 	clear(@submit_stack);
> > > 	clear(@tracked);
> > > }
> >
> > Attaching 4 probes...
> > dev dm-77 bio failed 10, submitter
> > 	submit_bio_noacct+5
> > 	__send_duplicate_bios+358
> > 	__send_empty_flush+179
> > 	dm_submit_bio+857
> > 	__submit_bio+132
> > 	submit_bio_noacct_nocheck+345
> > 	write_all_supers+1718
> > 	btrfs_commit_transaction+2342
> > 	transaction_kthread+345
> > 	kthread+229
> > 	ret_from_fork+49
> > 	ret_from_fork_asm+27
> > completion
> > 	bio_endio+5
> > 	dm_submit_bio+955
> > 	__submit_bio+132
> > 	submit_bio_noacct_nocheck+345
> > 	write_all_supers+1718
> > 	btrfs_commit_transaction+2342
> > 	transaction_kthread+345
> > 	kthread+229
> > 	ret_from_fork+49
> > 	ret_from_fork_asm+27
> >
> > dev dm-86 bio failed 10, submitter
> > 	submit_bio_noacct+5
> > 	write_all_supers+1718
> > 	btrfs_commit_transaction+2342
> > 	transaction_kthread+345
> > 	kthread+229
> > 	ret_from_fork+49
> > 	ret_from_fork_asm+27
> > completion
> > 	bio_endio+5
> > 	clone_endio+295
> > 	clone_endio+295
> > 	process_one_work+369
> > 	worker_thread+635
> > 	kthread+229
> > 	ret_from_fork+49
> > 	ret_from_fork_asm+27
> >
> > For context, dm-86 is
> > /dev/lvm/brokenDisk and dm-77 is /dev/lowerVG/lvmPool
>
> io_status is 10 (BLK_STS_IOERR), which is produced in the submission code
> path on /dev/dm-77 (/dev/lowerVG/lvmPool) first, so it looks like a device
> mapper issue.
>
> The error should be from the following code only:
>
> static void __map_bio(struct bio *clone)
>
> 	...
> 	if (r == DM_MAPIO_KILL)
> 		dm_io_dec_pending(io, BLK_STS_IOERR);
> 	else
> 		dm_io_dec_pending(io, BLK_STS_DM_REQUEUE);
> 	break;

I agree that the above bpf stack traces for dm-77 indicate that
dm_submit_bio failed, which would end up in the above branch if the
target's ->map() returned DM_MAPIO_KILL or DM_MAPIO_REQUEUE. But such an
early failure speaks to the flush bio never being submitted to the
underlying storage. No?

dm-raid.c:raid_map does return DM_MAPIO_REQUEUE with:

	/*
	 * If we're reshaping to add disk(s), ti->len and
	 * mddev->array_sectors will differ during the process
	 * (ti->len > mddev->array_sectors), so we have to requeue
	 * bios with addresses > mddev->array_sectors here or
	 * there will occur accesses past EOD of the component
	 * data images thus erroring the raid set.
	 */
	if (unlikely(bio_end_sector(bio) > mddev->array_sectors))
		return DM_MAPIO_REQUEUE;

But a flush doesn't have an end_sector (it'd be 0, afaik), so it seems
weird relative to a flush.

> Patrick, you mentioned lvmPool is raid1, can you explain how lvmPool is
> built? Is it the dm-raid1 target, or is it over a plain raid1 device which
> is built over /dev/lowerVG?

In my earlier reply I asked Patrick for both:

	lsblk
	dmsetup table

Picking over the described IO stacks provided earlier (or Goffredo's
interpretation of them, via ascii art) isn't really a great way to see the
IO stacks that are in use/in question.
> Mike, the logic in the following code doesn't change from v5.18-rc2 to
> v5.19, but I still can't understand why BLK_STS_IOERR is set in
> dm_io_complete() in the case of BLK_STS_DM_REQUEUE && !__noflush_suspending(),
> since DMF_NOFLUSH_SUSPENDING is only set in __dm_suspend(), which is not
> supposed to happen in Patrick's case.
>
> dm_io_complete()
> 	...
> 	if (io->status == BLK_STS_DM_REQUEUE) {
> 		unsigned long flags;
> 		/*
> 		 * Target requested pushing back the I/O.
> 		 */
> 		spin_lock_irqsave(&md->deferred_lock, flags);
> 		if (__noflush_suspending(md) &&
> 		    !WARN_ON_ONCE(dm_is_zone_write(md, bio))) {
> 			/* NOTE early return due to BLK_STS_DM_REQUEUE below */
> 			bio_list_add_head(&md->deferred, bio);
> 		} else {
> 			/*
> 			 * noflush suspend was interrupted or this is
> 			 * a write to a zoned target.
> 			 */
> 			io->status = BLK_STS_IOERR;
> 		}
> 		spin_unlock_irqrestore(&md->deferred_lock, flags);
> 	}

Given the reason dm-raid.c:raid_map returns DM_MAPIO_REQUEUE, I think the
DM device could be suspending without flush. But regardless (and given you
logged BLK_STS_IOERR, let's assume it isn't), the assumption that "noflush
suspend was interrupted" seems like a stale comment -- especially given
that targets like dm-raid now use DM_MAPIO_REQUEUE without concern for the
historic tight coupling with noflush suspend (which was always the case
for the biggest historic reason for this code: dm-multipath, see commit
2e93ccc1933d0 from 2006 -- it predates my time developing DM).

So, all said, this code seems flawed for dm-raid (and possibly other
targets that return DM_MAPIO_REQUEUE). I'll look closer this week.

Mike