From: Andreas Dilger
Message-Id: <4C297F47-C508-4DC9-8360-5E0873873833@dilger.ca>
Subject: Re: [RFC PATCH v1 0/7] Block/XFS: Support alternative mirror device retry
Date: Wed, 28 Nov 2018 12:38:47 -0700
In-Reply-To: <20181128054923.GF8125@magnolia>
Cc: Dave Chinner, Allison Henderson, linux-block@vger.kernel.org,
 linux-xfs@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Martin Petersen, shirley.ma@oracle.com,
 bob.liu@oracle.com
To: "Darrick J. Wong"
References: <1543376991-5764-1-git-send-email-allison.henderson@oracle.com>
 <20181128053303.GL6311@dastard> <20181128054923.GF8125@magnolia>
X-Mailing-List: linux-kernel@vger.kernel.org

On Nov 27, 2018, at 10:49 PM, Darrick J.
Wong wrote:
> On Wed, Nov 28, 2018 at 04:33:03PM +1100, Dave Chinner wrote:
>> On Tue, Nov 27, 2018 at 08:49:44PM -0700, Allison Henderson wrote:
>>> Motivation:
>>> When fs data/metadata checksums mismatch, lower block devices may have other
>>> correct copies. e.g. If XFS successfully reads a metadata buffer off a raid1
>>> but decides that the metadata is garbage, today it will shut down the entire
>>> filesystem without trying any of the other mirrors. This is a severe
>>> loss of service, and we propose these patches to have XFS try harder to
>>> avoid failure.
>>>
>>> This patch prototypes this mirror retry idea by:
>>> * Adding @nr_mirrors to struct request_queue which is similar to
>>> blk_queue_nonrot(), filesystem can grab device request queue and check max
>>> mirrors this block device has.
>>> Helper functions were also added to get/set the nr_mirrors.
>>>
>>> * Expanding bi_write_hint to bi_rw_hint, now @bi_rw_hint has three meanings.
>>> 1. Original write_hint.
>>> 2. end_io() will update @bi_rw_hint to reflect which mirror this i/o really happened.
>>> 3. Fs set @bi_rw_hint to force driver e.g. raid1 read from a specific mirror.
>>>
>>> * Modify md/raid1 to support this retry feature.
>>>
>>> * Add b_rw_hint to xfs_buf
>>> This patch adds a new field b_rw_hint to xfs_buf. We will use this to set the
>>> new bio->bi_rw_hint when submitting the read request, and also to store the
>>> returned mirror when the read completes
>
>> the retry iterations. That allows us to let the block layer pick
>> whatever leg it wants for the initial read, but if we get a failure
>> we directly control the mirror we retry from and all bios in the
>> buffer go to that same mirror.
>> - is it generic/abstract enough to be able to work with
>> RAID5/6 to trigger verification/recovery from the parity
>> information in the stripe?
>
> In theory we could supply a raid5 implementation, wherein rw_hint == 0
> lets the raid do as it pleases; rw_hint == 1 reads from the stripe; and
> rw_hint == 2 forces stripe recovery for the given block.

Definitely this API needs to be useful for RAID-5/6 storage as well, and
I don't think that needs too complex an interface to achieve. Basically,
the "nr_mirrors" parameter would instead be "nr_retries" or similar, so
that the caller knows how many possible data combinations there are to
try and validate. For mirrors this is easy, and is what is currently
implemented. For RAID-5/6 this would essentially be the number of data
rebuild combinations in the RAID group (e.g. 8 in a RAID-5 8+1 setup,
and 16 in a RAID-6 8+2).

For each call with nr_retries != 0, the MD RAID-5/6 driver would skip
one of the data drives, and rebuild that part of the data from parity.
This wouldn't take too long, since the blocks are already in memory;
they just need the parity to be recomputed in a few different ways to
try and find a combination that returns valid data (e.g. if a drive
failed and the parity also has a latent corrupt sector, which is not
uncommon).

The next step is to have an API that says "retry=N returned the correct
data, rebuild the parity/drive with that combination of devices" so that
the corrupt parity sector isn't used during the rebuild.
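To make the retry idea a bit more concrete, here is a rough userspace
sketch for the RAID-5 (single XOR parity) case only. It is not kernel
code and not how the MD driver works today. Every name in it
(nr_retries(), stripe_read(), read_with_retries(), toy_csum(), BLK) is
made up for illustration, and toy_csum() just stands in for whatever
verifier the filesystem would run. Retry 0 is the plain read; retry k
ignores data drive k-1 and rebuilds it from parity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 8	/* tiny block size, for illustration only */

/* Hypothetical helper: rebuild combinations, e.g. 8 for 8+1, 16 for 8+2. */
static unsigned nr_retries(unsigned ndata, unsigned nparity)
{
	return ndata * nparity;
}

/* Toy checksum standing in for the filesystem's verifier (assumption). */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t c = 0;

	for (size_t i = 0; i < len; i++)
		c = c * 31u + buf[i];
	return c;
}

/*
 * retry == 0: return the data blocks as read.
 * retry == k (1..ndata): ignore data block k-1 and rebuild it as the
 * XOR of the parity block and all the other data blocks.
 */
static void stripe_read(const uint8_t *data, const uint8_t *parity,
			unsigned ndata, unsigned retry, uint8_t *out)
{
	assert(retry <= ndata);
	memcpy(out, data, (size_t)ndata * BLK);
	if (retry == 0)
		return;

	unsigned skip = retry - 1;
	uint8_t *dst = out + (size_t)skip * BLK;

	memcpy(dst, parity, BLK);
	for (unsigned d = 0; d < ndata; d++) {
		if (d == skip)
			continue;
		for (size_t i = 0; i < BLK; i++)
			dst[i] ^= data[(size_t)d * BLK + i];
	}
}

/* Returns the retry index whose data verified, or -1 if none did. */
static int read_with_retries(const uint8_t *data, const uint8_t *parity,
			     unsigned ndata, uint32_t want, uint8_t *out)
{
	for (unsigned retry = 0; retry <= nr_retries(ndata, 1); retry++) {
		stripe_read(data, parity, ndata, retry, out);
		if (toy_csum(out, (size_t)ndata * BLK) == want)
			return (int)retry;
	}
	return -1;
}
```

The index returned by read_with_retries() is the value a later "rebuild
with that combination" call would take. RAID-6 would additionally need
the Q syndrome for the second set of combinations, which is left out
here.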

Cheers, Andreas