2011-03-28 22:35:24

by Thomas Gleixner

Subject: [Regression] Please revert a91a2785b20

Out of the blue all my perfectly fine working test machines which use
RAID stopped working with the very helpful error message:

md/raid1:md1: active with 2 out of 2 mirrors
md: pers->run() failed ...

Reverting a91a2785b20 fixes the problem.

Neither the subject line:

block: Require subsystems to explicitly allocate bio_set integrity mempool

nor the changelog have any hint why that wreckage is in any way
sensible.

The wreckage happens due to:

- md_integrity_register(mddev);
- return 0;
+ return md_integrity_register(mddev);

But the changelog does not give the courtesy of explaining these
changes. Also there is no fcking reason why the kernel cannot deal
with the missing integrity capabilities of a drive just by emitting a
warning msg and dealing gracefully with the outcome.

All my RAID setups have been working perfectly fine until now, so
what's the rationale to break this?

Did anyone test this shite on non-enterprise-class hardware with a
distro default config? Obviously _NOT_.

FYI, the config files of those machines are based off a fedora default
config, so this would hit all raid users based on popular distro
configs sooner than later.

Thanks for stealing my time,

tglx


2011-03-28 22:43:09

by Thomas Gleixner

Subject: Re: [Regression] Please revert a91a2785b20

Forgot to cc Jens and fixed up the borked mail address of Mike which
I took from the changelog :(

On Tue, 29 Mar 2011, Thomas Gleixner wrote:

> Out of the blue all my perfectly fine working test machines which use
> RAID stopped working with the very helpful error message:
>
> md/raid1:md1: active with 2 out of 2 mirrors
> md: pers->run() failed ...
>
> Reverting a91a2785b20 fixes the problem.
>
> Neither the subject line:
>
> block: Require subsystems to explicitly allocate bio_set integrity mempool
>
> nor the changelog have any hint why that wreckage is in any way
> sensible.
>
> The wreckage happens due to:
>
> - md_integrity_register(mddev);
> - return 0;
> + return md_integrity_register(mddev);
>
> But the changelog does not give the courtesy of explaining these
> changes. Also there is no fcking reason why the kernel cannot deal
> with the missing integrity capabilities of a drive just by emitting a
> warning msg and dealing gracefully with the outcome.
>
> All my RAID setups have been working perfectly fine until now, so
> what's the rationale to break this?
>
> Did anyone test this shite on a non enterprise class hardware with a
> distro default config ? Obviously _NOT_.
>
> FYI, the config files of those machines are based off a fedora default
> config, so this would hit all raid users based on popular distro
> configs sooner than later.
>
> Thanks for stealing my time,
>
> tglx
>

2011-03-28 22:47:08

by Martin K. Petersen

Subject: Re: [Regression] Please revert a91a2785b20

>>>>> "Thomas" == Thomas Gleixner <[email protected]> writes:

Thomas,

Thomas> But the changelog does not give the courtesy of explaining these
Thomas> changes. Also there is no fcking reason why the kernel cannot
Thomas> deal with the missing integrity capabilities of a drive just by
Thomas> emitting a warning msg and dealing gracefully with the outcome.

My mistake. I was made aware of it earlier today and I'm working on a
patch. Surprised we didn't see any reports of this in -next. It's been
in there for a while.


Thomas> All my RAID setups have been working perfectly fine until now,
Thomas> so what's the rationale to break this?

People were complaining about excessive mempool usage with the block
integrity bits enabled (thanks to MD and DM allocating a bioset per
device to prevent deadlocks).

Making allocation conditional meant we had to deal with memory
allocation errors in the setup path. I tested various combinations of on
and off but apparently not all off. Sorry about that.
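
For context, the shape of the change the commit asks for is roughly the
following. This is only an illustrative sketch of the new opt-in
allocation, not the exact MD/DM code, and "disk" here stands for
whatever gendisk the subsystem sits on top of:

	struct bio_set *bs;

	bs = bioset_create(BIO_POOL_SIZE, 0);
	if (!bs)
		return -ENOMEM;

	/*
	 * The integrity mempool is no longer allocated automatically by
	 * bioset_create(); a subsystem has to opt in explicitly and is
	 * therefore on the hook for handling the allocation failure.
	 */
	if (blk_get_integrity(disk) &&
	    bioset_integrity_create(bs, BIO_POOL_SIZE)) {
		bioset_free(bs);
		return -ENOMEM;
	}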

--
Martin K. Petersen Oracle Linux Engineering

2011-03-28 23:03:27

by Thomas Gleixner

Subject: Re: [Regression] Please revert a91a2785b20

On Mon, 28 Mar 2011, Martin K. Petersen wrote:
> >>>>> "Thomas" == Thomas Gleixner <[email protected]> writes:
>
> Thomas,
>
> Thomas> But the changelog does not give the courtesy of explaining these
> Thomas> changes. Also there is no fcking reason why the kernel cannot
> Thomas> deal with the missing integrity capabilities of a drive just by
> Thomas> emitting a warning msg and dealing gracefully with the outcome.
>
> My mistake. I was made aware of it earlier today and I'm working on a

Why didn't you send a revert to Linus right away?

Darn. I sent a pull request earlier today, which I immediately
revoked when I noticed that it had a late-reported testing failure.
Fortunately, it did not hit Linus' public tree. I could have said
"I'm working on a fix" as well. But that's the wrong thing to do.

So for your thing, it was already in Linus' tree. Still, if you become
aware of it and it's revertible w/o creating lots of mess, then the
right thing is to revert it immediately. Do not drag out regressions
longer than necessary, please.

> Surprised we didn't see any reports of this in -next. It's been
> in there for a while.

-next unfortunately does not get the full exposure, and it's neither a
replacement for common sense nor an excuse for not testing the common
case (i.e. non-enterprise hardware with default distro configs).

Thanks,

tglx

2011-03-28 23:04:27

by Mike Snitzer

Subject: Re: Please revert a91a2785b20

On Mon, Mar 28 2011 at 6:43pm -0400,
Thomas Gleixner <[email protected]> wrote:

> Forgot to cc Jens and fixed up the borked mail address of Mike which
> I took from the changelog :(
>
> On Tue, 29 Mar 2011, Thomas Gleixner wrote:
>
> > Out of the blue all my perfectly fine working test machines which use
> > RAID stopped working with the very helpful error message:
> >
> > md/raid1:md1: active with 2 out of 2 mirrors
> > md: pers->run() failed ...
> >
> > Reverting a91a2785b20 fixes the problem.
> >
> > Neither the subject line:
> >
> > block: Require subsystems to explicitly allocate bio_set integrity mempool
> >
> > nor the changelog have any hint why that wreckage is in any way
> > sensible.
> >
> > The wreckage happens due to:
> >
> > - md_integrity_register(mddev);
> > - return 0;
> > + return md_integrity_register(mddev);

Right, a kernel.org BZ was filed on this here:
https://bugzilla.kernel.org/show_bug.cgi?id=32062

Martin is working to "conjure up a patch" to fix the common case where
no devices in the MD have DIF/DIX capabilities.

> > But the changelog does not give the courtesy of explaining these
> > changes. Also there is no fcking reason why the kernel cannot deal
> > with the missing integrity capabilities of a drive just by emitting a
> > warning msg and dealing gracefully with the outcome.
> >
> > All my RAID setups have been working perfectly fine until now, so
> > what's the rationale to break this?
> >
> > Did anyone test this shite on a non enterprise class hardware with a
> > distro default config ? Obviously _NOT_.

Seems not. I didn't even look at the MD changes when I ack'd the DM
changes. But I clearly stated as much and also cc'd neilb:
http://www.redhat.com/archives/dm-devel/2011-March/msg00066.html
and again for the final version that got committed:
http://www.redhat.com/archives/dm-devel/2011-March/msg00098.html

I should've just looked at the MD changes too! As Neil would say, that
damn DM/MD walled garden... luckily that wall is slowly crumbling!

> > FYI, the config files of those machines are based off a fedora default
> > config, so this would hit all raid users based on popular distro
> > configs sooner than later.
> >
> > Thanks for stealing my time,

Sorry this screwed you; the following patch is the stop-gap I suggested
in the kernel.org BZ (it just reverts the MD error propagation while
keeping the good aspects of Martin's commit):

---
drivers/md/linear.c | 3 ++-
drivers/md/multipath.c | 7 ++-----
drivers/md/raid0.c | 3 ++-
drivers/md/raid1.c | 5 +++--
drivers/md/raid10.c | 7 ++-----
5 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/md/linear.c b/drivers/md/linear.c
index abfb59a..338804f8 100644
--- a/drivers/md/linear.c
+++ b/drivers/md/linear.c
@@ -210,7 +210,8 @@ static int linear_run (mddev_t *mddev)
blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec);
mddev->queue->backing_dev_info.congested_fn = linear_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
- return md_integrity_register(mddev);
+ md_integrity_register(mddev);
+ return 0;
}

static void free_conf(struct rcu_head *head)
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index c358909..5e694b1 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -315,7 +315,7 @@ static int multipath_remove_disk(mddev_t *mddev, int number)
p->rdev = rdev;
goto abort;
}
- err = md_integrity_register(mddev);
+ md_integrity_register(mddev);
}
abort:

@@ -489,10 +489,7 @@ static int multipath_run (mddev_t *mddev)

mddev->queue->backing_dev_info.congested_fn = multipath_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
-
- if (md_integrity_register(mddev))
- goto out_free_conf;
-
+ md_integrity_register(mddev);
return 0;

out_free_conf:
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index e86bf36..95916fd 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -379,7 +379,8 @@ static int raid0_run(mddev_t *mddev)

blk_queue_merge_bvec(mddev->queue, raid0_mergeable_bvec);
dump_zones(mddev);
- return md_integrity_register(mddev);
+ md_integrity_register(mddev);
+ return 0;
}

static int raid0_stop(mddev_t *mddev)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index c2a21ae..8f34ad5 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1132,7 +1132,7 @@ static int raid1_remove_disk(mddev_t *mddev, int number)
p->rdev = rdev;
goto abort;
}
- err = md_integrity_register(mddev);
+ md_integrity_register(mddev);
}
abort:

@@ -2017,7 +2017,8 @@ static int run(mddev_t *mddev)

mddev->queue->backing_dev_info.congested_fn = raid1_congested;
mddev->queue->backing_dev_info.congested_data = mddev;
- return md_integrity_register(mddev);
+ md_integrity_register(mddev);
+ return 0;
}

static int stop(mddev_t *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f7b6237..c0d0f5f 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1188,7 +1188,7 @@ static int raid10_remove_disk(mddev_t *mddev, int number)
p->rdev = rdev;
goto abort;
}
- err = md_integrity_register(mddev);
+ md_integrity_register(mddev);
}
abort:

@@ -2343,10 +2343,7 @@ static int run(mddev_t *mddev)

if (conf->near_copies < conf->raid_disks)
blk_queue_merge_bvec(mddev->queue, raid10_mergeable_bvec);
-
- if (md_integrity_register(mddev))
- goto out_free_conf;
-
+ md_integrity_register(mddev);
return 0;

out_free_conf:

2011-03-29 00:09:42

by Martin K. Petersen

Subject: Re: [Regression] Please revert a91a2785b20

>>>>> "Thomas" == Thomas Gleixner <[email protected]> writes:

Thomas> Why didn't you send a revert to Linus right away?

Simply didn't think of it. Didn't get the bug report until this
afternoon and few people build with BLK_DEV_INTEGRITY enabled. But I
guess Fedora has it on by default.


Thomas> So for your thing, it was already in Linus tree. Though if you
Thomas> get aware of it and it's revertable w/o creating lots of mess,
Thomas> then it's the right thing to revert it immediately. Do not drag
Thomas> out regressions longer than necessary, please.

Noted.

Patch below...

--
Martin K. Petersen Oracle Linux Engineering


[PATCH] md: Fix integrity registration error when no devices are capable

We incorrectly returned -EINVAL when none of the devices in the array
had an integrity profile. This in turn prevented mdadm from starting the
metadevice. Fix this so we only return errors on mismatched profiles and
memory allocation failures.

Reported-by: Giacomo Catenazzi <[email protected]>
Reported-by: Thomas Gleixner <[email protected]>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f11e0bc..aab112f 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1778,12 +1778,6 @@ int md_integrity_register(mddev_t *mddev)
continue;
if (rdev->raid_disk < 0)
continue;
- /*
- * If at least one rdev is not integrity capable, we can not
- * enable data integrity for the md device.
- */
- if (!bdev_get_integrity(rdev->bdev))
- return -EINVAL;
if (!reference) {
/* Use the first rdev as the reference */
reference = rdev;
@@ -1794,6 +1788,8 @@ int md_integrity_register(mddev_t *mddev)
rdev->bdev->bd_disk) < 0)
return -EINVAL;
}
+ if (!reference || !bdev_get_integrity(reference->bdev))
+ return 0;
/*
* All component devices are integrity capable and have matching
* profiles, register the common profile for the md device.

2011-03-29 00:12:30

by Martin K. Petersen

Subject: Re: [Regression] Please revert a91a2785b20

>>>>> "Martin" == Martin K Petersen <[email protected]> writes:

Forgot to add:

Signed-off-by: Martin K. Petersen <[email protected]>

--
Martin K. Petersen Oracle Linux Engineering

2011-03-29 02:32:50

by Thomas Gleixner

Subject: Re: [Regression] Please revert a91a2785b20

On Mon, 28 Mar 2011, Martin K. Petersen wrote:
>
> Reported-by: Giacomo Catenazzi <[email protected]>
> Reported-by: Thomas Gleixner <[email protected]>

Just for the record: Tested-by-me

Thanks Martin for the prompt response and sorry if I was overly
grumpy.

Thanks,

tglx

2011-03-29 05:37:50

by Giacomo Catenazzi

Subject: Re: [Regression] Please revert a91a2785b20

On 03/29/2011 02:09 AM, Martin K. Petersen wrote:
>>>>>> "Thomas" == Thomas Gleixner <[email protected]> writes:
>
> Thomas> Why didn't you send a revert to Linus right away?
>
> Simply didn't think of it. Didn't get the bug report until this
> afternoon and few people build with BLK_DEV_INTEGRITY enabled. But I
> guess Fedora has it on by default.

BTW, I have BLK_DEV_INTEGRITY disabled, so I'm confused by your analysis.

Anyway, your patch solves the problem on my system.

ciao
cate

2011-03-29 06:59:26

by Jens Axboe

Subject: Re: Please revert a91a2785b20

On 2011-03-29 01:03, Mike Snitzer wrote:
> On Mon, Mar 28 2011 at 6:43pm -0400,
> Thomas Gleixner <[email protected]> wrote:
>
>> Forgot to cc Jens and fixed up the borked mail address of Mike which
>> I took from the changelog :(
>>
>> On Tue, 29 Mar 2011, Thomas Gleixner wrote:
>>
>>> Out of the blue all my perfectly fine working test machines which use
>>> RAID stopped working with the very helpful error message:
>>>
>>> md/raid1:md1: active with 2 out of 2 mirrors
>>> md: pers->run() failed ...
>>>
>>> Reverting a91a2785b20 fixes the problem.
>>>
>>> Neither the subject line:
>>>
>>> block: Require subsystems to explicitly allocate bio_set integrity mempool
>>>
>>> nor the changelog have any hint why that wreckage is in any way
>>> sensible.
>>>
>>> The wreckage happens due to:
>>>
>>> - md_integrity_register(mddev);
>>> - return 0;
>>> + return md_integrity_register(mddev);
>
> Right, a kernel.org BZ was filed on this here:
> https://bugzilla.kernel.org/show_bug.cgi?id=32062
>
> Martin is working to "conjure up a patch" to fix the common case where
> no devices in the MD have DIF/DIX capabilities.

And I see that has been merged now. So all is good?

--
Jens Axboe

2011-03-29 13:21:28

by Mike Snitzer

Subject: Re: Please revert a91a2785b20

On Tue, Mar 29 2011 at 2:59am -0400,
Jens Axboe <[email protected]> wrote:

> On 2011-03-29 01:03, Mike Snitzer wrote:
> > On Mon, Mar 28 2011 at 6:43pm -0400,
> > Thomas Gleixner <[email protected]> wrote:
> >
> >> Forgot to cc Jens and fixed up the borked mail address of Mike which
> >> I took from the changelog :(
> >>
> >> On Tue, 29 Mar 2011, Thomas Gleixner wrote:
> >>
> >>> Out of the blue all my perfectly fine working test machines which use
> >>> RAID stopped working with the very helpful error message:
> >>>
> >>> md/raid1:md1: active with 2 out of 2 mirrors
> >>> md: pers->run() failed ...
> >>>
> >>> Reverting a91a2785b20 fixes the problem.
> >>>
> >>> Neither the subject line:
> >>>
> >>> block: Require subsystems to explicitly allocate bio_set integrity mempool
> >>>
> >>> nor the changelog have any hint why that wreckage is in any way
> >>> sensible.
> >>>
> >>> The wreckage happens due to:
> >>>
> >>> - md_integrity_register(mddev);
> >>> - return 0;
> >>> + return md_integrity_register(mddev);
> >
> > Right, a kernel.org BZ was filed on this here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=32062
> >
> > Martin is working to "conjure up a patch" to fix the common case where
> > no devices in the MD have DIF/DIX capabilities.
>
> And I see that has been merged now. So all is good?

Yes, commit 89078d572e clearly addresses the immediate concern.

But I think we have a related issue that needs discussion, given that an
integrity profile mismatch will cause MD's assemble to fail (rather than
warn and continue to assemble without integrity support).

DM doesn't fail to load a DM device due to an integrity profile mismatch;
it just emits a warning and continues.

In contrast, MD will now disallow adding a normal disk (without
integrity support) to an array that has historically had a symmetric
integrity profile across all members.

So this begs the question: what is the correct approach?

At the moment I prefer the more forgiving approach that DM provides.
But I can appreciate the other school of thought where a mismatch is a
hard failure during device assembly (avoids possibility of falling back
to not using integrity capabilities).

Mike

2011-03-29 13:35:18

by James Bottomley

Subject: Re: Please revert a91a2785b20

On Tue, 2011-03-29 at 09:20 -0400, Mike Snitzer wrote:
> On Tue, Mar 29 2011 at 2:59am -0400,
> Jens Axboe <[email protected]> wrote:
>
> > On 2011-03-29 01:03, Mike Snitzer wrote:
> > > On Mon, Mar 28 2011 at 6:43pm -0400,
> > > Thomas Gleixner <[email protected]> wrote:
> > >
> > >> Forgot to cc Jens and fixed up the borked mail address of Mike which
> > >> I took from the changelog :(
> > >>
> > >> On Tue, 29 Mar 2011, Thomas Gleixner wrote:
> > >>
> > >>> Out of the blue all my perfectly fine working test machines which use
> > >>> RAID stopped working with the very helpful error message:
> > >>>
> > >>> md/raid1:md1: active with 2 out of 2 mirrors
> > >>> md: pers->run() failed ...
> > >>>
> > >>> Reverting a91a2785b20 fixes the problem.
> > >>>
> > >>> Neither the subject line:
> > >>>
> > >>> block: Require subsystems to explicitly allocate bio_set integrity mempool
> > >>>
> > >>> nor the changelog have any hint why that wreckage is in any way
> > >>> sensible.
> > >>>
> > >>> The wreckage happens due to:
> > >>>
> > >>> - md_integrity_register(mddev);
> > >>> - return 0;
> > >>> + return md_integrity_register(mddev);
> > >
> > > Right, a kernel.org BZ was filed on this here:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=32062
> > >
> > > Martin is working to "conjure up a patch" to fix the common case where
> > > no devices in the MD have DIF/DIX capabilities.
> >
> > And I see that has been merged now. So all is good?
>
> Yes, commit 89078d572e clearly addresses the immediate concern.
>
> But I think we have a related issue that needs discussion, given that an
> integrity profile mismatch will cause MD's assemble to fail (rather than
> warn and continue to assemble without integrity support).
>
> DM doesn't fail to load a DM device due to a integrity profile mismatch;
> it just emits a warning and continues.
>
> In contrast, MD will now disallow adding a normal disk (without
> integrity support) to an array that has historically had a symmetric
> integrity profile across all members.
>
> So this begs the question: what is the correct approach?

That's easy: anything that would cause a change in the previously
specified user profile for the array is an error (principle of least
surprise), so it can't happen automatically. Silently adding a
non-integrity device would be nasty for the admin because they could think
they were adding an integrity disk when they weren't (for a variety of
reasons, like some problem with the HBA or incorrect settings on the
disk). So adding a non-integrity device to an integrity-profile array
should return a soft error, with an explanation. However, the user
should be able to force the array online and change the profile. How
it's done (tooling or kernel), I'm not particularly bothered.

James

2011-03-29 13:42:52

by Martin K. Petersen

Subject: Re: Please revert a91a2785b20

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

Mike,

Mike> But I think we have a related issue that needs discussion, given
Mike> that an integrity profile mismatch will cause MD's assemble to
Mike> fail (rather than warn and continue to assemble without integrity
Mike> support).

Mike> DM doesn't fail to load a DM device due to a integrity profile
Mike> mismatch; it just emits a warning and continues.

Mike> In contrast, MD will now disallow adding a normal disk (without
Mike> integrity support) to an array that has historically had a
Mike> symmetric integrity profile across all members.

You would invalidate all your existing integrity metadata, tagging,
etc. on existing metadevice members. That seems to be a policy decision,
so if we go down that path it would have to be keyed off a force
assembly option passed down from userland tooling. Turning off features
and/or losing metadata really should not be done without the user's
explicit consent.

Also, let's assume you run an integrity-aware app on a DM device and you
add a non-integrity drive. The DM device is then no longer capable of
carrying integrity metadata out to storage. What happens to the app?
What about outstanding writes with metadata attached?

Good discussion topic for next week, methinks...

--
Martin K. Petersen Oracle Linux Engineering

2011-03-29 13:58:44

by Mike Snitzer

Subject: Need to refactor DM's integrity profile support a bit (was: Re: Please revert a91a2785b20)

On Tue, Mar 29 2011 at 9:42am -0400,
Martin K. Petersen <[email protected]> wrote:

> >>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>
> Mike,
>
> Mike> But I think we have a related issue that needs discussion, given
> Mike> that an integrity profile mismatch will cause MD's assemble to
> Mike> fail (rather than warn and continue to assemble without integrity
> Mike> support).
>
> Mike> DM doesn't fail to load a DM device due to a integrity profile
> Mike> mismatch; it just emits a warning and continues.
>
> Mike> In contrast, MD will now disallow adding a normal disk (without
> Mike> integrity support) to an array that has historically had a
> Mike> symmetric integrity profile across all members.
>
> You would invalidate all your existing integrity metadata, tagging,
> etc. on existing metadevice members. That seems to be a policy decision,
> so if we go down that path it would have to be keyed off a force
> assembly option passed down from userland tooling. Turning off features
> and/or losing metadata really should not be done without the user's
> explicit consent.
>
> Also, let's assume you run an integrity-aware app on a DM device and you
> add a non-integrity drive. The DM device is then no longer capable of
> carrying integrity metadata out to storage. What happens to the app?
> What about outstanding writes with metadata attached?
>
> Good discussion topic for next week, methinks...

Sure, I'm just trying to reconcile the difference in behavior between MD
and DM. It seems DM missed out on the integrity profile error propagation
treatment that MD just received.

Easily resolved (though there are some dragons lying in wait relative to
where this inconsistency is detected by DM; hint: it must be during DM
table load). The current hook, dm_table_set_integrity, disables the
integrity support of the DM device well beyond the point of no return
(when resuming a DM device that was allowed to have a new table loaded).

I welcome discussing this in a bit more detail in the hallway track at
LSF ;)

Thanks,
Mike

2011-04-01 17:43:49

by Mike Snitzer

Subject: [PATCH] dm: improve block integrity support

The current block integrity (DIF/DIX) support in DM is verifying that
all devices' integrity profiles match during DM device resume (which
is past the point of no return). To some degree that is unavoidable
(stacked DM devices force this late checking). But for most DM
devices (which aren't stacking on other DM devices) the ideal time to
verify all integrity profiles match is during table load.

Introduce the notion of an "initialized" integrity profile: a profile
that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
template. Add blk_integrity_is_initialized() to allow checking if a
profile was initialized.

Update DM integrity support to:
- check all devices with _initialized_ integrity profiles match
during table load; uninitialized profiles (e.g. for underlying DM
device(s) of a stacked DM device) are ignored.
- disallow a table load that would result in an integrity profile that
conflicts with a DM device's existing (in-use) integrity profile
- avoid clearing an existing integrity profile
- validate all integrity profiles match during resume; but if they
don't all we can do is report the mismatch (during resume we're past
the point of no return)

Signed-off-by: Mike Snitzer <[email protected]>
Cc: Martin K. Petersen <[email protected]>
---
block/blk-integrity.c | 12 +++++-
drivers/md/dm-table.c | 114 +++++++++++++++++++++++++++++++++--------------
include/linux/blkdev.h | 2 +
3 files changed, 93 insertions(+), 35 deletions(-)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 54bcba6..129b9e2 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -30,6 +30,8 @@

static struct kmem_cache *integrity_cachep;

+static const char *bi_unsupported_name = "unsupported";
+
/**
* blk_rq_count_integrity_sg - Count number of integrity scatterlist elements
* @q: request queue
@@ -358,6 +360,14 @@ static struct kobj_type integrity_ktype = {
.release = blk_integrity_release,
};

+bool blk_integrity_is_initialized(struct gendisk *disk)
+{
+ struct blk_integrity *bi = blk_get_integrity(disk);
+
+ return (bi && bi->name && strcmp(bi->name, bi_unsupported_name) != 0);
+}
+EXPORT_SYMBOL(blk_integrity_is_initialized);
+
/**
* blk_integrity_register - Register a gendisk as being integrity-capable
* @disk: struct gendisk pointer to make integrity-aware
@@ -407,7 +417,7 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
bi->get_tag_fn = template->get_tag_fn;
bi->tag_size = template->tag_size;
} else
- bi->name = "unsupported";
+ bi->name = bi_unsupported_name;

return 0;
}
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 416d4e2..cb8380c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -927,20 +927,80 @@ static int dm_table_build_index(struct dm_table *t)
}

/*
+ * Get a disk whose integrity profile reflects the table's profile.
+ * If %match_all is true, all devices' profiles must match.
+ * If %match_all is false, all devices must at least have an
+ * allocated integrity profile; but uninitialized is ok.
+ * Returns NULL if integrity support was inconsistent or unavailable.
+ */
+static struct gendisk * dm_table_get_integrity_disk(struct dm_table *t,
+ bool match_all)
+{
+ struct list_head *devices = dm_table_get_devices(t);
+ struct dm_dev_internal *dd = NULL;
+ struct gendisk *prev_disk = NULL, *template_disk = NULL;
+
+ list_for_each_entry(dd, devices, list) {
+ template_disk = dd->dm_dev.bdev->bd_disk;
+ if (!blk_get_integrity(template_disk))
+ goto no_integrity;
+ if (!match_all && !blk_integrity_is_initialized(template_disk))
+ continue; /* skip uninitialized profiles */
+ else if (prev_disk &&
+ blk_integrity_compare(prev_disk, template_disk) < 0)
+ goto no_integrity;
+ prev_disk = template_disk;
+ }
+
+ return template_disk;
+
+no_integrity:
+ if (prev_disk)
+ DMWARN("%s: integrity not set: %s and %s profile mismatch",
+ dm_device_name(t->md),
+ prev_disk->disk_name,
+ template_disk->disk_name);
+ return NULL;
+}
+
+/*
* Register the mapped device for blk_integrity support if
- * the underlying devices support it.
+ * the underlying devices have an integrity profile. But all devices
+ * may not have matching profiles (checking all devices isn't reliable
+ * during table load because this table may use other DM device(s) which
+ * must be resumed before they will have an initialized integrity profile).
+ * Stacked DM devices force a 2 stage integrity profile validation:
+ * 1 - during load, validate all initialized integrity profiles match
+ * 2 - during resume, validate all integrity profiles match
*/
static int dm_table_prealloc_integrity(struct dm_table *t, struct mapped_device *md)
{
- struct list_head *devices = dm_table_get_devices(t);
- struct dm_dev_internal *dd;
+ struct gendisk *template_disk = NULL;

- list_for_each_entry(dd, devices, list)
- if (bdev_get_integrity(dd->dm_dev.bdev)) {
- t->integrity_supported = 1;
- return blk_integrity_register(dm_disk(md), NULL);
- }
+ template_disk = dm_table_get_integrity_disk(t, false);
+ if (!template_disk)
+ return 0;

+ if (!blk_integrity_is_initialized(dm_disk(md))) {
+ t->integrity_supported = 1;
+ return blk_integrity_register(dm_disk(md), NULL);
+ }
+
+ /*
+ * If DM device already has an initialized integrity
+ * profile the new profile should not conflict.
+ */
+ if (blk_integrity_is_initialized(template_disk) &&
+ blk_integrity_compare(dm_disk(md), template_disk) < 0) {
+ DMWARN("%s: conflict with existing integrity profile: "
+ "%s profile mismatch",
+ dm_device_name(t->md),
+ template_disk->disk_name);
+ return 1;
+ }
+
+ /* Preserve existing initialized integrity profile */
+ t->integrity_supported = 1;
return 0;
}

@@ -1094,41 +1154,27 @@ combine_limits:

/*
* Set the integrity profile for this device if all devices used have
- * matching profiles.
+ * matching profiles. We're quite deep in the resume path but still
+ * don't know if all devices (particularly DM devices this device
+ * may be stacked on) have matching profiles. Even if the profiles
+ * don't match we have no way to fail (to resume) at this point.
*/
static void dm_table_set_integrity(struct dm_table *t)
{
- struct list_head *devices = dm_table_get_devices(t);
- struct dm_dev_internal *prev = NULL, *dd = NULL;
+ struct gendisk *template_disk = NULL;

if (!blk_get_integrity(dm_disk(t->md)))
return;

- list_for_each_entry(dd, devices, list) {
- if (prev &&
- blk_integrity_compare(prev->dm_dev.bdev->bd_disk,
- dd->dm_dev.bdev->bd_disk) < 0) {
- DMWARN("%s: integrity not set: %s and %s mismatch",
- dm_device_name(t->md),
- prev->dm_dev.bdev->bd_disk->disk_name,
- dd->dm_dev.bdev->bd_disk->disk_name);
- goto no_integrity;
- }
- prev = dd;
+ template_disk = dm_table_get_integrity_disk(t, true);
+ if (!template_disk &&
+ blk_integrity_is_initialized(dm_disk(t->md))) {
+ DMWARN("%s: device no longer has a valid integrity profile",
+ dm_device_name(t->md));
+ return;
}
-
- if (!prev || !bdev_get_integrity(prev->dm_dev.bdev))
- goto no_integrity;
-
blk_integrity_register(dm_disk(t->md),
- bdev_get_integrity(prev->dm_dev.bdev));
-
- return;
-
-no_integrity:
- blk_integrity_register(dm_disk(t->md), NULL);
-
- return;
+ blk_get_integrity(template_disk));
}

void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 16a902f..32176cc 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1206,6 +1206,7 @@ struct blk_integrity {
struct kobject kobj;
};

+extern bool blk_integrity_is_initialized(struct gendisk *);
extern int blk_integrity_register(struct gendisk *, struct blk_integrity *);
extern void blk_integrity_unregister(struct gendisk *);
extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
@@ -1262,6 +1263,7 @@ queue_max_integrity_segments(struct request_queue *q)
#define queue_max_integrity_segments(a) (0)
#define blk_integrity_merge_rq(a, b, c) (0)
#define blk_integrity_merge_bio(a, b, c) (0)
+#define blk_integrity_is_initialized(a) (0)

#endif /* CONFIG_BLK_DEV_INTEGRITY */

2011-04-05 02:10:00

by NeilBrown

Subject: Re: Please revert a91a2785b20

On Tue, 29 Mar 2011 09:42:08 -0400 "Martin K. Petersen"
<[email protected]> wrote:

> >>>>> "Mike" == Mike Snitzer <[email protected]> writes:
>
> Mike,
>
> Mike> But I think we have a related issue that needs discussion, given
> Mike> that an integrity profile mismatch will cause MD's assemble to
> Mike> fail (rather than warn and continue to assemble without integrity
> Mike> support).
>
> Mike> DM doesn't fail to load a DM device due to a integrity profile
> Mike> mismatch; it just emits a warning and continues.
>
> Mike> In contrast, MD will now disallow adding a normal disk (without
> Mike> integrity support) to an array that has historically had a
> Mike> symmetric integrity profile across all members.
>
> You would invalidate all your existing integrity metadata, tagging,
> etc. on existing metadevice members. That seems to be a policy decision,
> so if we go down that path it would have to be keyed off a force
> assembly option passed down from userland tooling. Turning off features
> and/or losing metadata really should not be done without the user's
> explicit consent.

I've been distracted by other things for a while so I'm just looking at this
now, and I've never really paid much attention to the 'integrity' stuff
(seems all wrong to me anyway) so I might need some help coming up to
speed...

My reading of data-integrity.txt suggests that the integrity metadata (IMD)
consists of two basic components.
One is a fixed-format chunk which contains a light-weight checksum, possibly
with some other summary information (address?). This is created at
whichever level doesn't trust the levels below, verified by the drive
firmware, and written to disk (together with whatever much stronger CRC the
drive firmware wants). It can then be verified on read by anyone who cares.

It seems to me that if one device in the array doesn't support this, then it
cannot be visible above the array but can still be computed and checked below
the array level, which is nearly as safe(??)
So changing an array from homogeneous to mixed just moves where the checksum
is calculated - not a big deal (maybe).

The other component of the IMD is an application tag which is not understood
or checked by the drive firmware (though presumably it is included in the
light-weight checksum so minimal checking is possible). This is used by the
filesystem to get an extra few bytes of data per-block which is known to be
updated atomically with the rest of the block.
This is currently completely unsupported by any redundant md array as there
is no attempt to copy this info when recovering to a spare etc.
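
For concreteness: both components live in the 8 bytes of integrity
metadata carried per 512-byte sector. A sketch of the T10 DIF tuple
layout, following the SCSI disk driver's definition, purely as an
illustration:

	struct sd_dif_tuple {
		__be16 guard_tag;	/* light-weight checksum of the data block */
		__be16 app_tag;		/* application tag, opaque to the drive */
		__be32 ref_tag;		/* usually the low 32 bits of the target LBA */
	};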

So for this part of the IMD, RAID0 and LINEAR are the only levels that might
support it, and they don't have new devices added while active. LINEAR can
be extended by adding a device but that is used so rarely that I'm not really
fussed exactly how it gets handled.

So it seems to me there is no justification for disallowing the adding of a
device to an active array just because of some incompatibility with integrity
management.

Am I missing something???


Longer term - it is conceivable that RAID1/RAID10 could be taught to copy
integrity data, and we might even be able to make RAID5/6 handle it, though
the idea certainly doesn't appeal to me.

In that case we really need to know whether the sysadmin is expecting
integrity support or not. I think it is only justified to refuse an
explicit request to add a device to an array if we know it to be incompatible
with some other request.
I don't know how integrity should be requested - maybe mkfs, maybe mount,
maybe ioctl... maybe an mdadm option.
Once md finds out that it has been explicitly requested the array could be
flagged to say that integrity is in use, and then we could do all the extra
work to provide it, and reject new devices which don't support it.


Does any of that make sense? Is there something I have completely
misunderstood?

>
> Also, let's assume you run an integrity-aware app on a DM device and you
> add a non-integrity drive. The DM device is then no longer capable of
> carrying integrity metadata out to storage. What happens to the app?
> What about outstanding writes with metadata attached?
>
> Good discussion topic for next week, methinks...
>

Yeah, have fun - but remember that not all storage people are there :-)

NeilBrown

2011-04-14 14:09:39

by Mike Snitzer

Subject: Re: dm: improve block integrity support

[trimming CCs so as not to pester as many people directly]

On Fri, Apr 01 2011 at 1:42pm -0400,
Mike Snitzer <[email protected]> wrote:

> The current block integrity (DIF/DIX) support in DM is verifying that
> all devices' integrity profiles match during DM device resume (which
> is past the point of no return). To some degree that is unavoidable
> (stacked DM devices force this late checking). But for most DM
> devices (which aren't stacking on other DM devices) the ideal time to
> verify all integrity profiles match is during table load.
>
> Introduce the notion of an "initialized" integrity profile: a profile
> that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
> template. Add blk_integrity_is_initialized() to allow checking if a
> profile was initialized.
>
> Update DM integrity support to:
> - check all devices with _initialized_ integrity profiles match
> during table load; uninitialized profiles (e.g. for underlying DM
> device(s) of a stacked DM device) are ignored.
> - disallow a table load that would result in an integrity profile that
> conflicts with a DM device's existing (in-use) integrity profile
> - avoid clearing an existing integrity profile
> - validate all integrity profiles match during resume; but if they
> don't all we can do is report the mismatch (during resume we're past
> the point of no return)
>
> Signed-off-by: Mike Snitzer <[email protected]>
> Cc: Martin K. Petersen <[email protected]>

Hi Martin,

Any chance you've had a look at this? I'm most interested in whether
the code works with the various integrity profiles you have.

I'd really like to get this reviewed and queued for upstream so that it
doesn't die on the vine.

Thanks,
Mike

2011-04-14 14:42:21

by Mike Snitzer

Subject: Re: dm: improve block integrity support

On Thu, Apr 14 2011 at 10:09am -0400,
Mike Snitzer <[email protected]> wrote:

> [trimming CCs so as not to pester as many people directly]
>
> On Fri, Apr 01 2011 at 1:42pm -0400,
> Mike Snitzer <[email protected]> wrote:
>
> > The current block integrity (DIF/DIX) support in DM is verifying that
> > all devices' integrity profiles match during DM device resume (which
> > is past the point of no return). To some degree that is unavoidable
> > (stacked DM devices force this late checking). But for most DM
> > devices (which aren't stacking on other DM devices) the ideal time to
> > verify all integrity profiles match is during table load.
> >
> > Introduce the notion of an "initialized" integrity profile: a profile
> > that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
> > template. Add blk_integrity_is_initialized() to allow checking if a
> > profile was initialized.
> >
> > Update DM integrity support to:
> > - check all devices with _initialized_ integrity profiles match
> > during table load; uninitialized profiles (e.g. for underlying DM
> > device(s) of a stacked DM device) are ignored.
> > - disallow a table load that would result in an integrity profile that
> > conflicts with a DM device's existing (in-use) integrity profile
> > - avoid clearing an existing integrity profile
> > - validate all integrity profiles match during resume; but if they
> > don't all we can do is report the mismatch (during resume we're past
> > the point of no return)
> >
> > Signed-off-by: Mike Snitzer <[email protected]>
> > Cc: Martin K. Petersen <[email protected]>
>
> Hi Martin,
>
> Any chance you've had a look at this? I'm most interested in whether
> the code works with the various integrity profiles you have.
>
> I'd really like to get this reviewed and queued for upstream so that it
> doesn't die on the vine.

Ah wow, just noticed this got in through the block tree over a week ago:
http://git.kernel.org/linus/a63a5cf84dac7a23a57

Thanks Jens!

Martin,
I've done enough testing that I am confident the DM integrity support is
in a better place with this change. But I'd still really appreciate it
if you could verify all is well.

Mike

2011-04-15 04:59:28

by Martin K. Petersen

Subject: Re: dm: improve block integrity support

>>>>> "Mike" == Mike Snitzer <[email protected]> writes:

Mike> Martin, I've done enough testing that I am confident the DM
Mike> integrity support is in a better place with this change. But I'd
Mike> still really appreciate it if you could verify all is well.

Still playing catchup on mail. I'll check it out and test it tomorrow...

--
Martin K. Petersen Oracle Linux Engineering