2011-05-18 20:32:52

by Tony Luck

[permalink] [raw]
Subject: New boot time message: detected capacity change

Today's pull from Linus' tree (HEAD = 258-ga2b9c1f) gave me some new
messages during boot:

sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856

They weren't there yesterday (HEAD = 211-gc1d10d1) ... nor do they
show up in any of my saved boot time dmesg files for the last few
months.

Harmless? Or something to worry about in the last few commits
before 2.6.39 goes final?

-Tony


2011-05-18 20:51:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

On Wed, May 18, 2011 at 1:32 PM, Tony Luck <[email protected]> wrote:
> Today's pull from Linus' tree (HEAD = 258-ga2b9c1f) gave me some new
> messages during boot:
>
> sda: detected capacity change from 0 to 146815737856
> sdb: detected capacity change from 0 to 146815737856
>
> They weren't there yesterday (HEAD = 211-gc1d10d1) ... nor do they
> show up in any of my saved boot time dmesg files for the last few
> months.
>
> Harmless? ?Or something to worry about in the last few commits
> before 2.6.39 goes final?

I htink it's 02e352287a40 ("block: rescan partitions on invalidated
devices on -ENOMEDIA too"), which was reported to fix a bugzilla
entry.

However, now that I look closer, that bugzilla entry was two years old
and reported for 2.6.29.

So it wasn't a regression fix like the changelog made me think (with a
stable pointer for 38)

Jens, Tejun - stop this messing around! The block layer has been one
of the problem children in the last releases, the *LAST* thing we need
is things like this happening this late in the -rc series!

Seriously. I'm really upset. I need to be able to trust you, and you
are not being trust-worthy. F*&^ you, in other words. This was *NOT* a
regression.

I don't care if it fixes a long-standing bug, you do not send fixes
like that to me. It should have gone into the merge window for 40, and
at *that* point it might be marked for stable.

As it was, I feel that those commit descriptions were actively
misleading me into thinking this was a regression.

Maybe it won't cause any problems, but -rc7 is not the time to make
these kinds of experiments!

Linus

2011-05-19 05:29:12

by Tejun Heo

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

Hello, Linus.

On Wed, May 18, 2011 at 01:50:19PM -0700, Linus Torvalds wrote:
> As it was, I feel that those commit descriptions were actively
> misleading me into thinking this was a regression.

The commit message is incomplete rather than misleading. The problem
has been there for a long time but hasn't affected cdrom so not many
people have noticed it. The changes to media revalidation exposed the
bug for cdrom too, so it became a regression for 2.6.38. Please refer
to the following thread.

https://lkml.org/lkml/2011/3/31/436

As the root cause was the same as the 2009 bug, I just referenced that
one. I should have included the link to the newer thread or explained
how it affected recent conversion from ->media_changed to
->check_events. My bad.

> Maybe it won't cause any problems, but -rc7 is not the time to make
> these kinds of experiments!

The patch was posted Apr 6th. It somehow took quite long to reach
your tree. Sorry about that too.

As for the warning message, I think it wouldn't be harmful but at -rc7
I'm a bit nervous too. I'll take a deeper look.

Thank you.

--
tejun

2011-05-19 06:01:07

by Tejun Heo

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

Hello, again.

On Thu, May 19, 2011 at 07:29:03AM +0200, Tejun Heo wrote:
> The commit message is incomplete rather than misleading. The problem
> has been there for a long time but hasn't affected cdrom so not many
> people have noticed it. The changes to media revalidation exposed the
> bug for cdrom too, so it became a regression for 2.6.38. Please refer
> to the following thread.
>
> https://lkml.org/lkml/2011/3/31/436
>
> As the root cause was the same as the 2009 bug, I just referenced that
> one. I should have included the link to the newer thread or explained
> how it affected recent conversion from ->media_changed to
> ->check_events. My bad.

I was a bit confused here, so there were two separate problems. One
affected sr, which was fixed by bf2253a6f0 and the other affecting
partition scan on sd fixed by 02e352287a. The two were mixed in my
head, and I explained the wrong one.

The cdrom problem was a plain kernel regression. The latter (the one
being shouted at in this thread) is slightly more complex. Due to the
way devices were polled for media change from userland, the problem
wasn't noticeable. The kernel misbehaved but userland polling masked
it. 2.6.38 added in-kernel polling and the userland workaround no
longer applied and the problem became visible - similar story with the
cdrom problem but involves userland behavior change too.

Both were very simliar in the cause too. cdrom was unnecessarily
skipping check_disk_change() while the block layer was unnecessarily
skipping partition check.

Anyways, I'll go to my office and look into the warning message, but
please revert 02e352287a if necessary. We can do that with -stable
later.

Thanks.

--
tejun

2011-05-19 07:25:39

by Jens Axboe

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

On 2011-05-18 22:50, Linus Torvalds wrote:
> On Wed, May 18, 2011 at 1:32 PM, Tony Luck <[email protected]> wrote:
>> Today's pull from Linus' tree (HEAD = 258-ga2b9c1f) gave me some new
>> messages during boot:
>>
>> sda: detected capacity change from 0 to 146815737856
>> sdb: detected capacity change from 0 to 146815737856
>>
>> They weren't there yesterday (HEAD = 211-gc1d10d1) ... nor do they
>> show up in any of my saved boot time dmesg files for the last few
>> months.
>>
>> Harmless? Or something to worry about in the last few commits
>> before 2.6.39 goes final?
>
> I htink it's 02e352287a40 ("block: rescan partitions on invalidated
> devices on -ENOMEDIA too"), which was reported to fix a bugzilla
> entry.
>
> However, now that I look closer, that bugzilla entry was two years old
> and reported for 2.6.29.
>
> So it wasn't a regression fix like the changelog made me think (with a
> stable pointer for 38)
>
> Jens, Tejun - stop this messing around! The block layer has been one
> of the problem children in the last releases, the *LAST* thing we need
> is things like this happening this late in the -rc series!

This release has not been great, mostly early in the cycle. I will take
complete blame for pushing this change so late, that's why I asked Tejun
about the three patches queues up yesterday. We've had more churn in
this cycle due to both the plugging and media event notification
changes, both have caused way more commits and later in the cycle that
I'm usually comfortable with. I do _always_ try to push the big stuff
before -rc1, so I don't think it's completely fair to quote earlier
releases as problematic.

> Seriously. I'm really upset. I need to be able to trust you, and you
> are not being trust-worthy. F*&^ you, in other words. This was *NOT* a
> regression.
>
> I don't care if it fixes a long-standing bug, you do not send fixes
> like that to me. It should have gone into the merge window for 40, and
> at *that* point it might be marked for stable.
>
> As it was, I feel that those commit descriptions were actively
> misleading me into thinking this was a regression.
>
> Maybe it won't cause any problems, but -rc7 is not the time to make
> these kinds of experiments!

I agree. And not that it's an excuse, but it has been well tested here
on my test and primary machines. Arguably this particular patch should
just have waited for 2.6.40 and with a stable backport instead.

Will not happen again.

--
Jens Axboe

2011-05-19 07:27:41

by Jens Axboe

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

On 2011-05-19 08:01, Tejun Heo wrote:
> Hello, again.
>
> On Thu, May 19, 2011 at 07:29:03AM +0200, Tejun Heo wrote:
>> The commit message is incomplete rather than misleading. The problem
>> has been there for a long time but hasn't affected cdrom so not many
>> people have noticed it. The changes to media revalidation exposed the
>> bug for cdrom too, so it became a regression for 2.6.38. Please refer
>> to the following thread.
>>
>> https://lkml.org/lkml/2011/3/31/436
>>
>> As the root cause was the same as the 2009 bug, I just referenced that
>> one. I should have included the link to the newer thread or explained
>> how it affected recent conversion from ->media_changed to
>> ->check_events. My bad.
>
> I was a bit confused here, so there were two separate problems. One
> affected sr, which was fixed by bf2253a6f0 and the other affecting
> partition scan on sd fixed by 02e352287a. The two were mixed in my
> head, and I explained the wrong one.
>
> The cdrom problem was a plain kernel regression. The latter (the one
> being shouted at in this thread) is slightly more complex. Due to the
> way devices were polled for media change from userland, the problem
> wasn't noticeable. The kernel misbehaved but userland polling masked
> it. 2.6.38 added in-kernel polling and the userland workaround no
> longer applied and the problem became visible - similar story with the
> cdrom problem but involves userland behavior change too.
>
> Both were very simliar in the cause too. cdrom was unnecessarily
> skipping check_disk_change() while the block layer was unnecessarily
> skipping partition check.
>
> Anyways, I'll go to my office and look into the warning message, but
> please revert 02e352287a if necessary. We can do that with -stable
> later.

It's too late at this point really, since 2.6.39 is tagged and done. The
capacity message should be purely harmless, even if it does look
somewhat odd. The placement is fairly logical, so should not cause too
many confused users.

--
Jens Axboe

2011-05-19 09:30:32

by Tejun Heo

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

On Wed, May 18, 2011 at 01:32:50PM -0700, Tony Luck wrote:
> Today's pull from Linus' tree (HEAD = 258-ga2b9c1f) gave me some new
> messages during boot:
>
> sda: detected capacity change from 0 to 146815737856
> sdb: detected capacity change from 0 to 146815737856
>
> They weren't there yesterday (HEAD = 211-gc1d10d1) ... nor do they
> show up in any of my saved boot time dmesg files for the last few
> months.
>
> Harmless? Or something to worry about in the last few commits
> before 2.6.39 goes final?

Yes, it is harmless. The order was changed between
rescan_partitions() and bd_set_size() - rescan_partitions() checked
for size change too and it resulted in simpler condition checks.
However, rescan_partitions() triggers the above message when it
detects size change unlike the silient bd_set_size().

Does the following patch remove the messages for you?

Thanks.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 257b00e..bf9c7a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1120,6 +1120,15 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
goto restart;
}
}
+
+ if (!ret && !bdev->bd_openers) {
+ bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
+ bdi = blk_get_backing_dev_info(bdev);
+ if (bdi == NULL)
+ bdi = &default_backing_dev_info;
+ bdev_inode_switch_bdi(bdev->bd_inode, bdi);
+ }
+
/*
* If the device is invalidated, rescan partition
* if open succeeded or failed with -ENOMEDIUM.
@@ -1130,14 +1139,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
rescan_partitions(disk, bdev);
if (ret)
goto out_clear;
-
- if (!bdev->bd_openers) {
- bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
- bdi = blk_get_backing_dev_info(bdev);
- if (bdi == NULL)
- bdi = &default_backing_dev_info;
- bdev_inode_switch_bdi(bdev->bd_inode, bdi);
- }
} else {
struct block_device *whole;
whole = bdget_disk(disk, 0);

2011-05-19 17:39:44

by Tony Luck

[permalink] [raw]
Subject: Re: New boot time message: detected capacity change

On Thu, May 19, 2011 at 2:30 AM, Tejun Heo <[email protected]> wrote:
> Yes, it is harmless. ?The order was changed between
> rescan_partitions() and bd_set_size() - rescan_partitions() checked
> for size change too and it resulted in simpler condition checks.
> However, rescan_partitions() triggers the above message when it
> detects size change unlike the silient bd_set_size().
>
> Does the following patch remove the messages for you?

Yes - messages are gone with this patch applied.

Thanks

-Tony

2011-05-23 11:25:01

by Tejun Heo

[permalink] [raw]
Subject: [PATCH] block: move bd_set_size() above rescan_partitions() in __blkdev_get()

02e352287a4 (block: rescan partitions on invalidated devices on
-ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
to simplify condition check. As rescan_partitions() does its own bdev
size setting, this doesn't break anything; however,
rescan_partitions() prints out the following messages when adjusting
bdev size, which can be confusing.

sda: detected capacity change from 0 to 146815737856
sdb: detected capacity change from 0 to 146815737856

This patch restores the original order and remove the warning
messages.

stable: Please apply together with 02e352287a4 (block: rescan
partitions on invalidated devices on -ENOMEDIA too).

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Tony Luck <[email protected]>
Tested-by: Tony Luck <[email protected]>
Cc: [email protected]
---
fs/block_dev.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 257b00e..bf9c7a7 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1120,6 +1120,15 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
goto restart;
}
}
+
+ if (!ret && !bdev->bd_openers) {
+ bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
+ bdi = blk_get_backing_dev_info(bdev);
+ if (bdi == NULL)
+ bdi = &default_backing_dev_info;
+ bdev_inode_switch_bdi(bdev->bd_inode, bdi);
+ }
+
/*
* If the device is invalidated, rescan partition
* if open succeeded or failed with -ENOMEDIUM.
@@ -1130,14 +1139,6 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
rescan_partitions(disk, bdev);
if (ret)
goto out_clear;
-
- if (!bdev->bd_openers) {
- bd_set_size(bdev,(loff_t)get_capacity(disk)<<9);
- bdi = blk_get_backing_dev_info(bdev);
- if (bdi == NULL)
- bdi = &default_backing_dev_info;
- bdev_inode_switch_bdi(bdev->bd_inode, bdi);
- }
} else {
struct block_device *whole;
whole = bdget_disk(disk, 0);

2011-05-23 15:51:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] block: move bd_set_size() above rescan_partitions() in __blkdev_get()

On Mon, May 23, 2011 at 4:24 AM, Tejun Heo <[email protected]> wrote:
>
> 02e352287a4 (block: rescan partitions on invalidated devices on
> -ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
> to simplify condition check. ?As rescan_partitions() does its own bdev
> size setting, this doesn't break anything; however,
> rescan_partitions() prints out the following messages when adjusting
> bdev size, which can be confusing.

Applied,

Linus

2011-05-23 16:31:38

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] block: move bd_set_size() above rescan_partitions() in __blkdev_get()

On 2011-05-23 17:51, Linus Torvalds wrote:
> On Mon, May 23, 2011 at 4:24 AM, Tejun Heo <[email protected]> wrote:
>>
>> 02e352287a4 (block: rescan partitions on invalidated devices on
>> -ENOMEDIA too) relocated partition rescan above explicit bd_set_size()
>> to simplify condition check. As rescan_partitions() does its own bdev
>> size setting, this doesn't break anything; however,
>> rescan_partitions() prints out the following messages when adjusting
>> bdev size, which can be confusing.
>
> Applied,

OK, added it when Tejun sent it as well, JFYI.

--
Jens Axboe