2014-06-30 10:32:20

by Paul Mackerras

Subject: Regression in 3.15 on POWER8 with multipath SCSI

I have a machine on which 3.15 usually fails to boot, and 3.14 boots
every time. The machine is a POWER8 2-socket server with 20 cores
(thus 160 CPUs), 128GB of RAM, and 7 SCSI disks connected via a
hardware-RAID-capable adapter which appears as two IPR controllers
which are both connected to each disk. I am booting from a disk that
has Fedora 20 installed on it.

After over two weeks of bisections, I can finally point to the commits
that cause the problems. The culprits are:

3e9f1be1 dm mpath: remove process_queued_ios()
e8099177 dm mpath: push back requests instead of queueing
bcccff93 kobject: don't block for each kobject_uevent

The interesting thing is that neither e8099177 nor bcccff93 causes
failures on its own, but with both commits in, the system sometimes
fails to find /home.

With 3e9f1be1 included, the system appears to be prone to a deadlock
condition which typically causes the boot process to hang with this
message showing:

A start job is running for Monitoring of LVM2 mirror...rogress polling

(preceded by a "[*** ]" indicator in which the asterisks move back and
forth).

If I revert 63d832c3 ("dm mpath: really fix lockdep warning"),
4cdd2ad7 ("dm mpath: fix lock order inconsistency in
multipath_ioctl"), 3e9f1be1 and bcccff93, in that order, I get a
kernel that will boot every time. The first two are later commits
that fix some problems with 3e9f1be1 (though not the problems I am
seeing).

Can anyone see any reason why e8099177 and bcccff93 would interfere
with each other?

-----

The rest of this email outlines the steps I took to identify these
commits. I first identified that 3.15-rc1 would sometimes fail to
boot, and did a bisection between 3.15 and 3.15-rc1 that identified
3e9f1be1 as the bad commit. I then took 3.15-rc8 and reverted
63d832c3, 4cdd2ad7 and 3e9f1be1, and tested that. It didn't deadlock,
but it still sometimes failed to find root or /home and thus failed
to boot.

To debug this second problem, I tested the commit before Linus merged
in the dm modifications: 3f583bc2 ("Merge tag 'iommu-updates-v3.15' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu"). It was
fine. I then took 0596661f ("dm cache: fix a lock-inversion"), which
is what Linus merged in during the 3.15 merge window, reverted
3e9f1be1 on top of it, and tested the result; it was also fine.
The ID of that revert commit was 9cfd3fe8 (that ID doesn't appear in
any public tree, of course).

Interestingly, the merge of 3f583bc2 with 9cfd3fe8 was bad. To track
this down, I first rebased the commits from the dm-3.15-changes branch
except for 3e9f1be1 on top of 3f583bc2, and bisected between 3f583bc2
and the tip of that branch. That bisection pointed to e8099177. I
tried reverting it from 3.15-rc8, but it doesn't revert cleanly, and
the conflicts were too complex for me to resolve manually.

Next I did a git bisection between 3.14 and 3f583bc2, merging in
9cfd3fe8 at each point before testing. That identified bcccff93 as
the first bad commit, and indeed 3.15 with bcccff93 reverted was not
prone to failing to find root or /home.

Paul.


2014-06-30 10:52:35

by Hannes Reinecke

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 06/30/2014 12:30 PM, Paul Mackerras wrote:
> [ .. ]
> Can anyone see any reason why e8099177 and bcccff93 would interfere
> with each other?
>
It might be running afoul of the 'cookie' mechanism.
Device-mapper inserts a 'cookie' with the ioctl, and
listens for any event containing the cookie to ensure udev has
finished processing that device and hence the device node is
accessible. Added to this is the problem that we don't have any good
means of detecting changes to device-mapper devices.

E.g. look at this sequence of events:

add dm-1
remove dm-1
add dm-1

Originally udev would pick up the event, read the details from
sysfs, and return control to the kernel.
With bcccff93 udev will _not_ have a chance to read the details
from sysfs for 'dm-1', as anything read from sysfs relating to
'dm-1' might in fact refer to the _second_ 'add' event, which might
be a totally different device.
As far as I know udev doesn't have any mechanism to drop events,
so it'll always process all events, assuming that the sysfs
attributes it reads _do_ relate to that event. If they don't, things
become interesting ...

(Actually, this issue was always present, especially with
multipathing. multipath occasionally can become sluggish when
processing events, so the same might happen with it. We've tried to
work around this, but never found a fool-proof way of doing so).

Adding Kay as he might have some more insight here.

Another thing:
Do you run LVM on top of multipathing?
If so, could you set up your system without LVM and
disable the LVM service?

Reasoning here is that multipath should be less susceptible to
these changes than LVM2 is (don't nail me on this, I'm not _that_ into
LVM2 details).

And as the system is stuck while waiting for LVM it might indeed be
a side-effect of running LVM on top of multipathing.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

2014-06-30 11:02:14

by Paul Mackerras

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On Mon, Jun 30, 2014 at 12:52:29PM +0200, Hannes Reinecke wrote:
> [ .. ]
> Another thing:
> Do you run LVM on top of multipathing?
> If so, could you set up your system without LVM and disable the
> LVM service?

No, I'm not using LVM, and in fact I deleted all the physical volumes
that were on any of the disks (they were installations of other
distros), so there are no physical or logical volumes anywhere on any
disk. I haven't tried disabling the LVM service completely, though.
What would it mean if disabling the LVM service made a difference?

> Reasoning here is that multipath should be less susceptible to these
> changes than LVM2 is (don't nail me on this, I'm not _that_ into LVM2 details).
>
> And as the system is stuck while waiting for LVM it might indeed be a
> side-effect of running LVM on top of multipathing.

Yes, I thought so too early on, and that's why I deleted all LVM
physical and logical volumes, but that didn't help.

Paul.

2014-06-30 11:35:54

by Hannes Reinecke

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 06/30/2014 01:02 PM, Paul Mackerras wrote:
> On Mon, Jun 30, 2014 at 12:52:29PM +0200, Hannes Reinecke wrote:
>> [ .. ]
>> Another thing:
>> Do you run LVM on top of multipathing?
>> If so, could you set up your system without LVM and disable the
>> LVM service?
>
> No, I'm not using LVM, and in fact I deleted all the physical volumes
> that were on any of the disks (they were installations of other
> distros), so there are no physical or logical volumes anywhere on any
> disk. I haven't tried disabling the LVM service completely, though.
> What would it mean if disabling the LVM service made a difference?
>
Yes. LVM integration with systemd is a science unto itself.
I'm reasonably confident with multipath, but not LVM.
Plus the fact that the LVM service apparently is waiting for
something sort of points in that direction.

So please do disable the lvm service.

Cheers,

Hannes

2014-06-30 21:30:21

by Paul Mackerras

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On Mon, Jun 30, 2014 at 01:35:20PM +0200, Hannes Reinecke wrote:
> On 06/30/2014 01:02 PM, Paul Mackerras wrote:
> >[ .. ]
> >No, I'm not using LVM, and in fact I deleted all the physical volumes
> >that were on any of the disks (they were installations of other
> >distros), so there are no physical or logical volumes anywhere on any
> >disk. I haven't tried disabling the LVM service completely, though.
> >What would it mean if disabling the LVM service made a difference?
> >
> Yes. LVM integration with systemd is a science unto itself.
> I'm reasonably confident with multipath, but not LVM.
> Plus the fact that the LVM service apparently is waiting for something sort
> of points in that direction.
>
> So please do disable the lvm service.

I disabled the LVM service, and it's still bad. Unmodified 3.15
booted successfully in only 18 out of 50 attempts with LVM disabled.

So it's not LVM. In any case LVM was fine with a 3.14 kernel.

Paul.

2014-06-30 21:30:19

by Paul Mackerras

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On Mon, Jun 30, 2014 at 12:52:29PM +0200, Hannes Reinecke wrote:
> [ .. ]
> It might be running afoul of the 'cookie' mechanism.
> Device-mapper inserts a 'cookie' with the ioctl, and listens for
> any event containing the cookie to ensure udev has finished processing that
> device and hence the device node is accessible. Added to this is the problem
> that we don't have any good means of detecting changes to device-mapper
> devices.

How does that relate to e8099177? Did e8099177 introduce this cookie
mechanism? If not, what is it about e8099177 that makes the async
processing problematic?

Paul.

2014-07-01 05:58:00

by Hannes Reinecke

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 06/30/2014 11:28 PM, Paul Mackerras wrote:
> On Mon, Jun 30, 2014 at 01:35:20PM +0200, Hannes Reinecke wrote:
>> On 06/30/2014 01:02 PM, Paul Mackerras wrote:
[ .. ]
>>>
>>> No, I'm not using LVM, and in fact I deleted all the physical volumes
>>> that were on any of the disks (they were installations of other
>>> distros), so there are no physical or logical volumes anywhere on any
>>> disk. I haven't tried disabling the LVM service completely, though.
>>> What would it mean if disabling the LVM service made a difference?
>>>
>> Yes. LVM integration with systemd is a science unto itself.
>> I'm reasonably confident with multipath, but not LVM.
>> Plus the fact that the LVM service apparently is waiting for something sort
>> of points in that direction.
>>
>> So please do disable the lvm service.
>
> I disabled the LVM service, and it's still bad. Unmodified 3.15
> booted successfully in only 18 out of 50 attempts with LVM disabled.
>
> So it's not LVM. In any case LVM was fine with a 3.14 kernel.
>
Right, that was just a cross-check to eliminate any variables.

I'll be checking here at my end.

Cheers,

Hannes

2014-07-01 19:39:18

by Mike Snitzer

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On Mon, Jun 30 2014 at 6:30am -0400,
Paul Mackerras <[email protected]> wrote:

> [ .. ]
> If I revert 63d832c3 ("dm mpath: really fix lockdep warning"),
> 4cdd2ad7 ("dm mpath: fix lock order inconsistency in
> multipath_ioctl"), 3e9f1be1 and bcccff93, in that order, I get a
> kernel that will boot every time. The first two are later commits
> that fix some problems with 3e9f1be1 (though not the problems I am
> seeing).
>
> Can anyone see any reason why e8099177 and bcccff93 would interfere
> with each other?

No, not seeing any obvious relation.

But even though you listed e8099177 as a culprit you didn't list it as a
commit you reverted. Did you leave e8099177 simply because attempting
to revert it fails (if you don't first revert other dm-mpath.c commits)?

(btw, Bart Van Assche also has issues with commit e8099177 due to hangs
during cable pull testing of mpath devices -- Bart: curious to know if
your cable pull tests pass if you just revert bcccff93).

Mike

2014-07-02 15:30:54

by Bart Van Assche

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 07/01/14 21:39, Mike Snitzer wrote:
> (btw, Bart Van Assche also has issues with commit e8099177 due to hangs
> during cable pull testing of mpath devices -- Bart: curious to know if
> your cable pull tests pass if you just revert bcccff93).

Sorry, but even with bcccff93 reverted, my cable pull simulation test
still causes several tasks to hang in sleep_on_page() after a few iterations.

Bart.

2014-07-08 11:02:36

by Junichi Nomura

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 07/02/14 04:39, Mike Snitzer wrote:
> [ .. ]
> No, not seeing any obvious relation.
>
> But even though you listed e8099177 as a culprit you didn't list it as a
> commit you reverted. Did you leave e8099177 simply because attempting
> to revert it fails (if you don't first revert other dm-mpath.c commits)?
>
> (btw, Bart Van Assche also has issues with commit e8099177 due to hangs
> during cable pull testing of mpath devices -- Bart: curious to know if
> your cable pull tests pass if you just revert bcccff93).

It seems Bart's issue goes away with the attached patch:
http://www.redhat.com/archives/dm-devel/2014-July/msg00035.html
Could you try whether it makes any difference to your issue?

The problem is that dm-mpath's state machine stalls due to e8099177,
but an ioctl to the device can kick the state machine into running again.
That might be related to why bcccff93 affects the reproducibility.
Also, 3e9f1be1 integrates some code into the path which is affected
by this problem, so it makes sense that the problem occurs more easily
with that commit.

-
Jun'ichi Nomura, NEC Corporation


pg_ready() checks the current state of the multipath and may return
false even if a new IO is needed to change the state.

OTOH, if multipath_busy() returns busy, a new IO will not be sent
to the multipath target and the state change won't happen. That results
in a lockup.

The intent of multipath_busy() is to avoid unnecessary cycles of
dequeue + request_fn + requeue if it is known that the multipath device
will requeue.

Such situations would be:
- the path group is being activated
- there are no paths and the multipath device is set up to queue if no path is available (queue_if_no_path)

This patch should fix the problem introduced as a part of this commit:
commit e809917735ebf1b9a56c24e877ce0d320baee2ec
dm mpath: push back requests instead of queueing

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ebfa411..d58343e 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1620,8 +1620,9 @@ static int multipath_busy(struct dm_target *ti)
 
 	spin_lock_irqsave(&m->lock, flags);
 
-	/* pg_init in progress, requeue until done */
-	if (!pg_ready(m)) {
+	/* pg_init in progress or no paths available */
+	if (m->pg_init_in_progress ||
+	    (!m->nr_valid_paths && m->queue_if_no_path)) {
 		busy = 1;
 		goto out;
 	}

2014-07-09 03:55:34

by Alexey Kardashevskiy

Subject: Re: Regression in 3.15 on POWER8 with multipath SCSI

On 07/08/2014 08:28 PM, Junichi Nomura wrote:
> [ .. ]
> It seems Bart's issue goes away with the attached patch:
> http://www.redhat.com/archives/dm-devel/2014-July/msg00035.html
> Could you try whether it makes any difference to your issue?
> [ .. ]

This patch fixes IPR SCSI on my POWER8 box; e8099177 was the problem.

--
Alexey

2014-07-09 12:14:12

by Junichi Nomura

Subject: Re: [dm-devel] Regression in 3.15 on POWER8 with multipath SCSI

On 07/09/14 12:55, Alexey Kardashevskiy wrote:
> On 07/08/2014 08:28 PM, Junichi Nomura wrote:
>> It seems Bart's issue goes away with the attached patch:
>> http://www.redhat.com/archives/dm-devel/2014-July/msg00035.html
>> Could you try whether it makes any difference to your issue?
..
> This patch fixes IPR SCSI for my POWER8 box, e8099177 was the problem.

Thank you for the testing.

Mike Snitzer has picked up this patch for his tree:
https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-next&id=75c76c45b76e53b7c2f025d30e7e308bfe331004

--
Jun'ichi Nomura, NEC Corporation