2015-02-03 00:28:24

by Gavin Guo

Subject: Re: General protection fault in iscsi_rx_thread_pre_handler

Hi Nicholas,

On Sun, Feb 1, 2015 at 11:47 AM, Gavin Guo <[email protected]> wrote:
> Hi Nicholas,
>
> On Sat, Jan 31, 2015 at 6:53 AM, Nicholas A. Bellinger
> <[email protected]> wrote:
>> On Fri, 2015-01-23 at 09:30 +0800, Gavin Guo wrote:
>>> Hi Nicholas,
>>>
>>> On Fri, Jan 23, 2015 at 1:35 AM, Nicholas A. Bellinger
>>> <[email protected]> wrote:
>>> > On Thu, 2015-01-22 at 23:56 +0800, Gavin Guo wrote:
>>> >> Hi Nicholas,
>>> >>
>>> >> On Thu, Jan 22, 2015 at 5:50 PM, Nicholas A. Bellinger
>>> >> <[email protected]> wrote:
>>> >> > Hi Gavin,
>>> >> >
>>> >> > On Thu, 2015-01-22 at 06:38 +0800, Gavin Guo wrote:
>>> >> >> Hi all,
>>> >> >>
>>> >> >> The general protection fault screenshot is attached.
>>> >> >>
>>> >> >> Summary:
>>> >> >> The kernel is Ubuntu-3.13.0-39.66. I've done a basic analysis and found
>>> >> >> that the fault is in the list_del call in iscsi_del_ts_from_active_list.
>>> >> >> It looks like the same iscsi_thread_set *ts is being deleted twice. The
>>> >> >> deletion points, including iscsi_get_ts_from_inactive_list, were also
>>> >> >> checked, but I still can't find a clue. I would really appreciate any
>>> >> >> ideas on this bug.
>>> >> >>
>>> >
>>> > <SNIP>
>>> >
>>> >> >
>>> >> > Thanks for your detailed analysis.
>>> >> >
>>> >> > A similar bug was reported off-list some months back by a person using
>>> >> > iser-target + RoCE export on v3.12.y code. Just to confirm, your
>>> >> > environment is using traditional iscsi-target + TCP export, right..?
>>> >>
>>> >> I'm sorry, I'm not an expert in this field. I have googled RoCE but
>>> >> still don't really know what it is. However, I can describe the
>>> >> setup: we used iscsiadm on the initiator side and the lio_node and
>>> >> tcm_node commands to create the targets for the connection. I think
>>> >> it should be the normal iscsi-target using TCP export.
>>> >>
>>> >
>>> > Yep, that would be traditional iscsi-target + TCP export.
>>> >
>>> >> >
>>> >> > At the time, a different set of iser-target related changes ended up
>>> >> > avoiding this issue on his particular setup, so we thought it was likely
>>> >> > a race triggered by login failures specific to iser-target code.
>>> >> >
>>> >> > There was an untested patch (included inline below) to drop the legacy
>>> >> > active_ts_list usage altogether, but IIRC he was not able to reproduce
>>> >> > further so the patch didn't get picked up for mainline.
>>> >> >
>>> >> > If you're able to reliably reproduce, please try with the following
>>> >> > patch and let us know your progress.
>>> >>
>>> >> Thanks for taking the time to read the mail. I'll let you know the result.
>>> >
>>> > Just curious, are you able to reliably reproduce this bug in a VM..?
>>>
>>> Thanks for asking. The machine is on the customer side; I've asked
>>> and am now waiting for their response.
>>
>> Hi Gavin,
>>
>> Just curious if there has been any update on this yet..?
>>
>> --nab
>>
>
> Thank you very much for your attention. I'm still waiting for the
> customer's reply and will email them again to ask for the result.
> However, I think the symptom may be hard to replicate, which is why the
> customer hasn't replied for a long time. Thanks again for your time.
>
> Thanks,
> Gavin

Sorry for making you wait so long. I just got the response from the
customer: the general protection fault has happened only twice so far
and cannot be reliably reproduced. I am now waiting for the
verification test.

Thanks,
Gavin


2015-02-12 07:16:26

by Nicholas A. Bellinger

Subject: Re: General protection fault in iscsi_rx_thread_pre_handler

Hi Gavin,

On Tue, 2015-02-03 at 08:28 +0800, Gavin Guo wrote:
> Hi Nicholas,
>
> Sorry for making you wait so long. I just got the response from the
> customer: the general protection fault has happened only twice so far
> and cannot be reliably reproduced. I am now waiting for the
> verification test.
>

Just a heads up that I'm planning to include this patch in the v3.20-rc1
PULL request.

Please let me know if you have any objections.

Thank you,

--nab

2015-02-16 10:52:22

by Gavin Guo

Subject: Re: General protection fault in iscsi_rx_thread_pre_handler

Hi Nicholas,

On Thu, Feb 12, 2015 at 3:16 PM, Nicholas A. Bellinger
<[email protected]> wrote:
> Hi Gavin,
>
> Just a heads up that I'm planning to include this patch in the v3.20-rc1
> PULL request.
>
> Please let me know if you have any objections.
>
> Thank you,
>
> --nab
>

The bug

2015-02-16 10:56:34

by Gavin Guo

Subject: Re: General protection fault in iscsi_rx_thread_pre_handler

Hi Nicholas,

On Mon, Feb 16, 2015 at 6:52 PM, Gavin Guo <[email protected]> wrote:
> Hi Nicholas,
>
> On Thu, Feb 12, 2015 at 3:16 PM, Nicholas A. Bellinger
> <[email protected]> wrote:
>> Hi Gavin,
>>
>> Just a heads up that I'm planning to include this patch in the v3.20-rc1
>> PULL request.
>>
>> Please let me know if you have any objections.
>>
>> Thank you,
>>
>> --nab
>>
>
> The bug

Sorry, I mistakenly pressed the send button last time.

The bug has not appeared since the customer upgraded to a kernel with the
patch. Thank you very much for your help. I'll keep you posted if the bug
appears again.

Thanks,
Gavin