2016-12-07 03:55:13

by Oleg Drokin

Subject: [PATCH] staging/lustre/osc: Revert erroneous list_for_each_entry_safe use

I have been having a lot of unexplainable crashes in osc_lru_shrink
lately that I could not find a good explanation for, and then I found
this patch that somehow slipped under the radar and incorrectly
converted the while loop used for LRU list iteration into
list_for_each_entry_safe, totally ignoring that in the body of
the loop we drop the spinlocks guarding this list and move list
entries around.
I am not sure why it was not showing up right away; perhaps some of
the more recent LRU changes caused extra pressure on this code that
finally exposed the breakage.

Reverts: 8adddc36b1fc ("staging: lustre: osc: Use list_for_each_entry_safe")
CC: Bhaktipriya Shridhar <[email protected]>
Signed-off-by: Oleg Drokin <[email protected]>
---
I also do not see this patch on any of the mailing lists I am
subscribed to. I wonder if there's a way to subscribe to Greg's
"This is a note to let you know that I've just added the patch ...."
emails that concern Lustre, to get them even if I am not on the CC
list of the patch itself?
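
To make the failure mode concrete, here is a minimal sketch of both
patterns (made-up names and simplified locking, not the actual osc
code). list_for_each_entry_safe() caches a pointer to the next entry
at the top of each iteration, which is only safe while the lock
protecting the list is held across the whole iteration:

	static LIST_HEAD(head);
	static DEFINE_SPINLOCK(lock);
	struct item { struct list_head member; } *pos, *tmp;

	/* BROKEN here: 'tmp' goes stale once the lock is dropped */
	spin_lock(&lock);
	list_for_each_entry_safe(pos, tmp, &head, member) {
		spin_unlock(&lock);
		process(pos);		/* may move or free list entries */
		spin_lock(&lock);	/* iterator advances to stale 'tmp' */
	}
	spin_unlock(&lock);

	/* The pattern being restored: re-read the list head after
	 * every re-acquisition instead of trusting a cached pointer.
	 */
	spin_lock(&lock);
	while (!list_empty(&head)) {
		pos = list_entry(head.next, struct item, member);
		spin_unlock(&lock);
		process(pos);
		spin_lock(&lock);
	}
	spin_unlock(&lock);

In osc_lru_shrink() the while loop is additionally bounded by the
maxscan counter, since busy pages are moved back to the tail of the
list rather than removed.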

drivers/staging/lustre/lustre/osc/osc_page.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lustre/osc/osc_page.c b/drivers/staging/lustre/lustre/osc/osc_page.c
index c5129d1..e356e4a 100644
--- a/drivers/staging/lustre/lustre/osc/osc_page.c
+++ b/drivers/staging/lustre/lustre/osc/osc_page.c
@@ -537,7 +537,6 @@ long osc_lru_shrink(const struct lu_env *env, struct client_obd *cli,
struct cl_object *clobj = NULL;
struct cl_page **pvec;
struct osc_page *opg;
- struct osc_page *temp;
int maxscan = 0;
long count = 0;
int index = 0;
@@ -568,7 +567,7 @@ long osc_lru_shrink(const struct lu_env *env, struct client_obd *cli,
if (force)
cli->cl_lru_reclaim++;
maxscan = min(target << 1, atomic_long_read(&cli->cl_lru_in_list));
- list_for_each_entry_safe(opg, temp, &cli->cl_lru_list, ops_lru) {
+ while (!list_empty(&cli->cl_lru_list)) {
struct cl_page *page;
bool will_free = false;

@@ -578,6 +577,8 @@ long osc_lru_shrink(const struct lu_env *env, struct client_obd *cli,
if (--maxscan < 0)
break;

+ opg = list_entry(cli->cl_lru_list.next, struct osc_page,
+ ops_lru);
page = opg->ops_cl.cpl_page;
if (lru_page_busy(cli, page)) {
list_move_tail(&opg->ops_lru, &cli->cl_lru_list);
--
2.7.4


2016-12-07 10:40:37

by Greg KH

Subject: Re: [PATCH] staging/lustre/osc: Revert erroneous list_for_each_entry_safe use

On Tue, Dec 06, 2016 at 10:53:48PM -0500, Oleg Drokin wrote:
> I have been having a lot of unexplainable crashes in osc_lru_shrink
> lately that I could not find a good explanation for, and then I found
> this patch that somehow slipped under the radar and incorrectly
> converted the while loop used for LRU list iteration into
> list_for_each_entry_safe, totally ignoring that in the body of
> the loop we drop the spinlocks guarding this list and move list
> entries around.
> I am not sure why it was not showing up right away; perhaps some of
> the more recent LRU changes caused extra pressure on this code that
> finally exposed the breakage.
>
> Reverts: 8adddc36b1fc ("staging: lustre: osc: Use list_for_each_entry_safe")
> CC: Bhaktipriya Shridhar <[email protected]>
> Signed-off-by: Oleg Drokin <[email protected]>
> ---
> I also do not see this patch on any of the mailing lists I am
> subscribed to. I wonder if there's a way to subscribe to Greg's
> "This is a note to let you know that I've just added the patch ...."
> emails that concern Lustre, to get them even if I am not on the CC
> list of the patch itself?

This came in from the Outreachy application process, which now
requires that they cc: the maintainers to catch this type of issue.
So you should have seen these types of patches this last round; the
commit you reference was done before that change happened, sorry.

This change should go to stable kernels, so I'll mark it that way.

thanks,

greg k-h

2016-12-07 16:30:22

by Oleg Drokin

Subject: Re: [PATCH] staging/lustre/osc: Revert erroneous list_for_each_entry_safe use


On Dec 7, 2016, at 5:40 AM, Greg Kroah-Hartman wrote:

> On Tue, Dec 06, 2016 at 10:53:48PM -0500, Oleg Drokin wrote:
>> I have been having a lot of unexplainable crashes in osc_lru_shrink
>> lately that I could not find a good explanation for, and then I found
>> this patch that somehow slipped under the radar and incorrectly
>> converted the while loop used for LRU list iteration into
>> list_for_each_entry_safe, totally ignoring that in the body of
>> the loop we drop the spinlocks guarding this list and move list
>> entries around.
>> I am not sure why it was not showing up right away; perhaps some of
>> the more recent LRU changes caused extra pressure on this code that
>> finally exposed the breakage.
>>
>> Reverts: 8adddc36b1fc ("staging: lustre: osc: Use list_for_each_entry_safe")
>> CC: Bhaktipriya Shridhar <[email protected]>
>> Signed-off-by: Oleg Drokin <[email protected]>
>> ---
>> I also do not see this patch on any of the mailing lists I am
>> subscribed to. I wonder if there's a way to subscribe to Greg's
>> "This is a note to let you know that I've just added the patch ...."
>> emails that concern Lustre, to get them even if I am not on the CC
>> list of the patch itself?
>
> This came in from the Outreachy application process, which now
> requires that they cc: the maintainers to catch this type of issue.
> So you should have seen these types of patches this last round; the
> commit you reference was done before that change happened, sorry.

Do you know the approximate date range of when these patches were sneaking in?
I'd like to take a look at the rest of them proactively, just to see
if there are more undiscovered surprises.

> This change should go to stable kernels, so I'll mark it that way.

Thanks!

2016-12-07 20:37:38

by Greg KH

Subject: Re: [PATCH] staging/lustre/osc: Revert erroneous list_for_each_entry_safe use

On Wed, Dec 07, 2016 at 11:29:36AM -0500, Oleg Drokin wrote:
>
> On Dec 7, 2016, at 5:40 AM, Greg Kroah-Hartman wrote:
>
> > On Tue, Dec 06, 2016 at 10:53:48PM -0500, Oleg Drokin wrote:
> >> I have been having a lot of unexplainable crashes in osc_lru_shrink
> >> lately that I could not find a good explanation for, and then I found
> >> this patch that somehow slipped under the radar and incorrectly
> >> converted the while loop used for LRU list iteration into
> >> list_for_each_entry_safe, totally ignoring that in the body of
> >> the loop we drop the spinlocks guarding this list and move list
> >> entries around.
> >> I am not sure why it was not showing up right away; perhaps some of
> >> the more recent LRU changes caused extra pressure on this code that
> >> finally exposed the breakage.
> >>
> >> Reverts: 8adddc36b1fc ("staging: lustre: osc: Use list_for_each_entry_safe")
> >> CC: Bhaktipriya Shridhar <[email protected]>
> >> Signed-off-by: Oleg Drokin <[email protected]>
> >> ---
> >> I also do not see this patch on any of the mailing lists I am
> >> subscribed to. I wonder if there's a way to subscribe to Greg's
> >> "This is a note to let you know that I've just added the patch ...."
> >> emails that concern Lustre, to get them even if I am not on the CC
> >> list of the patch itself?
> >
> > This came in from the Outreachy application process, which now
> > requires that they cc: the maintainers to catch this type of issue.
> > So you should have seen these types of patches this last round; the
> > commit you reference was done before that change happened, sorry.
>
> Do you know the approximate date range of when these patches were sneaking in?

Anytime before a few months ago.

> I'd like to take a look at the rest of them proactively, just to see
> if there are more undiscovered surprises.

If your testing isn't finding any problems, all should be good, right?
:)

2016-12-07 21:17:41

by Oleg Drokin

Subject: Re: [lustre-devel] [PATCH] staging/lustre/osc: Revert erroneous list_for_each_entry_safe use


On Dec 7, 2016, at 3:37 PM, Greg Kroah-Hartman wrote:

> On Wed, Dec 07, 2016 at 11:29:36AM -0500, Oleg Drokin wrote:
>>
>> On Dec 7, 2016, at 5:40 AM, Greg Kroah-Hartman wrote:
>>
>>> On Tue, Dec 06, 2016 at 10:53:48PM -0500, Oleg Drokin wrote:
>>>> I have been having a lot of unexplainable crashes in osc_lru_shrink
>>>> lately that I could not find a good explanation for, and then I found
>>>> this patch that somehow slipped under the radar and incorrectly
>>>> converted the while loop used for LRU list iteration into
>>>> list_for_each_entry_safe, totally ignoring that in the body of
>>>> the loop we drop the spinlocks guarding this list and move list
>>>> entries around.
>>>> I am not sure why it was not showing up right away; perhaps some of
>>>> the more recent LRU changes caused extra pressure on this code that
>>>> finally exposed the breakage.
>>>>
>>>> Reverts: 8adddc36b1fc ("staging: lustre: osc: Use list_for_each_entry_safe")
>>>> CC: Bhaktipriya Shridhar <[email protected]>
>>>> Signed-off-by: Oleg Drokin <[email protected]>
>>>> ---
>>>> I also do not see this patch on any of the mailing lists I am
>>>> subscribed to. I wonder if there's a way to subscribe to Greg's
>>>> "This is a note to let you know that I've just added the patch ...."
>>>> emails that concern Lustre, to get them even if I am not on the CC
>>>> list of the patch itself?
>>>
>>> This came in from the Outreachy application process, which now
>>> requires that they cc: the maintainers to catch this type of issue.
>>> So you should have seen these types of patches this last round; the
>>> commit you reference was done before that change happened, sorry.
>>
>> Do you know the approximate date range of when these patches were sneaking in?
>
> Anytime before a few months ago.

Ugh, I see.

>> I'd like to take a look at the rest of them proactively, just to see
>> if there are more undiscovered surprises.
>
> If your testing isn't finding any problems, all should be good, right?
> :)

I do (rarely) see processes hanging while waiting for an RPC response,
which is very suspicious, but I have not gotten to the root of it yet.
Also, my test system is limited in capacity; they don't let me anywhere
near those TOP100 systems with the staging client ;)