FYI, there is an obvious bug in cpusets in 2.6.15-rcX:
cpuset_excl_nodes_overlap() may sleep (as it takes semaphore), but is
called from atomic context - select_bad_process() under tasklist_lock.
BUG. Found by Denis Lunev.
the same actually applies to cpuset_zone_allowed() which is called e.g.
from __alloc_pages()->get_page_from_freelist() and doesn't check for
GPF_NOATOMIC anyhow...
Kirill
Kirill Korotaev wrote:
> FYI, there is an obvious bug in cpusets in 2.6.15-rcX:
> cpuset_excl_nodes_overlap() may sleep (as it takes semaphore), but is
> called from atomic context - select_bad_process() under tasklist_lock.
> BUG. Found by Denis Lunev.
Sorry for not responding sooner - I was off the air for a week.
Thanks for finding and reporting this.
Apparently, from KUROSAWA Takahiro's report, this bug was also in
2.6.14. My initial reading of the code in 2.6.14 and 2.6.15-* agrees,
and finds that this bug was present since the cpuset_excl_nodes_overlap
call was added, Sept 8, 2005 (in Linus's tree.)
> the same actually applies to cpuset_zone_allowed() which is called e.g.
> from __alloc_pages()->get_page_from_freelist() and doesn't check for
> GPF_NOATOMIC anyhow...
I don't think so. Please read the comments in kernel/cpuset.c above
the routine cpuset_zone_allowed(). Either that routine is called with
the __GFP_HARDWALL flag set, so returns before it gets to the semaphore
call, or it is not called at all, due to the check for ATOMIC (!wait)
in mm/page_alloc.c.
I don't see any bugs like this, in the cpuset_zone_allowed code path.
==> My initial analysis - I have one bug, in the oom_kill path,
where the code takes callback_sem while holding tasklist_ lock,
that has been in the main line kernel since 2.6.14.
My first guess is that it will take me about a week, with testing and
other priorities (including a few more days vacation), to respond with a
patch. Speak up if that doesn't meet your needs.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
>>FYI, there is an obvious bug in cpusets in 2.6.15-rcX:
>>cpuset_excl_nodes_overlap() may sleep (as it takes semaphore), but is
>>called from atomic context - select_bad_process() under tasklist_lock.
>>BUG. Found by Denis Lunev.
>
>
> Sorry for not responding sooner - I was off the air for a week.
>
> Thanks for finding and reporting this.
>
> Apparently, from KUROSAWA Takahiro's report, this bug was also in
> 2.6.14. My initial reading of the code in 2.6.14 and 2.6.15-* agrees,
> and finds that this bug was present since the cpuset_excl_nodes_overlap
> call was added, Sept 8, 2005 (in Linus's tree.)
>
>
>
>>the same actually applies to cpuset_zone_allowed() which is called e.g.
>>from __alloc_pages()->get_page_from_freelist() and doesn't check for
>>GPF_NOATOMIC anyhow...
>
>
> I don't think so. Please read the comments in kernel/cpuset.c above
> the routine cpuset_zone_allowed(). Either that routine is called with
> the __GFP_HARDWALL flag set, so returns before it gets to the semaphore
> call, or it is not called at all, due to the check for ATOMIC (!wait)
> in mm/page_alloc.c.
>
> I don't see any bugs like this, in the cpuset_zone_allowed code path.
this piece of code in __alloc_pages():
if (((p->flags & PF_MEMALLOC) ||
unlikely(test_thread_flag(TIF_MEMDIE)))
&& !in_interrupt()) {
if (!(gfp_mask & __GFP_NOMEMALLOC)) {
nofail_alloc:
/* go through the zonelist yet again, ignoring
mins */
page = get_page_from_freelist(gfp_mask, order,
zonelist,
ALLOC_NO_WATERMARKS|ALLOC_CPUSET);
ALLOC_CPUSET is specified, gfp_mask can be GFP_ATOMIC still and no
__GFP_HARDWALL. Am I wrong?
Kirill
Hi Mr. Jackson,
I've been stress testing 2.6.15 on an IntelliStation Z20 in our lab, and
have seen the same complaints that were described earlier in this thread
(sleeping in the cpuset code under OOM conditions):
Debug: sleeping function called from invalid context at
include/asm/semaphore.h:105
in_atomic():1, irqs_disabled():0
Call Trace:<ffffffff80130640>{__might_sleep+179}
<ffffffff8015bb68>{cpuset_excl_nodes_overlap+31}
<ffffffff80164d9f>{out_of_memory+123}
<ffffffff801676c4>{__alloc_pages+564}
<ffffffff8017ec41>{alloc_pages_current+160}
<ffffffff8016873f>{__do_page_cache_readahead+197}
<ffffffff8010e5c3>{__switch_to+50}
<ffffffff8031c547>{_spin_unlock_irqrestore+52}
<ffffffff80166604>{free_pages_bulk+674}
<ffffffff80168a00>{do_page_cache_readahead+82}
<ffffffff8016355a>{filemap_nopage+332}
<ffffffff8017378f>{__handle_mm_fault+1162}
<ffffffff8031c547>{_spin_unlock_irqrestore+52}
<ffffffff8031df02>{do_page_fault+1096}
<ffffffff801991ee>{sys_select+839} <ffffffff8011078d>{error_exit+0}
Since we seem to be able to reproduce this with some regularity, let me
know if you'd like us to test out any cpuset patch(es).
--Darrick
> Kirill Korotaev wrote:
>> FYI, there is an obvious bug in cpusets in 2.6.15-rcX:
>> cpuset_excl_nodes_overlap() may sleep (as it takes semaphore), but is
>> called from atomic context - select_bad_process() under tasklist_lock.
>> BUG. Found by Denis Lunev.
>
> Sorry for not responding sooner - I was off the air for a week.
>
> Thanks for finding and reporting this.
>
> Apparently, from KUROSAWA Takahiro's report, this bug was also in
> 2.6.14. My initial reading of the code in 2.6.14 and 2.6.15-* agrees,
> and finds that this bug was present since the cpuset_excl_nodes_overlap
> call was added, Sept 8, 2005 (in Linus's tree.)
>
>
>> the same actually applies to cpuset_zone_allowed() which is called e.g.
>> from __alloc_pages()->get_page_from_freelist() and doesn't check for
>> GPF_NOATOMIC anyhow...
>
> I don't think so. Please read the comments in kernel/cpuset.c above
> the routine cpuset_zone_allowed(). Either that routine is called with
> the __GFP_HARDWALL flag set, so returns before it gets to the semaphore
> call, or it is not called at all, due to the check for ATOMIC (!wait)
> in mm/page_alloc.c.
>
> I don't see any bugs like this, in the cpuset_zone_allowed code path.
>
>
> ==> My initial analysis - I have one bug, in the oom_kill path,
> where the code takes callback_sem while holding tasklist_ lock,
> that has been in the main line kernel since 2.6.14.
>
> My first guess is that it will take me about a week, with testing and
> other priorities (including a few more days vacation), to respond with a
> patch. Speak up if that doesn't meet your needs.