On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
> What I have found is that a long running process on a mostly idle system
> with many CPUs is likely to cycle through a lot of the CPUs during its
> lifetime and leave behind its mm in the active_mm of those CPUs. My
> 2-socket test system has 96 logical CPUs. After running the test program
> for a minute or so, it leaves behind its mm in about half of the CPUs,
> with a mm_count of 45 after exit. So the dying mm will stay until all
> those 45 CPUs get new user tasks to run.
OK. On what kernel are you seeing this?
On current upstream, the code in native_flush_tlb_others()
will send a TLB flush to every CPU in mm_cpumask() if page
table pages have been freed.
That should cause the lazy TLB CPUs to switch to init_mm
when the exit->zap_page_range path gets to the point where
it frees page tables.
> > If it is only on the CPU where the task is exiting,
> > would the TASK_DEAD handling in finish_task_switch()
> > be a better place to handle this?
>
> I need to switch the mm off the dying one. mm switching is only done in
> context_switch(). I don't think finish_task_switch() is the right place.
mm switching is also done in flush_tlb_func_common,
if the CPU received a TLB shootdown IPI while in lazy
TLB mode.
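The lazy TLB branch in flush_tlb_func_common() (arch/x86/mm/tlb.c)
looks roughly like this (simplified sketch, not the exact upstream
code):

static void flush_tlb_func_common(const struct flush_tlb_info *f,
				  bool local, enum tlb_flush_reason reason)
{
	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		/*
		 * This CPU is only lazily tracking the mm being
		 * flushed; rather than flush it, just stop using
		 * it and point the hardware at init_mm.
		 */
		switch_mm_irqs_off(NULL, &init_mm, NULL);
		return;
	}

	/* ... otherwise do the requested TLB invalidation ... */
}

Note that switch_mm_irqs_off() here only changes the loaded mm; it
does not touch ->active_mm or mm_count.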
--
All Rights Reversed.
On 7/29/19 8:26 PM, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
>
>> What I have found is that a long running process on a mostly idle system
>> with many CPUs is likely to cycle through a lot of the CPUs during its
>> lifetime and leave behind its mm in the active_mm of those CPUs. My
>> 2-socket test system has 96 logical CPUs. After running the test program
>> for a minute or so, it leaves behind its mm in about half of the CPUs,
>> with a mm_count of 45 after exit. So the dying mm will stay until all
>> those 45 CPUs get new user tasks to run.
> OK. On what kernel are you seeing this?
>
> On current upstream, the code in native_flush_tlb_others()
> will send a TLB flush to every CPU in mm_cpumask() if page
> table pages have been freed.
>
> That should cause the lazy TLB CPUs to switch to init_mm
> when the exit->zap_page_range path gets to the point where
> it frees page tables.
>
I was using the latest upstream 5.3-rc2 kernel. It may be the case that
the mm has been switched, but the mm_count field of the active_mm of the
kthread is not being decremented until a user task runs on a CPU.
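For reference, the grab and drop of the lazy mm only happen in the
scheduler; roughly (simplified from kernel/sched/core.c, details may
differ slightly in 5.3-rc2):

	/* context_switch(): a kernel thread borrows prev's mm */
	if (!next->mm) {
		next->active_mm = prev->active_mm;
		mmgrab(prev->active_mm);	/* mm_count++ */
		enter_lazy_tlb(prev->active_mm, next);
	}

	/* finish_task_switch(): mm is rq->prev_mm, the mm the
	 * outgoing kernel thread was lazily holding */
	if (mm)
		mmdrop(mm);			/* mm_count-- */

On a kthread-to-kthread switch the drop just balances the grab, so
the dying mm keeps one reference per CPU until a task with its own
mm gets scheduled there, which is why mm_count stays elevated even
after the CPU itself has moved over to init_mm.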
>>> If it is only on the CPU where the task is exiting,
>>> would the TASK_DEAD handling in finish_task_switch()
>>> be a better place to handle this?
>> I need to switch the mm off the dying one. mm switching is only done in
>> context_switch(). I don't think finish_task_switch() is the right place.
> mm switching is also done in flush_tlb_func_common,
> if the CPU received a TLB shootdown IPI while in lazy
> TLB mode.
>
I see.
Cheers,
Longman
On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
> On 7/29/19 8:26 PM, Rik van Riel wrote:
> > On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
> >
> > > What I have found is that a long running process on a mostly idle
> > > system with many CPUs is likely to cycle through a lot of the CPUs
> > > during its lifetime and leave behind its mm in the active_mm of those
> > > CPUs. My 2-socket test system has 96 logical CPUs. After running the
> > > test program for a minute or so, it leaves behind its mm in about half
> > > of the CPUs, with a mm_count of 45 after exit. So the dying mm will
> > > stay until all those 45 CPUs get new user tasks to run.
> > OK. On what kernel are you seeing this?
> >
> > On current upstream, the code in native_flush_tlb_others()
> > will send a TLB flush to every CPU in mm_cpumask() if page
> > table pages have been freed.
> >
> > That should cause the lazy TLB CPUs to switch to init_mm
> > when the exit->zap_page_range path gets to the point where
> > it frees page tables.
> >
> I was using the latest upstream 5.3-rc2 kernel. It may be the case that
> the mm has been switched, but the mm_count field of the active_mm of the
> kthread is not being decremented until a user task runs on a CPU.
Is that something we could fix from the TLB flushing
code?
When switching to init_mm, drop the refcount on the
lazy mm?
That way that overhead is not added to the context
switching code.
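Something along these lines, maybe (hypothetical, untested sketch of
the idea; it ignores how to safely update the state of the task the
interrupt landed on):

	if (this_cpu_read(cpu_tlbstate.is_lazy)) {
		struct mm_struct *lazy_mm = this_cpu_read(cpu_tlbstate.loaded_mm);

		switch_mm_irqs_off(NULL, &init_mm, NULL);

		/*
		 * Hand back the lazy reference right here instead of
		 * waiting for the next user task on this CPU.
		 */
		current->active_mm = &init_mm;
		mmgrab(&init_mm);
		mmdrop(lazy_mm);
		return;
	}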
--
All Rights Reversed.
On 7/31/19 9:48 AM, Rik van Riel wrote:
> On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
>> On 7/29/19 8:26 PM, Rik van Riel wrote:
>>> On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
>>>
>>>> What I have found is that a long running process on a mostly idle
>>>> system with many CPUs is likely to cycle through a lot of the CPUs
>>>> during its lifetime and leave behind its mm in the active_mm of those
>>>> CPUs. My 2-socket test system has 96 logical CPUs. After running the
>>>> test program for a minute or so, it leaves behind its mm in about half
>>>> of the CPUs, with a mm_count of 45 after exit. So the dying mm will
>>>> stay until all those 45 CPUs get new user tasks to run.
>>> OK. On what kernel are you seeing this?
>>>
>>> On current upstream, the code in native_flush_tlb_others()
>>> will send a TLB flush to every CPU in mm_cpumask() if page
>>> table pages have been freed.
>>>
>>> That should cause the lazy TLB CPUs to switch to init_mm
>>> when the exit->zap_page_range path gets to the point where
>>> it frees page tables.
>>>
>> I was using the latest upstream 5.3-rc2 kernel. It may be the case that
>> the mm has been switched, but the mm_count field of the active_mm of the
>> kthread is not being decremented until a user task runs on a CPU.
> Is that something we could fix from the TLB flushing
> code?
>
> When switching to init_mm, drop the refcount on the
> lazy mm?
>
> That way that overhead is not added to the context
> switching code.
I have thought about that. That will require changing the active_mm of
the current task to point to init_mm, for example. Since the TLB flush is
done in interrupt context, proper coordination between interrupt and
process context will require atomic instructions, which will defeat the
purpose.
Cheers,
Longman
On Wed, 2019-07-31 at 10:15 -0400, Waiman Long wrote:
> On 7/31/19 9:48 AM, Rik van Riel wrote:
> > On Tue, 2019-07-30 at 17:01 -0400, Waiman Long wrote:
> > > On 7/29/19 8:26 PM, Rik van Riel wrote:
> > > > On Mon, 2019-07-29 at 17:42 -0400, Waiman Long wrote:
> > > >
> > > > > What I have found is that a long running process on a mostly idle
> > > > > system with many CPUs is likely to cycle through a lot of the CPUs
> > > > > during its lifetime and leave behind its mm in the active_mm of
> > > > > those CPUs. My 2-socket test system has 96 logical CPUs. After
> > > > > running the test program for a minute or so, it leaves behind its
> > > > > mm in about half of the CPUs, with a mm_count of 45 after exit. So
> > > > > the dying mm will stay until all those 45 CPUs get new user tasks
> > > > > to run.
> > > > OK. On what kernel are you seeing this?
> > > >
> > > > On current upstream, the code in native_flush_tlb_others()
> > > > will send a TLB flush to every CPU in mm_cpumask() if page
> > > > table pages have been freed.
> > > >
> > > > That should cause the lazy TLB CPUs to switch to init_mm
> > > > when the exit->zap_page_range path gets to the point where
> > > > it frees page tables.
> > > >
> > > I was using the latest upstream 5.3-rc2 kernel. It may be the case
> > > that the mm has been switched, but the mm_count field of the active_mm
> > > of the kthread is not being decremented until a user task runs on a
> > > CPU.
> > Is that something we could fix from the TLB flushing
> > code?
> >
> > When switching to init_mm, drop the refcount on the
> > lazy mm?
> >
> > That way that overhead is not added to the context
> > switching code.
>
> I have thought about that. That will require changing the active_mm of
> the current task to point to init_mm, for example. Since the TLB flush is
> done in interrupt context, proper coordination between interrupt and
> process context will require atomic instructions, which will defeat the
> purpose.
Would it be possible to work around that by scheduling
a work item that drops the active_mm?
After all, a work item runs in a kernel thread, so by the time the
work item runs, that kernel thread will either still have the mm you
want to get rid of as its active_mm, or the CPU will have gotten rid
of it already.
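Very roughly, something like this (hypothetical sketch; the names are
made up, it would need to be per-CPU, and the races are not thought
through):

static void drop_lazy_mm_fn(struct work_struct *work)
{
	struct mm_struct *mm = current->active_mm;

	/*
	 * We run in a kworker, so active_mm is only a lazy reference
	 * and the shootdown path has already pointed the hardware at
	 * init_mm; all that is left is the accounting.
	 */
	if (mm != &init_mm) {
		current->active_mm = &init_mm;
		mmgrab(&init_mm);
		mmdrop(mm);
	}
}

static DECLARE_WORK(drop_lazy_mm_work, drop_lazy_mm_fn);

/* queued from the TLB shootdown path after switching to init_mm */
schedule_work_on(smp_processor_id(), &drop_lazy_mm_work);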
--
All Rights Reversed.