2014-01-15 09:27:40

by Viresh Kumar

Subject: [QUERY]: Is using CPU hotplug right for isolating CPUs?

Hi Again,

I have now succeeded in isolating a CPU completely using CPUsets,
NO_HZ_FULL and CPU hotplug.

My setup and requirements for those who weren't following the
earlier mails:

On networking machines it is required to run data-plane threads on
some CPUs (i.e. one thread per CPU), and these CPUs shouldn't be
interrupted by the kernel at all.

Earlier I tried CPUsets with NO_HZ_FULL by creating two groups with
load balancing disabled between them and manually moving all tasks
into the CPU0 group. But even then interruptions kept arriving on
CPU1 (the CPU I am trying to isolate): some workqueue events, some
timers (like prandom), and timer overflow events (NO_HZ_FULL pushes
the hrtimer far ahead into the future, 450 seconds, rather than
disabling it completely, and the hardware timer on my Samsung Exynos
board overflows its counter after about 90 seconds).

So after creating the CPUsets I hotunplugged CPU1 and added it back
immediately. This moved all of these interruptions away, and CPU1 now
runs my single thread ("stress") forever.
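
For reference, a minimal userspace sketch of the procedure described above:
disable load balancing, put CPU1 into its own cpuset, then bounce CPU1 through
hotplug so pending timers and works are migrated away. The /dev/cpuset mount
point, the "iso" group name and the "cpuset."-prefixed file names are
assumptions (they depend on how the cpuset filesystem is mounted); only the
cpu1/online path is the standard sysfs location.

/*
 * Sketch only: cpuset setup plus the hotplug bounce described above.
 * Paths and names are assumptions, error handling is minimal.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Stop the scheduler from balancing across the top-level sets. */
        write_str("/dev/cpuset/cpuset.sched_load_balance", "0");

        /* An isolated set holding only CPU1, with balancing disabled. */
        mkdir("/dev/cpuset/iso", 0755);
        write_str("/dev/cpuset/iso/cpuset.cpus", "1");
        write_str("/dev/cpuset/iso/cpuset.mems", "0");
        write_str("/dev/cpuset/iso/cpuset.sched_load_balance", "0");

        /* Hotunplug CPU1 and bring it back: pending timers and works get
         * migrated to the remaining CPUs and are not queued back here. */
        write_str("/sys/devices/system/cpu/cpu1/online", "0");
        write_str("/sys/devices/system/cpu/cpu1/online", "1");

        return 0;
}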

Now my question is: is there anything particularly wrong with using
hotplug here? Will that lead to a disaster? :)

Thanks in Advance.

--
viresh


2014-01-15 10:38:32

by Peter Zijlstra

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Wed, Jan 15, 2014 at 02:57:36PM +0530, Viresh Kumar wrote:
> Hi Again,
>
> I am now successful in isolating a CPU completely using CPUsets,
> NO_HZ_FULL and CPU hotplug..
>
> My setup and requirements for those who weren't following the
> earlier mails:
>
> For networking machines it is required to run data plane threads on
> some CPUs (i.e. one thread per CPU) and these CPUs shouldn't be
> interrupted by kernel at all.
>
> Earlier I tried CPUSets with NO_HZ by creating two groups with
> load_balancing disabled between them and manually tried to move
> all tasks out to CPU0 group. But even then there were interruptions
> which were continuously coming on CPU1 (which I am trying to
> isolate). These were some workqueue events, some timers (like
> prandom), timer overflow events (As NO_HZ_FULL pushes hrtimer
> to long ahead in future, 450 seconds, rather than disabling them
> completely, and these hardware timers were overflowing their
> counters after 90 seconds on Samsung Exynos board).
>
> So after creating CPUsets I hotunplugged CPU1 and added it back
> immediately. This moved all these interruptions away and now
> CPU1 is running my single thread ("stress") for ever.
>
> Now my question is: Is there anything particularly wrong about using
> hotplugging here ? Will that lead to a disaster :)

Nah, it's just ugly and we should fix it. You need to be careful not to
place tasks in a cpuset you're going to unplug though; that'll give
funny results.

2014-01-15 10:47:29

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 15 January 2014 16:08, Peter Zijlstra <[email protected]> wrote:
> Nah, its just ugly and we should fix it. You need to be careful to not
> place tasks in a cpuset you're going to unplug though, that'll give
> funny results.

Okay. So how do you suggest we get rid of cases like a work item that is
queued on CPU1 initially and, because it gets queued again from its own
work handler, stays on the same CPU forever?

And then there are the timer overflow events that occur because the
tick-sched code starts an hrtimer 450 seconds later in time.

--
viresh

2014-01-15 11:34:28

by Peter Zijlstra

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Wed, Jan 15, 2014 at 04:17:26PM +0530, Viresh Kumar wrote:
> On 15 January 2014 16:08, Peter Zijlstra <[email protected]> wrote:
> > Nah, its just ugly and we should fix it. You need to be careful to not
> > place tasks in a cpuset you're going to unplug though, that'll give
> > funny results.
>
> Okay. So how do you suggest to get rid of cases like a work queued
> on CPU1 initially and because it gets queued again from its work handler,
> it stays on the same CPU forever.

We should have a cpuset.quiesce control or something that moves all
timers out.

> And then there were timer overflow events that occur because hrtimer
> is started by tick-sched stuff for 450 seconds later in time.

-ENOPARSE

2014-01-15 17:17:12

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Wed, Jan 15, 2014 at 02:57:36PM +0530, Viresh Kumar wrote:
> Hi Again,
>
> I am now successful in isolating a CPU completely using CPUsets,
> NO_HZ_FULL and CPU hotplug..
>
> My setup and requirements for those who weren't following the
> earlier mails:
>
> For networking machines it is required to run data plane threads on
> some CPUs (i.e. one thread per CPU) and these CPUs shouldn't be
> interrupted by kernel at all.
>
> Earlier I tried CPUSets with NO_HZ by creating two groups with
> load_balancing disabled between them and manually tried to move
> all tasks out to CPU0 group. But even then there were interruptions
> which were continuously coming on CPU1 (which I am trying to
> isolate). These were some workqueue events, some timers (like
> prandom), timer overflow events (As NO_HZ_FULL pushes hrtimer
> to long ahead in future, 450 seconds, rather than disabling them
> completely, and these hardware timers were overflowing their
> counters after 90 seconds on Samsung Exynos board).

Are you sure about that? NO_HZ_FULL shouldn't touch hrtimers much;
those are independent of the tick.

Although some of them seem to rely on the softirq, that seems to
concern the tick hrtimer only.

2014-01-16 09:46:39

by Thomas Gleixner

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Thu, 16 Jan 2014, Viresh Kumar wrote:

> On 15 January 2014 22:47, Frederic Weisbecker <[email protected]> wrote:
> > Are you sure about that? NO_HZ_FULL shouldn't touch much hrtimers.
> > Those are independant from the tick.
> >
> > Although some of them seem to rely on the softirq, but that seem to
> > concern the tick hrtimer only.
>
> To make it clear, I was talking about the hrtimer used by tick_sched_timer.
> I cross-checked which timers are active on the isolated CPU via
> /proc/timer_list and it showed only tick_sched_timer's hrtimer.
>
> In the attached trace (dft.txt), see these locations:
> - Line 252: Time 302.573881: we scheduled the hrtimer for 300 seconds
> ahead of the current time.
> - Lines 254, 258, 262, 330, 334: We got interruptions continuously after
> ~90 seconds and this looked to be a case of the timer's counter overflowing.
> Isn't it? (I have removed some lines towards the end of this file to make
> it shorter, though dft.dat is untouched)

Just do the math.

max reload value / timer freq = max time span

So:

0x7fffffff / 24MHz = 89.478485 sec

Nothing to do here except to get rid of the requirement to arm the
timer at all.
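
As a sanity check, the arithmetic above can be reproduced with a few lines of
C; the 0x7fffffff reload value and the 24 MHz clock are taken directly from
the numbers quoted above.

/* Reproduce the max-time-span arithmetic for a 31-bit counter at 24 MHz. */
#include <stdio.h>

int main(void)
{
        const unsigned long max_reload = 0x7fffffff;  /* max reload value */
        const double timer_freq = 24e6;               /* 24 MHz */

        /* max reload value / timer freq = max time span (~89.478485 sec) */
        printf("max time span = %.6f sec\n", max_reload / timer_freq);
        return 0;
}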

Thanks,

tglx

2014-01-20 11:30:24

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 16 January 2014 15:16, Thomas Gleixner <[email protected]> wrote:
> Just do the math.
>
> max reload value / timer freq = max time span

Thanks.

> So:
>
> 0x7fffffff / 24MHz = 89.478485 sec
>
> Nothing to do here except to get rid of the requirement to arm the
> timer at all.

@Frederic: Any inputs on how to get rid of this timer here?

2014-01-20 13:59:49

by Lei Wen

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

Hi Viresh,

On Wed, Jan 15, 2014 at 5:27 PM, Viresh Kumar <[email protected]> wrote:
> Hi Again,
>
> I am now successful in isolating a CPU completely using CPUsets,
> NO_HZ_FULL and CPU hotplug..
>
> My setup and requirements for those who weren't following the
> earlier mails:
>
> For networking machines it is required to run data plane threads on
> some CPUs (i.e. one thread per CPU) and these CPUs shouldn't be
> interrupted by kernel at all.
>
> Earlier I tried CPUSets with NO_HZ by creating two groups with
> load_balancing disabled between them and manually tried to move
> all tasks out to CPU0 group. But even then there were interruptions
> which were continuously coming on CPU1 (which I am trying to
> isolate). These were some workqueue events, some timers (like
> prandom), timer overflow events (As NO_HZ_FULL pushes hrtimer
> to long ahead in future, 450 seconds, rather than disabling them
> completely, and these hardware timers were overflowing their
> counters after 90 seconds on Samsung Exynos board).
>
> So after creating CPUsets I hotunplugged CPU1 and added it back
> immediately. This moved all these interruptions away and now
> CPU1 is running my single thread ("stress") for ever.

I have one question regarding unbound workqueue migration in your case.
You use hotplug to migrate the unbound work to other CPUs, but its CPU mask
would still be 0xf, since it cannot be changed by cpuset.

My question is: how can you prevent this unbound work from migrating back
to your isolated CPU?
It seems to me that there is no such mechanism in the kernel; am I
understanding it wrong?

Thanks,
Lei

>
> Now my question is: Is there anything particularly wrong about using
> hotplugging here ? Will that lead to a disaster :)
>
> Thanks in Advance.
>
> --
> viresh

2014-01-20 15:00:14

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 20 January 2014 19:29, Lei Wen <[email protected]> wrote:
> Hi Viresh,

Hi Lei,

> I have one question regarding unbounded workqueue migration in your case.
> You use hotplug to migrate the unbounded work to other cpus, but its cpu mask
> would still be 0xf, since cannot be changed by cpuset.
>
> My question is how you could prevent this unbounded work migrate back
> to your isolated cpu?
> Seems to me there is no such mechanism in kernel, am I understand wrong?

These works are normally queued back from their own work handler, and we
normally queue them on the local CPU; that's the default behavior of the
workqueue subsystem. And so they end up on the same CPU again and again.
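
As an illustration of the pattern described here, a small kernel-module
sketch (not from the thread, just an assumed example) of a work item that
re-arms itself from its own handler; queued on a bound (per-CPU) workqueue
with no explicit CPU, it keeps coming back on whichever CPU last ran it.

/* Sketch: a self-requeueing delayed work on the bound system workqueue. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static void sticky_work_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(sticky_work, sticky_work_fn);

static void sticky_work_fn(struct work_struct *work)
{
        pr_info("sticky work ran on CPU%d\n", raw_smp_processor_id());
        /* Re-queue with no CPU specified: lands on the local CPU again. */
        schedule_delayed_work(&sticky_work, HZ);
}

static int __init sticky_init(void)
{
        schedule_delayed_work(&sticky_work, HZ);
        return 0;
}

static void __exit sticky_exit(void)
{
        cancel_delayed_work_sync(&sticky_work);
}

module_init(sticky_init);
module_exit(sticky_exit);
MODULE_LICENSE("GPL");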

2014-01-20 15:41:13

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
> On 20 January 2014 19:29, Lei Wen <[email protected]> wrote:
> > Hi Viresh,
>
> Hi Lei,
>
> > I have one question regarding unbounded workqueue migration in your case.
> > You use hotplug to migrate the unbounded work to other cpus, but its cpu mask
> > would still be 0xf, since cannot be changed by cpuset.
> >
> > My question is how you could prevent this unbounded work migrate back
> > to your isolated cpu?
> > Seems to me there is no such mechanism in kernel, am I understand wrong?
>
> These workqueues are normally queued back from workqueue handler. And we
> normally queue them on the local cpu, that's the default behavior of workqueue
> subsystem. And so they land up on the same CPU again and again.

But for workqueues having a global affinity, I think they can be rescheduled later
on the old CPUs. Although I'm not sure about that, so I'm Cc'ing Tejun.

Also, one of the plans is to extend the sysfs interface of workqueues to override
their affinity. If any of you guys want to try something there, that would be welcome.
We also want to work on timer affinity. Perhaps we don't need a user interface
for that, or maybe something on top of full dynticks to express that we want the
unbound timers to run on housekeeping CPUs only.

2014-01-20 15:51:56

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Mon, Jan 20, 2014 at 05:00:20PM +0530, Viresh Kumar wrote:
> On 16 January 2014 15:16, Thomas Gleixner <[email protected]> wrote:
> > Just do the math.
> >
> > max reload value / timer freq = max time span
>
> Thanks.
>
> > So:
> >
> > 0x7fffffff / 24MHz = 89.478485 sec
> >
> > Nothing to do here except to get rid of the requirement to arm the
> > timer at all.
>
> @Frederic: Any inputs on how to get rid of this timer here?

I fear you can't. If you schedule a timer 4 seconds away and your clock device
can only count up to 2 seconds, you can't avoid the interrupt in the middle that
copes with the overflow.

So you need to act on the source of the timer:

* identify what causes this timer
* try to turn that feature off
* if you can't, then move the timer to the housekeeping CPU

I'll have a look into the latter point, affining global timers to the
housekeeping CPU. Per-CPU timers need more inspection though: either we rework
them so they can be handled by remote/housekeeping CPUs, or we allow the
associated feature to be turned off. All in all it's case-by-case work.
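
To illustrate the last bullet above, a hedged kernel-style sketch (using the
timer_list API as it existed around the 3.x kernels this thread refers to) of
pinning a periodic timer to a chosen housekeeping CPU; the choice of CPU0 as
the housekeeping CPU is purely an assumption.

/* Sketch: keep a periodic timer on a designated housekeeping CPU. */
#include <linux/module.h>
#include <linux/timer.h>
#include <linux/jiffies.h>

#define HOUSEKEEPING_CPU 0   /* assumption: CPU0 does the housekeeping */

static struct timer_list hk_timer;

static void hk_timer_fn(unsigned long data)
{
        /* ... do the periodic bookkeeping work here ... */

        /* Re-arm explicitly on the housekeeping CPU, not the local one. */
        hk_timer.expires = jiffies + HZ;
        add_timer_on(&hk_timer, HOUSEKEEPING_CPU);
}

static int __init hk_init(void)
{
        setup_timer(&hk_timer, hk_timer_fn, 0);
        hk_timer.expires = jiffies + HZ;
        add_timer_on(&hk_timer, HOUSEKEEPING_CPU);
        return 0;
}

static void __exit hk_exit(void)
{
        del_timer_sync(&hk_timer);
}

module_init(hk_init);
module_exit(hk_exit);
MODULE_LICENSE("GPL");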

2014-01-21 02:08:06

by Lei Wen

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Mon, Jan 20, 2014 at 11:41 PM, Frederic Weisbecker
<[email protected]> wrote:
> On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
>> On 20 January 2014 19:29, Lei Wen <[email protected]> wrote:
>> > Hi Viresh,
>>
>> Hi Lei,
>>
>> > I have one question regarding unbounded workqueue migration in your case.
>> > You use hotplug to migrate the unbounded work to other cpus, but its cpu mask
>> > would still be 0xf, since cannot be changed by cpuset.
>> >
>> > My question is how you could prevent this unbounded work migrate back
>> > to your isolated cpu?
>> > Seems to me there is no such mechanism in kernel, am I understand wrong?
>>
>> These workqueues are normally queued back from workqueue handler. And we
>> normally queue them on the local cpu, that's the default behavior of workqueue
>> subsystem. And so they land up on the same CPU again and again.
>
> But for workqueues having a global affinity, I think they can be rescheduled later
> on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.

Agreed; since the worker thread is allowed to run on all CPUs, it cannot
prevent the scheduler from doing the migration.

But here is one point: I see Viresh already set up two cpusets with scheduler
load balancing disabled between them, so shouldn't that stop task migration
between those two groups, since the sched_domains changed?

What is more, I also did a similar test and found that when I set up two such
cpuset groups, like cores 0-2 in cpuset1 and core 3 in cpuset2, and then
hotunplug core 3, the cpuset's cpus member becomes NULL even after I hotplug
core 3 back again.
So is it a bug?

Thanks,
Lei

>
> Also, one of the plan is to extend the sysfs interface of workqueues to override
> their affinity. If any of you guys want to try something there, that would be welcome.
> Also we want to work on the timer affinity. Perhaps we don't need a user interface
> for that, or maybe something on top of full dynticks to outline that we want the unbound
> timers to run on housekeeping CPUs only.

2014-01-21 09:49:41

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 20 January 2014 21:11, Frederic Weisbecker <[email protected]> wrote:
> But for workqueues having a global affinity, I think they can be rescheduled later
> on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.

Works queued on workqueues with the WQ_UNBOUND flag set can run on any CPU,
as decided by the scheduler, whereas works queued on workqueues without this
flag, and without a CPU number specified while queuing the work, always run
on the local CPU.
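
A brief kernel-style sketch (assumed, not from the thread) contrasting the two
cases: work queued on a WQ_UNBOUND workqueue versus work queued on the per-CPU
system workqueue without an explicit CPU.

/* Sketch: unbound vs. bound work queuing. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/smp.h>

static struct workqueue_struct *unbound_wq;

static void demo_work_fn(struct work_struct *work)
{
        pr_info("work ran on CPU%d\n", raw_smp_processor_id());
}

static DECLARE_WORK(unbound_work, demo_work_fn);
static DECLARE_WORK(bound_work, demo_work_fn);

static int __init wq_demo_init(void)
{
        unbound_wq = alloc_workqueue("wq_demo_unbound", WQ_UNBOUND, 0);
        if (!unbound_wq)
                return -ENOMEM;

        /* WQ_UNBOUND: the scheduler decides which CPU runs this. */
        queue_work(unbound_wq, &unbound_work);

        /* system_wq is per-CPU: this runs on the CPU that queued it. */
        queue_work(system_wq, &bound_work);

        return 0;
}

static void __exit wq_demo_exit(void)
{
        flush_work(&unbound_work);
        flush_work(&bound_work);
        destroy_workqueue(unbound_wq);
}

module_init(wq_demo_init);
module_exit(wq_demo_exit);
MODULE_LICENSE("GPL");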

> Also, one of the plan is to extend the sysfs interface of workqueues to override
> their affinity. If any of you guys want to try something there, that would be welcome.
> Also we want to work on the timer affinity. Perhaps we don't need a user interface
> for that, or maybe something on top of full dynticks to outline that we want the unbound
> timers to run on housekeeping CPUs only.

What about a quiesce option, as mentioned by PeterZ? With that we can move
all UNBOUND timers and workqueues away. But to guarantee that they don't get
queued again later, we need to make similar updates in the workqueue/timer
subsystems to disallow queuing any such stuff on such cpusets.

2014-01-21 09:50:32

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 21 January 2014 07:37, Lei Wen <[email protected]> wrote:
> What is more, I also did similiar test, and find when I set two such
> cpuset group,
> like core 0-2 to cpuset1, core 3 to cpuset2, while hotunplug the core3
> afterwise.
> I find the cpuset's cpus member becomes NULL even I hotplug the core3
> back again.
> So is it a bug?

I confirm the same :)

2014-01-21 10:34:01

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 20 January 2014 21:21, Frederic Weisbecker <[email protected]> wrote:
> I fear you can't. If you schedule a timer in 4 seconds away and your clockdevice
> can only count up to 2 seconds, you can't help much the interrupt in the middle to
> cope with the overflow.
>
> So you need to act on the source of the timer:
>
> * identify what cause this timer
> * try to turn that feature off
> * if you can't then move the timer to the housekeeping CPU

So, the main problem in my case was caused by this:

<...>-2147 [001] d..2 302.573881: hrtimer_start:
hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
softexpires=602075000000

I mentioned this earlier when I sent you the attachments. I think this is
somehow tied to the NO_HZ_FULL stuff, as the timer is queued for 300 seconds
after the current time.

How do I get rid of this?

> I'll have a look into the latter point to affine global timers to the
> housekeeping CPU. Per cpu timers need more inspection though. Either we rework
> them to be possibly handled by remote/housekeeping CPUs, or we let the associate feature
> to be turned off. All in one it's a case by case work.

Which CPUs are housekeeping CPUs? How do we declare them?

2014-01-23 13:54:37

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Tue, Jan 21, 2014 at 10:07:58AM +0800, Lei Wen wrote:
> On Mon, Jan 20, 2014 at 11:41 PM, Frederic Weisbecker
> <[email protected]> wrote:
> > On Mon, Jan 20, 2014 at 08:30:10PM +0530, Viresh Kumar wrote:
> >> On 20 January 2014 19:29, Lei Wen <[email protected]> wrote:
> >> > Hi Viresh,
> >>
> >> Hi Lei,
> >>
> >> > I have one question regarding unbounded workqueue migration in your case.
> >> > You use hotplug to migrate the unbounded work to other cpus, but its cpu mask
> >> > would still be 0xf, since cannot be changed by cpuset.
> >> >
> >> > My question is how you could prevent this unbounded work migrate back
> >> > to your isolated cpu?
> >> > Seems to me there is no such mechanism in kernel, am I understand wrong?
> >>
> >> These workqueues are normally queued back from workqueue handler. And we
> >> normally queue them on the local cpu, that's the default behavior of workqueue
> >> subsystem. And so they land up on the same CPU again and again.
> >
> > But for workqueues having a global affinity, I think they can be rescheduled later
> > on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.
>
> Agree, since worker thread is made as enterring into all cpus, it
> cannot prevent scheduler
> do the migration.
>
> But here is one point, that I see Viresh alredy set up two cpuset with
> scheduler load balance
> disabled, so it should stop the task migration between those two groups? Since
> the sched_domain changed?
>
> What is more, I also did similiar test, and find when I set two such
> cpuset group,
> like core 0-2 to cpuset1, core 3 to cpuset2, while hotunplug the core3
> afterwise.
> I find the cpuset's cpus member becomes NULL even I hotplug the core3
> back again.
> So is it a bug?

Not sure, you may need to check cpuset internals.

2014-01-23 14:01:24

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Tue, Jan 21, 2014 at 03:19:36PM +0530, Viresh Kumar wrote:
> On 20 January 2014 21:11, Frederic Weisbecker <[email protected]> wrote:
> > But for workqueues having a global affinity, I think they can be rescheduled later
> > on the old CPUs. Although I'm not sure about that, I'm Cc'ing Tejun.
>
> Works queued on workqueues with WQ_UNBOUND flag set are run on any cpu
> and is decided by scheduler, whereas works queued on workqueues with this
> flag not set and without a cpu number mentioned while queuing work,
> runs on local
> CPU always.

Ok, so it is fine to migrate the latter kind I guess?

>
> > Also, one of the plan is to extend the sysfs interface of workqueues to override
> > their affinity. If any of you guys want to try something there, that would be welcome.
> > Also we want to work on the timer affinity. Perhaps we don't need a user interface
> > for that, or maybe something on top of full dynticks to outline that we want the unbound
> > timers to run on housekeeping CPUs only.
>
> What about a quiesce option as mentioned by PeterZ? With that we can move
> all UNBOUND timers and workqueues away. But to guarantee that we don't get
> them queued again later we need to make similar updates in workqueue/timer
> subsystem to disallow queuing any such stuff on such cpusets.

I haven't checked the details, but this quiesce option would involve
a dependency on cpuset for any workload involving workqueue affinity. I'm
not sure we really want that. Besides, workqueues have an existing sysfs interface
that can easily be extended.

Now indeed we may also want to enforce some policy to make sure that further
created and queued workqueues are affine to a specific subset of CPUs. And then
cpuset sounds like a good idea :)

2014-01-23 14:27:46

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 23 January 2014 19:24, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jan 21, 2014 at 10:07:58AM +0800, Lei Wen wrote:
>> I find the cpuset's cpus member becomes NULL even I hotplug the core3
>> back again.
>> So is it a bug?
>
> Not sure, you may need to check cpuset internals.

I think this is the correct behavior. Userspace must decide what to do
with that CPU once it is back. Simply reverting to the earlier cpuset
configuration might not be the right approach.

Also, what if the cpusets have been reconfigured in between the hotplug events?

2014-01-23 14:58:49

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
> On 20 January 2014 21:21, Frederic Weisbecker <[email protected]> wrote:
> > I fear you can't. If you schedule a timer in 4 seconds away and your clockdevice
> > can only count up to 2 seconds, you can't help much the interrupt in the middle to
> > cope with the overflow.
> >
> > So you need to act on the source of the timer:
> >
> > * identify what cause this timer
> > * try to turn that feature off
> > * if you can't then move the timer to the housekeeping CPU
>
> So, the main problem in my case was caused by this:
>
> <...>-2147 [001] d..2 302.573881: hrtimer_start:
> hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
> softexpires=602075000000
>
> I have mentioned this earlier when I sent you attachments. I think
> this is somehow
> tied with the NO_HZ_FULL stuff? As the timer is queued for 300 seconds after
> current time.
>
> How to get this out?

So it's scheduled 300 seconds into the future. It might be a pending timer_list
timer. Enabling the timer tracepoints may give you some clues.

>
> > I'll have a look into the latter point to affine global timers to the
> > housekeeping CPU. Per cpu timers need more inspection though. Either we rework
> > them to be possibly handled by remote/housekeeping CPUs, or we let the associate feature
> > to be turned off. All in one it's a case by case work.
>
> Which CPUs are housekeeping CPUs? How do we declare them?

It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to
define some general policy on various periodic/async work affinity to enforce isolation.

The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping
CPU. Given that this CPU must keep a periodic tick, let's move all the unbound timers and
workqueues there, and also try to move some CPU-affine work as well. For example
we could run the scheduler tick of the full dynticks CPUs on that housekeeping
CPU, at a low frequency. This way we could remove that 1 second scheduler tick max deferment
per CPU. It may be overkill though to run all the scheduler ticks on a single CPU, so there
may be other ways to cope with that.

And I would like to keep that housekeeping notion flexible enough to be extendable to more
than one CPU, as I heard that some people plan to reserve one CPU per node on big
NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.

Of course, if some people help contribute in this area, some things may eventually move forward
on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the
things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection,
apply the AMD range breakpoints patches, further clean up posix cpu timers, etc...).

Thanks.

2014-01-24 05:21:16

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 23 January 2014 20:28, Frederic Weisbecker <[email protected]> wrote:
> On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:

>> So, the main problem in my case was caused by this:
>>
>> <...>-2147 [001] d..2 302.573881: hrtimer_start:
>> hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
>> softexpires=602075000000
>>
>> I have mentioned this earlier when I sent you attachments. I think
>> this is somehow
>> tied with the NO_HZ_FULL stuff? As the timer is queued for 300 seconds after
>> current time.
>>
>> How to get this out?
>
> So it's scheduled away 300 seconds later. It might be a pending timer_list. Enabling the
> timer tracepoints may give you some clues.

The trace was done with that enabled. /proc/timer_list confirms that an hrtimer
is queued 300 seconds ahead for tick_sched_timer, and so I assumed
this is part of the current NO_HZ_FULL implementation.

Just to confirm: when we decide that a CPU is running a single task and so
can enter tickless mode, do we queue this tick_sched_timer for 300 seconds
ahead of time? If not, then who is doing this? :)

>> Which CPUs are housekeeping CPUs? How do we declare them?
>
> It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to
> define some general policy on various periodic/async work affinity to enforce isolation.
>
> The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping
> CPU. Given that CPU must keep a periodic tick, lets move all the unbound timers and
> workqueues there. And also try to move some CPU affine work as well. For example
> we could handle the scheduler tick of the full dynticks CPUs into that housekeeping
> CPU, at a low freqency. This way we could remove that 1 second scheduler tick max deferment
> per CPU. It may be an overkill though to run all the scheduler ticks on a single CPU so there
> may be other ways to cope with that.
>
> And I would like to keep that housekeeping notion flexible enough to be extendable on more
> than one CPU, as I heard that some people plan to reserve one CPU per node on big
> NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
>
> Of course, if some people help contributing in this area, some things may eventually move foward
> on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the
> things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection,
> apply AMD range breakpoints patches, further cleanup posix cpu timers, etc...).

I see. As I am currently working on the isolation stuff, which is very much
required for my use case, I will try to do that as the second step of my work.
The first step stays something like the cpuset.quiesce option that PeterZ
suggested.

Any pointers to earlier discussions on this topic would be helpful for
starting work on this.

2014-01-24 08:29:33

by Mike Galbraith

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Fri, 2014-01-24 at 10:51 +0530, Viresh Kumar wrote:
> On 23 January 2014 20:28, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
>
> >> So, the main problem in my case was caused by this:
> >>
> >> <...>-2147 [001] d..2 302.573881: hrtimer_start:
> >> hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
> >> softexpires=602075000000
> >>
> >> I have mentioned this earlier when I sent you attachments. I think
> >> this is somehow
> >> tied with the NO_HZ_FULL stuff? As the timer is queued for 300 seconds after
> >> current time.
> >>
> >> How to get this out?
> >
> > So it's scheduled away 300 seconds later. It might be a pending timer_list. Enabling the
> > timer tracepoints may give you some clues.
>
> Trace was done with that enabled. /proc/timer_list confirms that a hrtimer
> is queued for 300 seconds later for tick_sched_timer. And so I assumed
> this is part of the current NO_HZ_FULL implementation.
>
> Just to confirm, when we decide that a CPU is running a single task and so
> can enter tickless mode, do we queue this tick_sched_timer for 300 seconds
> ahead of time? If not, then who is doing this :)
>
> >> Which CPUs are housekeeping CPUs? How do we declare them?
> >
> > It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to
> > define some general policy on various periodic/async work affinity to enforce isolation.
> >
> > The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping
> > CPU. Given that CPU must keep a periodic tick, lets move all the unbound timers and
> > workqueues there. And also try to move some CPU affine work as well. For example
> > we could handle the scheduler tick of the full dynticks CPUs into that housekeeping
> > CPU, at a low freqency. This way we could remove that 1 second scheduler tick max deferment
> > per CPU. It may be an overkill though to run all the scheduler ticks on a single CPU so there
> > may be other ways to cope with that.
> >
> > And I would like to keep that housekeeping notion flexible enough to be extendable on more
> > than one CPU, as I heard that some people plan to reserve one CPU per node on big
> > NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
> >
> > Of course, if some people help contributing in this area, some things may eventually move foward
> > on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the
> > things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection,
> > apply AMD range breakpoints patches, further cleanup posix cpu timers, etc...).
>
> I see. As I am currently working on the isolation stuff which is very
> much required
> for my usecase, I will try to do that as the second step of my work.
> The first one
> stays something like a cpuset.quiesce option that PeterZ suggested.
>
> Any pointers of earlier discussion on this topic would be helpful to
> start working on
> this..

All of that nohz_full stuff would be a lot more usable if it were
dynamic via cpusets. As the thing sits, if you need a small group of
tickless cores once in a while, you have to eat a truckload of overhead
and a zillion threads all the time. The price is high.

I have a little hack for my -rt kernel that allows the user to turn the
tick on/off (and cpupri) on a per fully isolated set basis, because
jitter is lower with the tick than with nohz doing its thing. With
that, you can set up whatever portion of box to meet your needs on the
fly. When you need very low jitter, turn all load balancing off in your
critical set, turn nohz off, turn rt load balancing off, and 80 core
boxen become usable for cool zillion dollar realtime video games.. box
becomes a militarized playstation.

Doing the same with nohz_full would be a _lot_ harder (my hacks are
trivial), but would be a lot more attractive to users than always eating
the high nohz_full cost whether using it or not. Poke buttons, threads
are born or die, patch in/out expensive accounting goop and whatnot,
play evil high-speed stock market bandit, or whatever else, at the poke
of a couple of buttons.

-Mike

2014-01-24 08:53:18

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 23 January 2014 19:31, Frederic Weisbecker <[email protected]> wrote:
> Ok, so it is fine to migrate the latter kind I guess?

Unless somebody has abused the API and used bound workqueues where he
should have used unbound ones.

> I haven't checked the details but then this quiesce option would involve
> a dependency on cpuset for any workload involving workqueues affinity. I'm
> not sure we really want this. Besides, workqueues have an existing sysfs interface
> that can be easily extended.
>
> Now indeed we may also want to enforce some policy to make sure that further
> created and queued workqueues are affine to a specific subset of CPUs. And then
> cpuset sounds like a good idea :)

Exactly. Cpusets would be more useful here. Probably we can keep both the
cpuset and the sysfs interface of workqueues.

I will try to add this option under cpuset; it will initially move timers and
workqueues away from the cpuset in question.

2014-01-28 13:23:16

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Fri, Jan 24, 2014 at 10:51:14AM +0530, Viresh Kumar wrote:
> On 23 January 2014 20:28, Frederic Weisbecker <[email protected]> wrote:
> > On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
>
> >> So, the main problem in my case was caused by this:
> >>
> >> <...>-2147 [001] d..2 302.573881: hrtimer_start:
> >> hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
> >> softexpires=602075000000
> >>
> >> I have mentioned this earlier when I sent you attachments. I think
> >> this is somehow
> >> tied with the NO_HZ_FULL stuff? As the timer is queued for 300 seconds after
> >> current time.
> >>
> >> How to get this out?
> >
> > So it's scheduled away 300 seconds later. It might be a pending timer_list. Enabling the
> > timer tracepoints may give you some clues.
>
> Trace was done with that enabled. /proc/timer_list confirms that a hrtimer
> is queued for 300 seconds later for tick_sched_timer. And so I assumed
> this is part of the current NO_HZ_FULL implementation.
>
> Just to confirm, when we decide that a CPU is running a single task and so
> can enter tickless mode, do we queue this tick_sched_timer for 300 seconds
> ahead of time? If not, then who is doing this :)

No, when a single task is running on a full dynticks CPU, the tick is supposed to run
every second. I'm actually surprised it doesn't happen in your traces; did you tweak
something specific?

The 300 seconds timer is probably due to a timer_list timer; just enable the
timer_start and timer_expire_entry events to get the name of the culprit.
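
For completeness, a small userspace sketch of that tracing step, assuming
ftrace is mounted at /sys/kernel/debug/tracing (the mount point may differ on
your setup).

/* Sketch: enable the timer_start/timer_expire_entry tracepoints. */
#include <stdio.h>
#include <stdlib.h>

#define TRACEFS "/sys/kernel/debug/tracing"

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        write_str(TRACEFS "/events/timer/timer_start/enable", "1");
        write_str(TRACEFS "/events/timer/timer_expire_entry/enable", "1");
        write_str(TRACEFS "/tracing_on", "1");

        /* The function= field in the resulting trace lines names the
         * timer_list callback responsible for the wakeup. */
        printf("now read %s/trace_pipe\n", TRACEFS);
        return 0;
}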

>
> >> Which CPUs are housekeeping CPUs? How do we declare them?
> >
> > It's not yet implemented, but it's an idea (partly from Thomas) of something we can do to
> > define some general policy on various periodic/async work affinity to enforce isolation.
> >
> > The basic idea is to define the CPU handling the timekeeping duty to be the housekeeping
> > CPU. Given that CPU must keep a periodic tick, lets move all the unbound timers and
> > workqueues there. And also try to move some CPU affine work as well. For example
> > we could handle the scheduler tick of the full dynticks CPUs into that housekeeping
> > CPU, at a low freqency. This way we could remove that 1 second scheduler tick max deferment
> > per CPU. It may be an overkill though to run all the scheduler ticks on a single CPU so there
> > may be other ways to cope with that.
> >
> > And I would like to keep that housekeeping notion flexible enough to be extendable on more
> > than one CPU, as I heard that some people plan to reserve one CPU per node on big
> > NUMA machines for such a purpose. So that could be a cpumask, augmented with an infrastructure.
> >
> > Of course, if some people help contributing in this area, some things may eventually move foward
> > on the support of CPU isolation. I can't do that all alone, at least not quickly, given all the
> > things already pending in my queue (fix buggy nohz iowait accounting, support RCU full sysidle detection,
> > apply AMD range breakpoints patches, further cleanup posix cpu timers, etc...).
>
> I see. As I am currently working on the isolation stuff which is very
> much required
> for my usecase, I will try to do that as the second step of my work.
> The first one
> stays something like a cpuset.quiesce option that PeterZ suggested.

Cool!

>
> Any pointers of earlier discussion on this topic would be helpful to
> start working on
> this..

I think that being able to control the UNBOUND workqueue affinity may be a nice
first step.

Thanks.

2014-01-28 16:11:28

by Kevin Hilman

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Tue, Jan 28, 2014 at 5:23 AM, Frederic Weisbecker <[email protected]> wrote:
> On Fri, Jan 24, 2014 at 10:51:14AM +0530, Viresh Kumar wrote:
>> On 23 January 2014 20:28, Frederic Weisbecker <[email protected]> wrote:
>> > On Tue, Jan 21, 2014 at 04:03:53PM +0530, Viresh Kumar wrote:
>>
>> >> So, the main problem in my case was caused by this:
>> >>
>> >> <...>-2147 [001] d..2 302.573881: hrtimer_start:
>> >> hrtimer=c172aa50 function=tick_sched_timer expires=602075000000
>> >> softexpires=602075000000
>> >>
>> >> I have mentioned this earlier when I sent you attachments. I think
>> >> this is somehow
>> >> tied with the NO_HZ_FULL stuff? As the timer is queued for 300 seconds after
>> >> current time.
>> >>
>> >> How to get this out?
>> >
>> > So it's scheduled away 300 seconds later. It might be a pending timer_list. Enabling the
>> > timer tracepoints may give you some clues.
>>
>> Trace was done with that enabled. /proc/timer_list confirms that a hrtimer
>> is queued for 300 seconds later for tick_sched_timer. And so I assumed
>> this is part of the current NO_HZ_FULL implementation.
>>
>> Just to confirm, when we decide that a CPU is running a single task and so
>> can enter tickless mode, do we queue this tick_sched_timer for 300 seconds
>> ahead of time? If not, then who is doing this :)
>
> No, when a single task is running on a full dynticks CPU, the tick is supposed to run
> every seconds. I'm actually suprised it doesn't happen in your traces, did you tweak
> something specific?

I think Viresh is using my patch/hack to configure/disable the 1Hz
residual tick.

Kevin

2014-02-03 08:26:19

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 28 January 2014 21:41, Kevin Hilman <[email protected]> wrote:
> I think Viresh is using my patch/hack to configure/disable the 1Hz
> residual tick.

Yeah. I am using sched_tick_max_deferment by setting it to -1. Why
do we need a timer every second for NO_HZ_FULL currently?

2014-02-11 08:52:45

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 28 January 2014 18:53, Frederic Weisbecker <[email protected]> wrote:
> No, when a single task is running on a full dynticks CPU, the tick is supposed to run
> every seconds. I'm actually suprised it doesn't happen in your traces, did you tweak
> something specific?

Why do we need this 1 second tick currently? And what will happen if I
hotunplug that CPU and bring it back? Would the tick's timer move away from
the CPU in question? I see that it does when I change this 1 sec deferment
to 300 seconds. But what would be the impact of that? Will things still
work normally?

2014-02-13 14:20:51

by Frederic Weisbecker

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On Tue, Feb 11, 2014 at 02:22:43PM +0530, Viresh Kumar wrote:
> On 28 January 2014 18:53, Frederic Weisbecker <[email protected]> wrote:
> > No, when a single task is running on a full dynticks CPU, the tick is supposed to run
> > every seconds. I'm actually suprised it doesn't happen in your traces, did you tweak
> > something specific?
>
> Why do we need this 1 second tick currently? And what will happen if I
> hotunplug that
> CPU and get it back? Would the timer for tick move away from CPU in
> question? I see
> that when I have changed this 1sec stuff to 300 seconds. But what
> would be impact
> of that? Will things still work normally?

So the problem resides in the gazillion accountings maintained in scheduler_tick() and
current->sched_class->task_tick().

Scheduler correctness depends on these being updated regularly. If you deactivate
the tick or increase the delay to very high values, the result is unpredictable. Just expect
that at least some scheduler features will behave randomly, like load balancing for example,
or simply local fairness.

So we have that 1 Hz max deferment that makes sure things keep moving forward while keeping
a rate that should still be fine for HPC workloads. But we certainly want to find a
way to remove the need for any tick altogether for extreme real-time workloads which
need guarantees rather than just optimizations.

I see two potential solutions for that:

1) Rework the scheduler accounting such that it is safe against full dynticks. That
was the initial plan but it's scary. The scheduler accounting is a huge maze, and I'm not
sure it's actually worth the complication.

2) Offload the accounting. For example we could imagine that the timekeeping CPU could handle
the task_tick() calls on behalf of the full dynticks CPUs, at a small rate like 1 Hz.

2014-02-28 09:04:50

by Viresh Kumar

Subject: Re: [QUERY]: Is using CPU hotplug right for isolating CPUs?

On 15 January 2014 17:04, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jan 15, 2014 at 04:17:26PM +0530, Viresh Kumar wrote:
>> On 15 January 2014 16:08, Peter Zijlstra <[email protected]> wrote:
>> > Nah, its just ugly and we should fix it. You need to be careful to not
>> > place tasks in a cpuset you're going to unplug though, that'll give
>> > funny results.
>>
>> Okay. So how do you suggest to get rid of cases like a work queued
>> on CPU1 initially and because it gets queued again from its work handler,
>> it stays on the same CPU forever.
>
> We should have a cpuset.quiesce control or something that moves all
> timers out.

What should we do here if we have a valid base->running_timer on the
CPU requesting the quiesce?