Folks!
After rereading the mail flood on CAT and staring into the SDM for a
while, I think we all should sit back and look at it from scratch
again w/o our preconceptions - I certainly had to put my own away.
Let's look at the properties of CAT again:
- It's a per socket facility
- CAT slots can be associated to external hardware. This
association is per socket as well, so different sockets can have
different behaviour. I missed that detail when staring at it the
first time, thanks for the pointer!
- The association itself is per cpu. The COS selection happens on a
CPU while the set of masks which are selected via COS are shared
by all CPUs on a socket.
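To illustrate that split, here is a minimal sketch at the MSR level (the
MSR numbers are the documented IA32_PQR_ASSOC and IA32_L3 mask block from
the SDM; the two helpers are made up for illustration and are not the
proposed implementation):

/* Illustration only, not the proposed kernel code. */
#include <linux/types.h>
#include <asm/msr.h>

#define MSR_IA32_PQR_ASSOC    0x0c8f  /* per CPU: selects the active COS id */
#define MSR_IA32_L3_CBM_BASE  0x0c90  /* per socket: CBM for COS id 0, 1, ...*/

/* Programming a mask is a per socket operation: it changes what the
   given COS id means for every CPU on that socket. */
static void set_l3_cbm(unsigned int cosid, u64 cbm)
{
        wrmsrl(MSR_IA32_L3_CBM_BASE + cosid, cbm);
}

/* Selecting the COS id is a per CPU operation: the id lives in the
   upper 32 bits of IA32_PQR_ASSOC of the current CPU. */
static void set_cosid(unsigned int cosid)
{
        u64 val;

        rdmsrl(MSR_IA32_PQR_ASSOC, val);
        wrmsrl(MSR_IA32_PQR_ASSOC, (val & 0xffffffffULL) | ((u64)cosid << 32));
}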
There are restrictions which CAT imposes in terms of configurability:
- The bits which select a cache partition need to be consecutive
- The number of possible cache association masks is limited
Let's look at the configurations (CDP omitted and size restricted)
Default: 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
Shared: 1 1 1 1 1 1 1 1
0 0 1 1 1 1 1 1
0 0 0 0 1 1 1 1
0 0 0 0 0 0 1 1
Isolated: 1 1 1 1 0 0 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
Or any combination thereof. Surely some combinations will not make any
sense, but we really should not make any restrictions on the stupidity
of a sysadmin. The worst outcome might be L3 disabled for everything,
so what?
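For reference, the same three example configurations written as raw CBM
values, assuming the leftmost column above is the most significant bit of
an 8 bit mask (purely a re-encoding of the tables, nothing new):

/* Illustration only: one entry per COS id. */
static const unsigned char cbm_default[4]  = { 0xff, 0xff, 0xff, 0xff };
static const unsigned char cbm_shared[4]   = { 0xff, 0x3f, 0x0f, 0x03 };
static const unsigned char cbm_isolated[4] = { 0xf0, 0x0c, 0x02, 0x01 };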
Now that gets even more convoluted if CDP comes into play and we
really need to look at CDP right now. We might end up with something
which looks like this:
1 1 1 1 0 0 0 0 Code
1 1 1 1 0 0 0 0 Data
0 0 0 0 0 0 1 0 Code
0 0 0 0 1 1 0 0 Data
0 0 0 0 0 0 0 1 Code
0 0 0 0 1 1 0 0 Data
or
0 0 0 0 0 0 0 1 Code
0 0 0 0 1 1 0 0 Data
0 0 0 0 0 0 0 1 Code
0 0 0 0 0 1 1 0 Data
Let's look at partitioning itself. We have two options:
1) Per task partitioning
2) Per CPU partitioning
So far we only talked about #1, but I think that #2 has value as
well. Let me give you a simple example.
Assume that you have isolated a CPU and run your important task on
it. You give that task a slice of cache. Now that task needs kernel
services which run in kernel threads on that CPU. We really don't want
to (and cannot) hunt down random kernel threads (think cpu bound
worker threads, softirq threads ....) and give them another slice of
cache. What we really want is:
1 1 1 1 0 0 0 0 <- Default cache
0 0 0 0 1 1 1 0 <- Cache for important task
0 0 0 0 0 0 0 1 <- Cache for CPU of important task
It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and not bother with tasks at all.
We really need to make this as configurable as possible from userspace
without imposing random restrictions on it. I played around with it on
my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning.
Even if next generation systems have more COS ids available,
there are not going to be enough to have a system wide consistent
view unless we have COS ids > nr_cpus.
Aside from that, I don't think that a system wide consistent view is
useful at all.
- If a task migrates between sockets, it's going to suffer anyway.
Real sensitive applications will simply pin tasks on a socket to
avoid that in the first place. If we make the whole thing
configurable enough then the sysadmin can set it up to support
even the nonsensical case of identical cache partitions on all
sockets and let tasks use the corresponding partitions when
migrating.
- The number of cache slices is going to be limited no matter what,
so one still has to come up with a sensible partitioning scheme.
- Even if we have enough COS ids, the system wide view will not make
the configuration problem any simpler, as it remains per socket.
It's hard. Policies are hard by definition, but this one is harder
than most other policies due to the inherent limitations.
So now to the interface part. Unfortunately we need to expose this
very close to the hardware implementation as there are really no
abstractions which allow us to express the various bitmap
combinations. Any abstraction I tried to come up with renders that
thing completely useless.
I was not able to identify any existing infrastructure where this
really fits in. I chose a directory/file based representation. We
certainly could do the same with a syscall, but that's just an
implementation detail.
At top level:
xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
xxxxxxx/cat/cdp_enable <- Depends on CDP availability
Per socket data:
xxxxxxx/cat/socket-0/
...
xxxxxxx/cat/socket-N/l3_size
xxxxxxx/cat/socket-N/hwsharedbits
Per socket mask data:
xxxxxxx/cat/socket-N/cos-id-0/
...
xxxxxxx/cat/socket-N/cos-id-N/inuse
/cat_mask
/cdp_mask <- Data mask if CDP enabled
Per cpu default cos id for the cpus on that socket:
xxxxxxx/cat/socket-N/cpu-x/default_cosid
...
xxxxxxx/cat/socket-N/cpu-N/default_cosid
The above allows a simple cpu based partitioning. All tasks which do
not have a cache partition assigned on a particular socket use the
default one of the cpu they are running on.
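For illustration, a minimal userspace sketch of such a cpu based setup
using the proposed layout ("xxxxxxx" is the unspecified mount point from
above, the mask and id values are arbitrary, and the helper is made up):

/* Illustration only: give COS id 1 on socket 0 two cache ways and make
   it the default for CPU 3, via the file layout proposed above. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int cat_write(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0)
                return -1;
        if (write(fd, val, strlen(val)) < 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}

int main(void)
{
        cat_write("xxxxxxx/cat/socket-0/cos-id-1/cat_mask", "0x03");
        cat_write("xxxxxxx/cat/socket-0/cpu-3/default_cosid", "1");
        return 0;
}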
Now for the task(s) partitioning:
xxxxxxx/cat/partitions/
Under that directory one can create partitions
xxxxxxx/cat/partitions/p1/tasks
/socket-0/cosid
...
/socket-n/cosid
The default value for the per socket cosid is COSID_DEFAULT, which
causes the task(s) to use the per cpu default id.
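A minimal sketch of the resulting lookup, e.g. at context switch time.
The name COSID_DEFAULT is from the proposal; its value and every other
name below are made up for illustration only:

/* Illustration only. */
#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/topology.h>

#define COSID_DEFAULT (~0U)             /* placeholder value            */

DEFINE_PER_CPU(u32, cat_default_cosid); /* the per cpu default id       */

/* Stub: would come from the task's partition (socket-N/cosid above). */
static u32 task_cosid_on(struct task_struct *p, int socket)
{
        return COSID_DEFAULT;
}

static u32 effective_cosid(struct task_struct *p, int cpu)
{
        u32 cosid = task_cosid_on(p, topology_physical_package_id(cpu));

        /* No partition assigned on this socket: fall back to the
           default COS id of the CPU the task is running on. */
        if (cosid == COSID_DEFAULT)
                cosid = per_cpu(cat_default_cosid, cpu);
        return cosid;
}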
Thoughts?
Thanks,
tglx
On Wed, 18 Nov 2015 19:25:03 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
>
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
>
> Aside of that I don't think that a system wide consistent view is
> useful at all.
This is a great writeup! I agree with everything you said.
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
>
> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
>
> At top level:
>
> xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
> xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
> xxxxxxx/cat/cdp_enable <- Depends on CDP availability
>
> Per socket data:
>
> xxxxxxx/cat/socket-0/
> ...
> xxxxxxx/cat/socket-N/l3_size
> xxxxxxx/cat/socket-N/hwsharedbits
>
> Per socket mask data:
>
> xxxxxxx/cat/socket-N/cos-id-0/
> ...
> xxxxxxx/cat/socket-N/cos-id-N/inuse
> /cat_mask
> /cdp_mask <- Data mask if CDP enabled
>
> Per cpu default cos id for the cpus on that socket:
>
> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> ...
> xxxxxxx/cat/socket-N/cpu-N/default_cosid
>
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.
>
> Now for the task(s) partitioning:
>
> xxxxxxx/cat/partitions/
>
> Under that directory one can create partitions
>
> xxxxxxx/cat/partitions/p1/tasks
> /socket-0/cosid
> ...
> /socket-n/cosid
>
> The default value for the per socket cosid is COSID_DEFAULT, which
> causes the task(s) to use the per cpu default id.
I hope I've got all the details right, but this proposal looks awesome.
There are more people who seem to agree with something like this.
Btw, I think it should be possible to implement this with cgroups. But
I too don't care that much about cgroups vs. syscalls.
+Tony
> -----Original Message-----
> From: Luiz Capitulino [mailto:[email protected]]
> Sent: Wednesday, November 18, 2015 11:38 AM
> To: Thomas Gleixner
> Cc: LKML; Peter Zijlstra; [email protected]; Marcelo Tosatti; Shivappa, Vikas; Tejun
> Heo; Yu, Fenghua; Auld, Will; Dugger, Donald D; [email protected]
> Subject: Re: [RFD] CAT user space interface revisited
>
> On Wed, 18 Nov 2015 19:25:03 +0100 (CET) Thomas Gleixner
> <[email protected]> wrote:
>
> [full quote of Thomas' proposal and Luiz's reply trimmed]
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> [full quote of the proposal trimmed]
The cgroups interface works, but moves the problem of contiguous
allocation to userspace, and is incompatible with cache allocations
on demand.
Have to solve the kernel threads VS cgroups issue...
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> [full quote trimmed]
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
No you don't.
> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
>
> At top level:
>
> xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
> xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
> xxxxxxx/cat/cdp_enable <- Depends on CDP availability
>
> Per socket data:
>
> xxxxxxx/cat/socket-0/
> ...
> xxxxxxx/cat/socket-N/l3_size
> xxxxxxx/cat/socket-N/hwsharedbits
>
> Per socket mask data:
>
> xxxxxxx/cat/socket-N/cos-id-0/
> ...
> xxxxxxx/cat/socket-N/cos-id-N/inuse
> /cat_mask
> /cdp_mask <- Data mask if CDP enabled
There is no need to expose all this to userspace, but for some unknown
reason people seem to be fond of that, so let's pretend it's necessary.
> Per cpu default cos id for the cpus on that socket:
>
> xxxxxxx/cat/socket-N/cpu-x/default_cosid
> ...
> xxxxxxx/cat/socket-N/cpu-N/default_cosid
>
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.
A task which does not have a partition assigned to it
has to use the "other tasks" group (COSid0), so that it does
not interfere with the cache reservations of other tasks.
All that is necessary is reservations {size, type} and lists of reservations
per task. This is the right level to expose this to userspace, without
userspace having to care about unnecessary HW details.
> [remainder of quote trimmed]
Again: you don't need to look into the MSR table and relate it
to tasks if you store the data as:
task group 1 = {
reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
}
task group 2 = {
reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
}
Task group 1 and task group 2 share reservation-1.
This is what userspace is going to expose to users, of course.
If you expose the MSRs to userspace, you force userspace to convert
from this format to the MSRs (minding whether there
are contiguous regions available, and the region shared with HW).
- The bits which select a cache partition need to be consecutive
BUT, for our use case the cgroups interface works as well, so let's
go with that (Tejun apparently had a use case where tasks were allowed to
set reservations themselves, in response to external events).
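A rough C sketch of the task group / reservation layout described above
(all type and field names are invented for illustration; the kernel would
map each reservation to a contiguous CBM region and COS ids per socket
behind the scenes):

/* Illustration only. */
enum res_type { RES_DATA, RES_CODE };

struct cache_reservation {
        unsigned long kbytes;           /* requested size               */
        enum res_type type;             /* data or code (CDP)           */
        unsigned long socketmask;       /* sockets it applies to        */
};

#define MAX_GROUP_RESERVATIONS 8        /* arbitrary for the sketch     */

struct task_cache_group {
        /* Reservations can be shared between groups, hence pointers. */
        struct cache_reservation *res[MAX_GROUP_RESERVATIONS];
        int nr_res;
};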
On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > [full quote trimmed]
> > Assume that you have isolated a CPU and run your important task on
> > it. You give that task a slice of cache. Now that task needs kernel
> > services which run in kernel threads on that CPU. We really don't want
> > to (and cannot) hunt down random kernel threads (think cpu bound
> > worker threads, softirq threads ....) and give them another slice of
> > cache. What we really want is:
> >
> > 1 1 1 1 0 0 0 0 <- Default cache
> > 0 0 0 0 1 1 1 0 <- Cache for important task
> > 0 0 0 0 0 0 0 1 <- Cache for CPU of important task
> >
> > It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.
Well, any work done on behalf of the important task should have its cache
protected as well (for example irq handling threads).
But for certain kernel tasks for which L3 cache is not beneficial
(eg: kernel samepage merging), it might be useful to exclude such tasks
from the "important, do not flush" L3 cache portion.
> > [quote trimmed]
> > It's hard. Policies are hard by definition, but this one is harder
> > than most other policies due to the inherent limitations.
That is exactly why software should be allowed to automatically
configure the policies.
On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > [full quote trimmed]
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
>
> No you don't.
Actually, there is a point that is useful: you might want the important
application to share the L3 portion with HW (that HW DMAs into), and
have only the application and the HW use that region.
So it's a good point that controlling the exact position of the reservation
is important.
Marcelo,
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
Can you please trim your replies? It's really annoying having to
search for a single line of reply.
> The cgroups interface works, but moves the problem of contiguous
> allocation to userspace, and is incompatible with cache allocations
> on demand.
>
> Have to solve the kernel threads VS cgroups issue...
Sorry, I have no idea what you want to tell me.
Thanks,
tglx
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > Assume that you have isolated a CPU and run your important task on
> > > it. You give that task a slice of cache. Now that task needs kernel
> > > services which run in kernel threads on that CPU. We really don't want
> > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > worker threads, softirq threads ....) and give them another slice of
> > > cache. What we really want is:
> > >
> > > 1 1 1 1 0 0 0 0 <- Default cache
> > > 0 0 0 0 1 1 1 0 <- Cache for important task
> > > 0 0 0 0 0 0 0 1 <- Cache for CPU of important task
> > >
> > > It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
>
> Well any work on behalf of the important task, should have its cache
> protected as well (example irq handling threads).
Right, but that's nothing you can do automatically and certainly not
from a random application.
> But for certain kernel tasks for which L3 cache is not beneficial
> (eg: kernel samepage merging), it might useful to exclude such tasks
> from the "important, do not flush" L3 cache portion.
Sure it might be useful, but this needs to be done on a case by case
basis and there is no way to do this in any automated way.
> > > It's hard. Policies are hard by definition, but this one is harder
> > > than most other policies due to the inherent limitations.
>
> That is exactly why it should be allowed for software to automatically
> configure the policies.
There is nothing you can do automatically. If you want to allow
applications to set the policies themselves, then you need to assign a
portion of the bitmask space and a portion of the COS id space to that
application and then let it do with that space what it wants.
That's where cgroups come into play. But that does not solve the other
issues of "global" configuration, i.e. CPU defaults etc.
Thanks,
tglx
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
>
> No you don't.
Because you have a use case which allows you to write some policy
translator? I seriously doubt that it is general enough.
> Again: you don't need to look into the MSR table and relate it
> to tasks if you store the data as:
>
> task group 1 = {
> reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> }
>
> task group 2 = {
> reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> }
>
> Task group 1 and task group 2 share reservation-1.
>
> This is what userspace is going to expose to users, of course.
> If you expose the MSRs to userspace, you force userspace to convert
> from this format to the MSRs (minding whether there
> are contiguous regions available, and the region shared with HW).
Fair enough. I'm not too fond of the exposure of the MSRs, but I
chose this just to explain the full problem space and the various
requirements we might have across the full application space.
If we can come up with an abstract way which does not impose
restrictions on the overall configuration abilities, I'm all for it.
> - The bits which select a cache partition need to be consecutive
>
> BUT, for our usecase the cgroups interface works as well, so lets
> go with that (Tejun apparently had a usecase where tasks were allowed to
> set reservations themselves, on response to external events).
Can you please set aside your narrow use case view for a moment and
just think about the full application space? We are not designing such
an interface for a single use case.
Thanks,
tglx
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> Actually, there is a point that is useful: you might want the important
> application to share the L3 portion with HW (that HW DMAs into), and
> have only the application and the HW use that region.
>
> So its a good point that controlling the exact position of the reservation
> is important.
I'm glad you figured that out yourself. :)
Thanks,
tglx
On Thu, 19 Nov 2015 09:35:34 +0100 (CET)
Thomas Gleixner <[email protected]> wrote:
> > Well any work on behalf of the important task, should have its cache
> > protected as well (example irq handling threads).
>
> Right, but that's nothing you can do automatically and certainly not
> from a random application.
Right, and that's not a problem. For the use-cases CAT is intended for,
manual and per-workload system setup is very common. Things like
thread pinning, hugepage reservation, CPU isolation, nohz_full, etc.
require manual setup too.
On Wed, Nov 18, 2015 at 11:05:35PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > [full quote trimmed]
> > > So now to the interface part. Unfortunately we need to expose this
> > > very close to the hardware implementation as there are really no
> > > abstractions which allow us to express the various bitmap
> > > combinations. Any abstraction I tried to come up with renders that
> > > thing completely useless.
> >
> > No you don't.
>
> Actually, there is a point that is useful: you might want the important
> application to share the L3 portion with HW (that HW DMAs into), and
> have only the application and the HW use that region.
Actually, I don't see why that makes sense.
So "share the L3 portion" means being allowed to reclaim data from that
portion of L3 cache.
Why would you want to allow the application and HW to reclaim from the same
region? I don't know.
But exposing the HW interface allows you to do that, if some reason
for doing so exists.
Exposing the HW interface:
--------------------------
Pros: *1) Can do whatever combination necessary.
Cons: *2) Userspace has to deal with the contiguity issue
(example: upon an allocation request, "compacting" the CBM bits
can allow the allocation request to be successful (that is,
enough contiguous bits become available), but "compacting" means
moving CBM bits around, which means applications will lose
their reservation at the time the CBM bit positions are moved,
so it can affect running code). See the sketch after this list.
*3) Userspace has to deal with the conversion from kbytes to cache ways.
*4) Userspace has to deal with locking access to the interface.
* Userspace has no access to the timing of sched-ins/sched-outs,
so it cannot perform optimizations based on that information.
Not exposing the HW interface:
------------------------------
Pros: *10) Can use whatever combination necessary, provided that you
extend the interface.
*11) Allows the kernel to optimize usage of the reservations, because only
the kernel knows the times of scheduling.
*12) Allows the kernel to handle 2,3,4, rather than having userspace
handle it.
*13) Allows applications to set cache reservations themselves, directly
via an ioctl or system call.
Cons:
* There are users of the cgroups interface today; they will have
to change.
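To illustrate the contiguity problem in *2 (a sketch only, with made-up
names, not code from any existing patch): finding room for a new
reservation means searching for a long enough run of free bits, and that
can fail even when enough bits are free in total - exactly the case where
"compacting" existing masks, and thereby disturbing running tasks, would
be needed:

/* Illustration only: find @len contiguous free bits in a @cbm_bits wide
   CBM, where @used is the OR of all currently allocated masks.
   Returns the new mask, or 0 if only a fragmented fit exists. */
static unsigned long find_contig_region(unsigned long used, int cbm_bits, int len)
{
        unsigned long mask = (1UL << len) - 1;
        int pos;

        for (pos = 0; pos + len <= cbm_bits; pos++)
                if (!(used & (mask << pos)))
                        return mask << pos;
        return 0;
}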
On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > Actually, there is a point that is useful: you might want the important
> > application to share the L3 portion with HW (that HW DMAs into), and
> > have only the application and the HW use that region.
> >
> > So its a good point that controlling the exact position of the reservation
> > is important.
>
> I'm glad you figured that out yourself. :)
>
> Thanks,
>
> tglx
The HW is a reclaimer of the L3 region shared with HW.
You might want to keep any threads from reclaiming from
that region.
On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > Actually, there is a point that is useful: you might want the important
> > > application to share the L3 portion with HW (that HW DMAs into), and
> > > have only the application and the HW use that region.
> > >
> > > So its a good point that controlling the exact position of the reservation
> > > is important.
> >
> > I'm glad you figured that out yourself. :)
> >
> > Thanks,
> >
> > tglx
>
> The HW is a reclaimer of the L3 region shared with HW.
>
> You might want to remove any threads from reclaiming from
> that region.
I might for some threads, but certainly not for those which need to
access DMA buffers. Throwing away 10% of L3 just because you don't
want to deal with it at the interface level is hilarious.
Thanks,
tglx
On Thu, Nov 19, 2015 at 09:35:34AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > > Assume that you have isolated a CPU and run your important task on
> > > > it. You give that task a slice of cache. Now that task needs kernel
> > > > services which run in kernel threads on that CPU. We really don't want
> > > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > > worker threads, softirq threads ....) and give them another slice of
> > > > cache. What we really want is:
> > > >
> > > > 1 1 1 1 0 0 0 0 <- Default cache
> > > > 0 0 0 0 1 1 1 0 <- Cache for important task
> > > > 0 0 0 0 0 0 0 1 <- Cache for CPU of important task
> > > >
> > > > It would even be sufficient for particular use cases to just associate
> > > > a piece of cache to a given CPU and do not bother with tasks at all.
> >
> > Well any work on behalf of the important task, should have its cache
> > protected as well (example irq handling threads).
>
> Right, but that's nothing you can do automatically and certainly not
> from a random application.
>
> > But for certain kernel tasks for which L3 cache is not beneficial
> > (eg: kernel samepage merging), it might useful to exclude such tasks
> > from the "important, do not flush" L3 cache portion.
>
> Sure it might be useful, but this needs to be done on a case by case
> basis and there is no way to do this in any automated way.
>
> > > > It's hard. Policies are hard by definition, but this one is harder
> > > > than most other policies due to the inherent limitations.
> >
> > That is exactly why it should be allowed for software to automatically
> > configure the policies.
>
> There is nothing you can do automatically.
Every cacheline brought into the L3 has a reaccess time (the time from
when it was first brought in to when it is reaccessed).
Assume you have a single threaded app, i.e. a sequence of cacheline
accesses.
Now if there are groups of accesses which have long reaccess times
(meaning that keeping them in L3 is not beneficial), and which are large
enough to justify the OS notification, the application can notify the OS
to switch to a constrained COSid (so that its L3 misses only reclaim from
that small portion of the L3 cache).
> If you want to allow
> applications to set the policies themself, then you need to assign a
> portion of the bitmask space and a portion of the cos id space to that
> application and then let it do with that space what it wants.
That's why you should specify the requirements independently of each
other (the requirement in this case being the size and type of the
reservation, which is tied to the application), and let something else
figure out how they all fit together.
> That's where cgroups come into play. But that does not solve the other
> issues of "global" configuration, i.e. CPU defaults etc.
I don't understand what you mean by issues of global configuration.
CPU defaults: a task is associated with a COSid. A COSid points to
a set of CBMs (one CBM per socket). What defaults are you talking about?
But the interfaces do not exclude each other (the ioctl or syscall
interfaces and the manual direct MSR interface can coexist). There is
time pressure to integrate something workable for the present use cases
(none are in the class "applications set reservations themselves").
Peter has some objections against ioctls. So for something workable,
we'll have to handle the numbered issues pointed out in the other e-mail
(2, 3, 4) in userspace.
On Fri, Nov 20, 2015 at 08:53:34AM +0100, Thomas Gleixner wrote:
> On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> > On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > > Actually, there is a point that is useful: you might want the important
> > > > application to share the L3 portion with HW (that HW DMAs into), and
> > > > have only the application and the HW use that region.
> > > >
> > > > So its a good point that controlling the exact position of the reservation
> > > > is important.
> > >
> > > I'm glad you figured that out yourself. :)
> > >
> > > Thanks,
> > >
> > > tglx
> >
> > The HW is a reclaimer of the L3 region shared with HW.
> >
> > You might want to remove any threads from reclaiming from
> > that region.
>
> I might for some threads, but certainly not for those which need to
> access DMA buffers.
Yes, when I wrote "it's a good point that controlling the exact position
of the reservation is important" I had that in mind as well.
But it's wrong: not having a bit set in the CBM for the portion of L3
cache which is shared with HW only means "for cacheline misses of the
application, do not evict cachelines from this portion" - the application
can still access the data the HW placed there.
So yes, you might want to exclude the application which accesses DMA
buffers from reclaiming cachelines in the portion shared with HW,
to keep those cachelines longer in L3.
> Throwing away 10% of L3 just because you don't
> want to deal with it at the interface level is hillarious.
If there is interest in per-application configuration then it can
be integrated as well.
Thanks for your time.
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
>
> Let's look at partitioning itself. We have two options:
>
> 1) Per task partitioning
>
> 2) Per CPU partitioning
>
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
I would second this. In practice per CPU partitioning is useful for
realtime as well. And I can see three possible solutions:
1) What you suggested below, to address both problems in one
   framework. But I wonder if it would end up being too complex.
2) Achieve per CPU partitioning through per task partitioning. For
   example, if the current CAT patch can solve the kernel threads
   problem, then together with CPU pinning we can set the same CBM
   for all the tasks/kernel threads that run on an isolated CPU.
3) I wonder if it is feasible to separate the two requirements? For
   example, divide the work into three components: rdt-base, a
   per task interface (the current cgroup interface/IOCTL or something)
   and a per CPU interface. The two interfaces are exclusive and
   selected at build time. One argument against this option is that
   even with per CPU partitioning we still need per task partitioning,
   in which case we are back to option 1) again.
Thanks,
Chao
On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > tglx
>
> Again: you don't need to look into the MSR table and relate it
> to tasks if you store the data as:
>
> task group 1 = {
> reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> }
>
> task group 2 = {
> reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> }
>
> Task group 1 and task group 2 share reservation-1.
Because there is only size but no CBM position info, I guess different
reservations will not overlap each other, right?
Personally I like this way of exposing minimal information to userspace.
I can see it working well except for one concern about losing flexibility:
For instance, take a box for which the full CBM is 0xfffff. After
creating/freeing cache reservations for a while we end up with:
reservation1: 0xf0000
reservation2: 0x00ff0
Now someone wants to request a reservation whose size is 0xff (8 bits),
so what should the kernel do? It could just return an error, or do some
moving/merging (e.g. reservation2: 0x00ff0 => 0x0ff00) and then satisfy
the request. But I don't know if the moving/merging would cause delays
for the tasks that are using it.
Thanks,
Chao
On Tue, Nov 24, 2015 at 03:31:24PM +0800, Chao Peng wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> >
> > Let's look at partitioning itself. We have two options:
> >
> > 1) Per task partitioning
> >
> > 2) Per CPU partitioning
> >
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
>
> I would second this. In practice per CPU partitioning is useful for
> realtime as well. And I can see three possible solutions:
>
> 1) What you suggested below, to address both problems in one
> framework. But I wonder if it would end with too complex.
>
> 2) Achieve per CPU partitioning with per task partitioning. For
> example, if current CAT patch can solve the kernel threads
> problem, together with CPU pinning, we then can set a same CBM
> for all the tasks/kernel threads run on an isolated CPU.
As for the kernel threads problem, it seems it's a silly limitation of
the code which handles writes to cgroups:
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f89d929..0603652 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2466,16 +2466,6 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	if (threadgroup)
 		tsk = tsk->group_leader;
 
-	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated. Just say no.
-	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
-	}
-
 	get_task_struct(tsk);
 	rcu_read_unlock();
For a cgroup hierarchy with no cpusets (such as a CAT-only hierarchy) this
limitation makes no sense (I am looking for a place to move this check to).
Any ETA on per-socket bitmasks?
>
> 3) I wonder if it feasible to separate the two requirements? For
> example, divides the work into three components: rdt-base,
> per task interface (current cgroup interface/IOCTL or something)
> and per CPU interface. The two interfaces are exclusive and
> selected at build time. One thing to reject this option would be
> even with per CPU partitioning, we still need per task partitioning,
> in that case we will go to option 1) again.
>
> Thanks,
> Chao
On Tue, Nov 24, 2015 at 07:25:43PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 24, 2015 at 04:27:54PM +0800, Chao Peng wrote:
> > On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > > > tglx
> > >
> > > Again: you don't need to look into the MSR table and relate it
> > > to tasks if you store the data as:
> > >
> > > task group 1 = {
> > > reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> > > reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
> > > }
> > >
> > > task group 2 = {
> > > reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
> > > reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
> > > }
> > >
> > > Task group 1 and task group 2 share reservation-1.
> >
> > Because there is only size but not CBM position info, I guess for
> > different reservations they will not overlap each other, right?
>
> Reservation 1 is shared between task group 1 and task group 2
> so the CBMs overlap (by 80Kb, rounded).
>
> > Personally I like this way of exposing minimal information to userspace.
> > I can think it working well except for one concern of losing flexibility:
> >
> > For instance, there is a box for which the full CBM is 0xfffff. After
> > cache reservation creating/freeing for a while we then have reservations:
> >
> > reservation1: 0xf0000
> > reservation2: 0x00ff0
> >
> > Now people want to request a reservation which size is 0xff, so how
> > will kernel do at this time? It could return just error or do some
> > moving/merging (e.g. for reservation2: 0x00ff0 => 0x0ff00) and then
> > satisfy the request. But I don't know if the moving/merging will cause
> > delay for tasks that is using it.
>
> Right, i was thinking of adding a "force" parameter.
>
> So, default behaviour of attach: do not merge.
> "force" behaviour of attach: move reservations around and merge if
> necessary.
To make that decision, userspace would need to know whether a merge can
be performed, that is, whether the particular reservations involved can be
moved (the movable property is per-reservation, depending on whether it is
OK for the given app to take cacheline faults or not).
Anyway, that's for later.
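A standalone sketch of the fragmentation problem under discussion (nothing
here is from a posted patch; it just models the contiguous-bits constraint
inside one socket's CBM):

/* Reservations must be contiguous bit runs inside the socket CBM, so a
 * request can fail even though enough free bits exist, and then only a
 * move/merge of a movable reservation can satisfy it. */
#define CBM_BITS 20                       /* full CBM 0xfffff, as in the example */

/* Lowest start bit of a free contiguous run of 'len' bits, or -1 if the
 * free space is too fragmented.  'used' has a bit set for every way that
 * some reservation already owns. */
static int find_contiguous_run(unsigned int used, int len)
{
        unsigned int mask = (1u << len) - 1;

        for (int start = 0; start + len <= CBM_BITS; start++)
                if ((used & (mask << start)) == 0)
                        return start;
        return -1;
}

/* With used = 0xf0000 | 0x00ff0, a request for 8 contiguous bits fails
 * (find_contiguous_run(0xf0ff0, 8) == -1) although 8 bits are free: the
 * free bits form two runs of 4.  Only after relocating reservation2
 * (0x00ff0 -> 0x0ff00) does find_contiguous_run(0xfff00, 8) return 0. */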
> From: Thomas Gleixner [mailto:[email protected]]
> Sent: Wednesday, November 18, 2015 10:25 AM
> Folks!
>
> After rereading the mail flood on CAT and staring into the SDM for a while, I
> think we all should sit back and look at it from scratch again w/o our
> preconceptions - I certainly had to put my own away.
>
> Let's look at the properties of CAT again:
>
> - It's a per socket facility
>
> - CAT slots can be associated to external hardware. This
> association is per socket as well, so different sockets can have
> different behaviour. I missed that detail when staring the first
> time, thanks for the pointer!
>
> - The association ifself is per cpu. The COS selection happens on a
> CPU while the set of masks which are selected via COS are shared
> by all CPUs on a socket.
>
> There are restrictions which CAT imposes in terms of configurability:
>
> - The bits which select a cache partition need to be consecutive
>
> - The number of possible cache association masks is limited
>
> Let's look at the configurations (CDP omitted and size restricted)
>
> Default: 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1
> 1 1 1 1 1 1 1 1
>
> Shared: 1 1 1 1 1 1 1 1
> 0 0 1 1 1 1 1 1
> 0 0 0 0 1 1 1 1
> 0 0 0 0 0 0 1 1
>
> Isolated: 1 1 1 1 0 0 0 0
> 0 0 0 0 1 1 0 0
> 0 0 0 0 0 0 1 0
> 0 0 0 0 0 0 0 1
>
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity of a
> sysadmin. The worst outcome might be L3 disabled for everything, so what?
>
> Now that gets even more convoluted if CDP comes into play and we really
> need to look at CDP right now. We might end up with something which looks
> like this:
>
> 1 1 1 1 0 0 0 0 Code
> 1 1 1 1 0 0 0 0 Data
> 0 0 0 0 0 0 1 0 Code
> 0 0 0 0 1 1 0 0 Data
> 0 0 0 0 0 0 0 1 Code
> 0 0 0 0 1 1 0 0 Data
> or
> 0 0 0 0 0 0 0 1 Code
> 0 0 0 0 1 1 0 0 Data
> 0 0 0 0 0 0 0 1 Code
> 0 0 0 0 0 1 1 0 Data
>
> Let's look at partitioning itself. We have two options:
>
> 1) Per task partitioning
>
> 2) Per CPU partitioning
>
> So far we only talked about #1, but I think that #2 has a value as well. Let me
> give you a simple example.
>
> Assume that you have isolated a CPU and run your important task on it. You
> give that task a slice of cache. Now that task needs kernel services which run
> in kernel threads on that CPU. We really don't want to (and cannot) hunt
> down random kernel threads (think cpu bound worker threads, softirq
> threads ....) and give them another slice of cache. What we really want is:
>
> 1 1 1 1 0 0 0 0 <- Default cache
> 0 0 0 0 1 1 1 0 <- Cache for important task
> 0 0 0 0 0 0 0 1 <- Cache for CPU of important task
>
> It would even be sufficient for particular use cases to just associate a piece of
> cache to a given CPU and do not bother with tasks at all.
>
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on my new
> intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same meaning
> on all sockets and restrict it to per task partitioning.
>
> Even if next generation systems will have more COS ids available, there are
> not going to be enough to have a system wide consistent view unless we
> have COS ids > nr_cpus.
>
> Aside of that I don't think that a system wide consistent view is useful at all.
>
> - If a task migrates between sockets, it's going to suffer anyway.
> Real sensitive applications will simply pin tasks on a socket to
> avoid that in the first place. If we make the whole thing
> configurable enough then the sysadmin can set it up to support
> even the nonsensical case of identical cache partitions on all
> sockets and let tasks use the corresponding partitions when
> migrating.
>
> - The number of cache slices is going to be limited no matter what,
> so one still has to come up with a sensible partitioning scheme.
>
> - Even if we have enough cos ids the system wide view will not make
> the configuration problem any simpler as it remains per socket.
>
> It's hard. Policies are hard by definition, but this one is harder than most
> other policies due to the inherent limitations.
>
> So now to the interface part. Unfortunately we need to expose this very
> close to the hardware implementation as there are really no abstractions
> which allow us to express the various bitmap combinations. Any abstraction I
> tried to come up with renders that thing completely useless.
>
> I was not able to identify any existing infrastructure where this really fits in. I
> chose a directory/file based representation. We certainly could do the same
Would this be under /sys/devices/system/?
Then create a qos/cat directory there. In the future, other directories may
be created, e.g. qos/mbm?
Thanks.
-Fenghua
On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:[email protected]]
> > Sent: Wednesday, November 18, 2015 10:25 AM
> > Folks!
> >
> > After rereading the mail flood on CAT and staring into the SDM for a while, I
> > think we all should sit back and look at it from scratch again w/o our
> > preconceptions - I certainly had to put my own away.
> >
> > Let's look at the properties of CAT again:
> >
> > - It's a per socket facility
> >
> > - CAT slots can be associated to external hardware. This
> > association is per socket as well, so different sockets can have
> > different behaviour. I missed that detail when staring the first
> > time, thanks for the pointer!
> >
> > - The association ifself is per cpu. The COS selection happens on a
> > CPU while the set of masks which are selected via COS are shared
> > by all CPUs on a socket.
> >
> > There are restrictions which CAT imposes in terms of configurability:
> >
> > - The bits which select a cache partition need to be consecutive
> >
> > - The number of possible cache association masks is limited
> >
> > Let's look at the configurations (CDP omitted and size restricted)
> >
> > Default: 1 1 1 1 1 1 1 1
> > 1 1 1 1 1 1 1 1
> > 1 1 1 1 1 1 1 1
> > 1 1 1 1 1 1 1 1
> >
> > Shared: 1 1 1 1 1 1 1 1
> > 0 0 1 1 1 1 1 1
> > 0 0 0 0 1 1 1 1
> > 0 0 0 0 0 0 1 1
> >
> > Isolated: 1 1 1 1 0 0 0 0
> > 0 0 0 0 1 1 0 0
> > 0 0 0 0 0 0 1 0
> > 0 0 0 0 0 0 0 1
> >
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity of a
> > sysadmin. The worst outcome might be L3 disabled for everything, so what?
> >
> > Now that gets even more convoluted if CDP comes into play and we really
> > need to look at CDP right now. We might end up with something which looks
> > like this:
> >
> > 1 1 1 1 0 0 0 0 Code
> > 1 1 1 1 0 0 0 0 Data
> > 0 0 0 0 0 0 1 0 Code
> > 0 0 0 0 1 1 0 0 Data
> > 0 0 0 0 0 0 0 1 Code
> > 0 0 0 0 1 1 0 0 Data
> > or
> > 0 0 0 0 0 0 0 1 Code
> > 0 0 0 0 1 1 0 0 Data
> > 0 0 0 0 0 0 0 1 Code
> > 0 0 0 0 0 1 1 0 Data
> >
> > Let's look at partitioning itself. We have two options:
> >
> > 1) Per task partitioning
> >
> > 2) Per CPU partitioning
> >
> > So far we only talked about #1, but I think that #2 has a value as well. Let me
> > give you a simple example.
> >
> > Assume that you have isolated a CPU and run your important task on it. You
> > give that task a slice of cache. Now that task needs kernel services which run
> > in kernel threads on that CPU. We really don't want to (and cannot) hunt
> > down random kernel threads (think cpu bound worker threads, softirq
> > threads ....) and give them another slice of cache. What we really want is:
> >
> > 1 1 1 1 0 0 0 0 <- Default cache
> > 0 0 0 0 1 1 1 0 <- Cache for important task
> > 0 0 0 0 0 0 0 1 <- Cache for CPU of important task
> >
> > It would even be sufficient for particular use cases to just associate a piece of
> > cache to a given CPU and do not bother with tasks at all.
> >
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on my new
> > intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same meaning
> > on all sockets and restrict it to per task partitioning.
> >
> > Even if next generation systems will have more COS ids available, there are
> > not going to be enough to have a system wide consistent view unless we
> > have COS ids > nr_cpus.
> >
> > Aside of that I don't think that a system wide consistent view is useful at all.
> >
> > - If a task migrates between sockets, it's going to suffer anyway.
> > Real sensitive applications will simply pin tasks on a socket to
> > avoid that in the first place. If we make the whole thing
> > configurable enough then the sysadmin can set it up to support
> > even the nonsensical case of identical cache partitions on all
> > sockets and let tasks use the corresponding partitions when
> > migrating.
> >
> > - The number of cache slices is going to be limited no matter what,
> > so one still has to come up with a sensible partitioning scheme.
> >
> > - Even if we have enough cos ids the system wide view will not make
> > the configuration problem any simpler as it remains per socket.
> >
> > It's hard. Policies are hard by definition, but this one is harder than most
> > other policies due to the inherent limitations.
> >
> > So now to the interface part. Unfortunately we need to expose this very
> > close to the hardware implementation as there are really no abstractions
> > which allow us to express the various bitmap combinations. Any abstraction I
> > tried to come up with renders that thing completely useless.
> >
> > I was not able to identify any existing infrastructure where this really fits in. I
> > chose a directory/file based representation. We certainly could do the same
>
> Is this be /sys/devices/system/?
> Then create qos/cat directory. In the future, other directories may be created
> e.g. qos/mbm?
>
> Thanks.
>
> -Fenghua
Fenghua,
I suppose Thomas is talking about the socketmask only, as discussed in
the call with Intel.
Thomas, is that correct? (if you want a change in directory structure,
please explain the whys, because we don't need that change in directory
structure).
Marcelo,
On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > > From: Thomas Gleixner [mailto:[email protected]]
> > >
> > > I was not able to identify any existing infrastructure where this really fits in. I
> > > chose a directory/file based representation. We certainly could do the same
> >
> > Is this be /sys/devices/system/?
> > Then create qos/cat directory. In the future, other directories may be created
> > e.g. qos/mbm?
>
> I suppose Thomas is talking about the socketmask only, as discussed in
> the call with Intel.
I have no idea what you talked about in a RH/Intel call.
> Thomas, is that correct? (if you want a change in directory structure,
> please explain the whys, because we don't need that change in directory
> structure).
Can you please start to write coherent and understandable mails? I have no
idea which directory structure, which does not need to be changed, you are
talking about.
I described a directory structure for that qos/cat stuff in my proposal and
that's complete AFAICT.
Thanks,
tglx
On Tue, Dec 29, 2015 at 01:44:16PM +0100, Thomas Gleixner wrote:
> Marcelo,
>
> On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> > On Tue, Dec 22, 2015 at 06:12:05PM +0000, Yu, Fenghua wrote:
> > > > From: Thomas Gleixner [mailto:[email protected]]
> > > >
> > > > I was not able to identify any existing infrastructure where this really fits in. I
> > > > chose a directory/file based representation. We certainly could do the same
> > >
> > > Is this be /sys/devices/system/?
> > > Then create qos/cat directory. In the future, other directories may be created
> > > e.g. qos/mbm?
> >
> > I suppose Thomas is talking about the socketmask only, as discussed in
> > the call with Intel.
>
> I have no idea about what you talked in a RH/Intel call.
>
> > Thomas, is that correct? (if you want a change in directory structure,
> > please explain the whys, because we don't need that change in directory
> > structure).
>
> Can you please start to write coherent and understandable mails? I have no
> idea of which directory structure, which does not need to be changed, you are
> talking.
Thomas,
There is one directory structure in this topic, CAT. That is the
directory structure which is exposed to userspace to control the
CAT HW.
With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
x86: Intel Cache Allocation Technology Support"), the directory
structure there (the files and directories exposed by that patchset)
(*1) does not allow one to configure different CBM masks on each socket
(that is, it forces the user to configure the same CBM mask on every
socket). This is a blocker for us, and it is one of the points in your
proposal.
There was a call between Red Hat and Intel where it was communicated
to Intel, and Intel agreed, that it was necessary to fix this (fix this
== allow different CBM masks on different sockets).
Now, that is one change to the current directory structure (*1).
(*1) modified to allow for different CBM masks on different sockets,
let's say (*2), is what we have been waiting for Intel to post.
It would handle our use case, and all use cases which the current
patchset from Intel already handles (Vikas posted emails mentioning
there are happy users of the current interface, feel free to ask
him for more details).
What I have asked you, and you replied "to go Google read my previous
post", is this:
what are the advantages of your proposal (which is a completely
different directory structure, requiring a complete rewrite)
over (*2)?
(What is my reason behind this: if you, with maintainer veto power,
force your proposal to be accepted, it will be necessary to wait for
another rewrite (a new set of problems, fully thinking through your
proposal, testing it, ...) rather than simply modifying an already
known, reviewed and already used directory structure.
And functionally, your proposal adds nothing to (*2), other than, well,
being a different directory structure.)
If Fenghua or you post a patchset with your proposal, say in 2 weeks,
I am fine with that. But since I doubt that will be the case, I am
pushing for the interface which requires the least amount of changes
(and therefore the least amount of time) to be integrated.
From your email:
"It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and do not bother with tasks at all.
We really need to make this as configurable as possible from userspace
without imposing random restrictions to it. I played around with it on
my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning."
Yes, that's the issue we hit, that is the modification that was agreed
with Intel, and that's what we are waiting for them to post.
> I described a directory structure for that qos/cat stuff in my proposal and
> that's complete AFAICT.
OK, let's make the job for the submitter easier. You are the maintainer,
so you decide.
Is it enough for you to have (*2) (which was agreed with Intel), or
would you rather integrate the directory structure from
"[RFD] CAT user space interface revisited"?
Thanks.
Marcelo,
On Thu, 31 Dec 2015, Marcelo Tosatti wrote:
First of all thanks for the explanation.
> There is one directory structure in this topic, CAT. That is the
> directory structure which is exposed to userspace to control the
> CAT HW.
>
> With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> x86: Intel Cache Allocation Technology Support"), the directory
> structure there (the files and directories exposed by that patchset)
> (*1) does not allow one to configure different CBM masks on each socket
> (that is, it forces the user to configure the same mask CBM on every
> socket). This is a blocker for us, and it is one of the points in your
> proposal.
>
> There was a call between Red Hat and Intel where it was communicated
> to Intel, and Intel agreed, that it was necessary to fix this (fix this
> == allow different CBM masks on different sockets).
>
> Now, that is one change to the current directory structure (*1).
I have no idea what that would look like. The current structure is a
cgroups based, hierarchy oriented approach, which does not allow simple
things like
T1 00001111
T2 00111100
at least not in a way which is natural to the problem at hand.
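To spell out why the two masks above do not fit a hierarchy (values taken
from the example, the subset rule stated as an assumption of hierarchical
interfaces):

#define T1_CBM 0x0fu    /* 0 0 0 0 1 1 1 1 */
#define T2_CBM 0x3cu    /* 0 0 1 1 1 1 0 0 */
/* T1_CBM & T2_CBM == 0x0c: the masks overlap, yet neither is a subset of
 * the other, so neither group can be expressed as a child of the other in
 * a hierarchy where a child's mask must be contained in its parent's. */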
> (*1) modified to allow for different CBM masks on different sockets,
> lets say (*2), is what we have been waiting for Intel to post.
> It would handle our usecase, and all use-cases which the current
> patchset from Intel already handles (Vikas posted emails mentioning
> there are happy users of the current interface, feel free to ask
> him for more details).
I cannot imagine how that modification to the current interface would solve
that, not to mention per CPU associations, which are not related to tasks at
all.
> What i have asked you, and you replied "to go Google read my previous
> post" is this:
> What are the advantages over you proposal (which is a completely
> different directory structure, requiring a complete rewrite),
> over (*2) ?
>
> (what is my reason behind this: the reason is that if you, with
> maintainer veto power, forces your proposal to be accepted, it will be
> necessary to wait for another rewrite (a new set of problems, fully
> think through your proposal, test it, ...) rather than simply modify an
> already known, reviewed, already used directory structure.
>
> And functionally, your proposal adds nothing to (*2) (other than, well,
> being a different directory structure).
Sorry. I cannot see at all how a modification to the existing interface would
cover all the sensible use cases I described in a coherent way. I really want
to see a proper description of the interface before people start hacking on it
in a frenzy. What you described is a "let's say (*2)" modification. That's
pretty meager.
> If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> i am fine with that. But i since i doubt that will be the case, i am
> pushing for the interface which requires the least amount of changes
> (and therefore the least amount of time) to be integrated.
>
> From your email:
>
> "It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
>
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning."
>
> Yes, thats the issue we hit, that is the modification that was agreed
> with Intel, and thats what we are waiting for them to post.
How do you implement the above - especially that part:
"It would even be sufficient for particular use cases to just associate a
piece of cache to a given CPU and do not bother with tasks at all."
as a "simple" modification to (*1) ?
> > I described a directory structure for that qos/cat stuff in my proposal and
> > that's complete AFAICT.
>
> Ok, lets make the job for the submitter easier. You are the maintainer,
> so you decide.
>
> Is it enough for you to have (*2) (which was agreed with Intel), or
> would you rather prefer to integrate the directory structure at
> "[RFD] CAT user space interface revisited" ?
The only thing I care about as a maintainer is that we merge something which
actually reflects the properties of the hardware and gives the admin the
required flexibility to utilize it fully. I don't care at all if it's my
proposal or something else which allows us to do the same.
Let me copy the relevant bits from my proposal here once more and let me ask
questions about the various points, so you can tell me how that modification
to (*1) is going to deal with them.
>> At top level:
>>
>> xxxxxxx/cat/max_cosids <- Assume that all CPUs are the same
>> xxxxxxx/cat/max_maskbits <- Assume that all CPUs are the same
>> xxxxxxx/cat/cdp_enable <- Depends on CDP availability
Where is that information in (*2) and how is that related to (*1)? If you
think it's not required, please explain why.
>> Per socket data:
>>
>> xxxxxxx/cat/socket-0/
>> ...
>> xxxxxxx/cat/socket-N/l3_size
>> xxxxxxx/cat/socket-N/hwsharedbits
Where is that information in (*2) and how is that related to (*1)? If you
think it's not required, please explain why.
>> Per socket mask data:
>>
>> xxxxxxx/cat/socket-N/cos-id-0/
>> ...
>> xxxxxxx/cat/socket-N/cos-id-N/inuse
>> /cat_mask
>> /cdp_mask <- Data mask if CDP enabled
Where is that information in (*2) and how is that related to (*1)? If you
think it's not required, please explain why.
>> Per cpu default cos id for the cpus on that socket:
>>
>> xxxxxxx/cat/socket-N/cpu-x/default_cosid
>> ...
>> xxxxxxx/cat/socket-N/cpu-N/default_cosid
>>
>> The above allows a simple cpu based partitioning. All tasks which do
>> not have a cache partition assigned on a particular socket use the
>> default one of the cpu they are running on.
Where is that information in (*2) and how is that related to (*1)? If you
think it's not required, please explain why.
>> Now for the task(s) partitioning:
>>
>> xxxxxxx/cat/partitions/
>>
>> Under that directory one can create partitions
>>
>> xxxxxxx/cat/partitions/p1/tasks
>> /socket-0/cosid
>> ...
>> /socket-n/cosid
>>
>> The default value for the per socket cosid is COSID_DEFAULT, which
>> causes the task(s) to use the per cpu default id.
Where is that information in (*2) and how is that related to (*1)? If you
think it's not required, please explain why.
Yes, I ask the same question several times, and I really want to see the
directory/interface structure which solves all of the above before anyone
starts to implement it. We already have a completely useless interface (*1)
and there is no point in implementing another one based on it (*2) just
because it solves your particular issue and is the fastest way forward. User
space interfaces are hard and we really do not need some half-baked solution
which we have to support forever.
Let me enumerate the required points again:
1) Information about the hardware properties
2) Integration of CAT and CDP
3) Per socket cos-id partitioning
4) Per cpu default cos-id association
5) Task association to cos-id
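For concreteness, a minimal standalone sketch (all names and values invented,
nothing below is from a posted patch) of how points 4) and 5) combine when a
task runs on a CPU:

#include <stdio.h>

#define NR_SOCKETS    2
#define NR_CPUS       8
#define COSID_DEFAULT (-1)

struct task {
        int cosid[NR_SOCKETS];     /* point 5: per-task, per-socket cos-id */
};

static int default_cosid[NR_CPUS];                      /* point 4: per-cpu default */
static int cpu_to_socket[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* cos-id that would be programmed into IA32_PQR_ASSOC while 'tsk' runs on 'cpu' */
static int effective_cosid(const struct task *tsk, int cpu)
{
        int socket = cpu_to_socket[cpu];

        if (tsk && tsk->cosid[socket] != COSID_DEFAULT)
                return tsk->cosid[socket];
        return default_cosid[cpu];
}

int main(void)
{
        struct task important = { .cosid = { 3, COSID_DEFAULT } };

        default_cosid[7] = 1;      /* isolated CPU 7 has its own default slice */

        printf("cpu 2 -> cos %d\n", effective_cosid(&important, 2)); /* 3: task cos-id on socket 0 */
        printf("cpu 7 -> cos %d\n", effective_cosid(&important, 7)); /* 1: per-cpu default on socket 1 */
        return 0;
}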
Can you please explain, in a simple directory based scheme like the one I
gave you, how all of these points are going to be solved with a modification
to (*1)?
Thanks,
tglx