From: "Yu, Fenghua" <fenghua.yu@intel.com>
To: Thomas Gleixner <tglx@linutronix.de>, LKML <linux-kernel@vger.kernel.org>
CC: Peter Zijlstra <peterz@infradead.org>, "x86@kernel.org" <x86@kernel.org>,
        Marcelo Tosatti <mtosatti@redhat.com>,
        Luiz Capitulino <lcapitulino@redhat.com>,
        "Shivappa, Vikas" <vikas.shivappa@intel.com>,
        Tejun Heo <tj@kernel.org>,
        "Shankar, Ravi V" <ravi.v.shankar@intel.com>,
        "Luck, Tony" <tony.luck@intel.com>
Subject: RE: [RFD] CAT user space interface revisited
Date: Tue, 22 Dec 2015 18:12:05 +0000
Message-ID: <3E5A0FA7E9CA944F9D5414FEC6C712205DF4E157@ORSMSX106.amr.corp.intel.com>
References: <alpine.DEB.2.11.1511181534450.3761@nanos>
In-Reply-To: <alpine.DEB.2.11.1511181534450.3761@nanos>

> From: Thomas Gleixner [mailto:tglx@linutronix.de]
> Sent: Wednesday, November 18, 2015 10:25 AM
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a while, I
> think we all should sit back and look at it from scratch again w/o our
> preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>    - It's a per socket facility
> 
>    - CAT slots can be associated to external hardware. This
>      association is per socket as well, so different sockets can have
>      different behaviour. I missed that detail when staring at it the
>      first time, thanks for the pointer!
> 
>    - The association itself is per cpu. The COS selection happens on a
>      CPU while the set of masks which are selected via COS are shared
>      by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>    - The bits which select a cache partition need to be consecutive
> 
>    - The number of possible cache association masks is limited
> 
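
The "consecutive bits" restriction above can be checked in a few lines.
This is only a sketch: cbm_is_contiguous() is a hypothetical helper name,
the 32-bit mask width is an assumption, and rejecting an all-zero mask is
also an assumption rather than something stated in the mail.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if cbm is non-zero and its set bits form one
 * consecutive run, e.g. 0x3c (0011 1100) but not 0x05 (0101).
 * Hypothetical helper; the name and the 32-bit width are assumptions.
 */
static bool cbm_is_contiguous(uint32_t cbm)
{
	/* Assumption: an empty mask is treated as invalid. */
	if (cbm == 0)
		return false;

	/* Strip trailing zeros so the run of ones starts at bit 0. */
	while ((cbm & 1) == 0)
		cbm >>= 1;

	/* A run of ones from bit 0 has the form 2^n - 1, so it shares no bit with itself plus one. */
	return (cbm & (cbm + 1)) == 0;
}
```

With this, 0x3c and 0xff pass, while 0x05 is rejected because of the hole
between its two set bits.
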
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 
> Shared:	   1 1 1 1 1 1 1 1
> 	   0 0 1 1 1 1 1 1
> 	   0 0 0 0 1 1 1 1
> 	   0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
> 	   0 0 0 0 1 1 0 0
> 	   0 0 0 0 0 0 1 0
> 	   0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity of a
> sysadmin. The worst outcome might be L3 disabled for everything, so what?
> 
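
To make the difference between the "Shared" and "Isolated" layouts above
concrete, here is a minimal sketch that tests whether a set of masks is
pairwise disjoint. cbms_are_isolated() is a hypothetical name, and reading
the leftmost column of each row as the most significant bit is an
assumption made only for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if no two masks in cbm[0..n-1] have a bit in common,
 * i.e. every cache partition is exclusive to one mask.
 * Hypothetical helper, not taken from any existing code.
 */
static bool cbms_are_isolated(const uint32_t *cbm, size_t n)
{
	uint32_t seen = 0;	/* union of all masks visited so far */

	for (size_t i = 0; i < n; i++) {
		if (cbm[i] & seen)
			return false;	/* overlaps an earlier mask */
		seen |= cbm[i];
	}
	return true;
}
```

Under that bit-order assumption the "Isolated" rows are 0xf0, 0x0c, 0x02,
0x01 and pass the check; the "Shared" rows 0xff, 0x3f, 0x0f, 0x03 do not.
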
> Now that gets even more convoluted if CDP comes into play and we really
> need to look at CDP right now. We might end up with something which looks
> like this:
> 
>    	   1 1 1 1 0 0 0 0	Code
> 	   1 1 1 1 0 0 0 0	Data
> 	   0 0 0 0 0 0 1 0	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> or
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 0 1 1 0	Data
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as well. Let me
> give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on it. You
> give that task a slice of cache. Now that task needs kernel services which run
> in kernel threads on that CPU. We really don't want to (and cannot) hunt
> down random kernel threads (think cpu bound worker threads, softirq
> threads ....) and give them another slice of cache. What we really want is:
> 
>     	 1 1 1 1 0 0 0 0    <- Default cache
> 	 0 0 0 0 1 1 1 0    <- Cache for important task
> 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate a piece of
> cache to a given CPU and not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on my new
> intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same meaning
> on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available, there are
> not going to be enough to have a system wide consistent view unless we
> have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>    Real sensitive applications will simply pin tasks on a socket to
>    avoid that in the first place. If we make the whole thing
>    configurable enough then the sysadmin can set it up to support
>    even the nonsensical case of identical cache partitions on all
>    sockets and let tasks use the corresponding partitions when
>    migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>    so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>    the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder than most
> other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this very
> close to the hardware implementation as there are really no abstractions
> which allow us to express the various bitmap combinations. Any abstraction I
> tried to come up with renders that thing completely useless.
> 
> I was not able to identify any existing infrastructure where this really fits in. I
> chose a directory/file based representation. We certainly could do the same

Would this be under /sys/devices/system/?
We could then create a qos/cat directory there. In the future, could other
directories be created alongside it, e.g. qos/mbm?

Thanks.

-Fenghua