Date: Fri, 21 Aug 2015 23:28:20 -0300
From: Marcelo Tosatti
To: Vikas Shivappa
Cc: Matt Fleming, Tejun Heo, Vikas Shivappa, linux-kernel@vger.kernel.org,
	x86@kernel.org, hpa@zytor.com, tglx@linutronix.de, mingo@kernel.org,
	peterz@infradead.org, matt.fleming@intel.com, will.auld@intel.com,
	glenn.p.williamson@intel.com, kanaka.d.juvva@intel.com, Karen Noel
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management
Message-ID: <20150822022819.GA743@amt.cnet>
References: <1435789270-27010-6-git-send-email-vikas.shivappa@linux.intel.com>
	<20150730194458.GD3504@mtj.duckdns.org>
	<20150802163157.GB32599@mtj.duckdns.org>
	<20150805122257.GD4332@codeblueprint.co.uk>
	<20150806002404.GA24422@amt.cnet>
	<20150807131506.GA6649@amt.cnet>
	<20150818002050.GA3744@amt.cnet>

On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
>
> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>
> >Vikas, Tejun,
> >
> >This is an updated interface. It addresses all comments made
> >so far and also covers all the use cases the cgroup interface
> >covers.
> >
> >Let me know what you think. I'll proceed to writing
> >the test applications.
> >
> >Usage model:
> >------------
> >
> >This document details how CAT technology is
> >exposed to userspace.
> >
> >Each task has a list of task cache reservation entries (TCRE list).
> >
> >The init process is created with an empty TCRE list.
> >
> >There is a system-wide unique ID space; each TCRE is assigned
> >an ID from this space. IDs can be reused (but no two TCREs
> >have the same ID at one time).
> >
> >The interface accommodates transient and independent cache allocation
> >adjustments from applications, as well as static cache partitioning
> >schemes.
> >
> >Allocation:
> >Usage of the system calls requires the CAP_SYS_CACHE_RESERVATION
> >capability.
> >
> >A configurable percentage is reserved for tasks with an empty TCRE list.

Hi Vikas,

> And how do you think you will do this without a system controlled
> mechanism? Every time in your proposal you include these caveats,
> which actually mean including a system controlled interface in the
> background, and your interfaces below make no mention of this really!
> Why do we want to confuse ourselves like this?
>
> A syscall-only interface does not seem to work on its own for the
> cache allocation scenario. It can only be a nice-to-have interface
> on top of a system controlled mechanism like the cgroup interface.
> Sure, you can do all the things you did with cgroups with the
> syscall interface, but the point is: what are the use cases that
> can't be done with this syscall-only interface? (ex: to deal with
> cases you brought up earlier, like when an app does cache intensive
> work for some time and later changes - it could use the syscall
> interface to quickly relinquish the cache lines or change a clos
> associated with it)

All use cases can be covered with the syscall interface.
* How to convert from the cgroups interface to the syscall interface:

Cgroups: partition the cache in cgroups, add tasks to cgroups.
Syscalls: partition the cache in TCREs, add TCREs to tasks.

You build the same structure (task <--> CBM) either via syscalls or via
cgroups. Please be more specific; I can't really see any problem.

> I have repeatedly listed the use cases that can be dealt with by
> this interface. How will you address cases like 1.1 and 1.2 with
> your syscall-only interface?

Case 1.1:
---------

1.1> Exclusive access: A task cannot give *itself* exclusive access to
the cache. For this it needs visibility of the cache allocation of
other tasks and may need to reclaim or override other tasks' cache
allocations, which is not feasible (isn't that the job of a system
managing agent?).

Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can create
cache allocations and remove cache allocations from other applications.
So only the administrator could do it.

Case 1.2 is answered below.

> So we expect all the millions of apps like SAP, Oracle, etc. and all
> the millions of app developers to magically learn our new syscall
> interface and also cooperate between themselves to decide a cache
> allocation that is agreeable to all? (which, btw, the interface below
> does not explain how to do) and

They don't have to: the administrator can use the "cacheset"
application. If an application wants to control the cache, it can.

> then by some godly powers the noisy neighbour will decide himself
> to give up the cache?

I suppose you imagine something like this:

http://arxiv.org/pdf/1410.6513.pdf

No, the syscall interface does not need to care about that because:

* If you can set cache (CAP_SYS_CACHE_RESERVATION capability), you can
  remove cache reservations from your neighbours. So this problem does
  not exist (it assumes participants are cooperative).

There is one confusion in the argument for cases 1.1 and 1.2: that
applications are supposed to include the status of the system as a
whole in their decision about cache allocation size. This is a flawed
argument. Please point out specifically if this is not the case, or if
there is another case still not covered.

It would be possible to partition the cache into watermarks such as:

task group A - can reserve up to 20% of cache.
task group B - can reserve up to 25% of cache.
task group C - can reserve 50% of cache.

But I am not sure... Tejun, do you think that is necessary?
(CAP_SYS_CACHE_RESERVATION is good enough for our use cases).

> (that would be the first ever app in the world to not request more
> resources for itself and hurt its own performance - they surely
> don't want to do social service!)
>
> And how do we do case 1.5, where the administrator wants to assign
> cache to specific VMs in a cloud, etc.? With the hypothetical syscall
> interface we now expect all the apps to do the above, and now they
> also need to know where they run (what VM, what socket, etc.) and
> then decide and cooperate on an allocation. Compare this to a
> container environment like Rancher, where today the admin can
> conveniently use Docker underneath to allocate mem/storage/compute to
> containers and easily extend this to include shared L3.
>
> http://marc.info/?l=linux-kernel&m=143889397419199
>
> Without addressing the above, the details of the interface below are
> irrelevant -

You are missing the point: there is supposed to be a "cacheset" program
which will allow the admin to set up TCREs and assign them to tasks.
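Roughly, I picture "cacheset" working along the lines of the sketch
below. This is only an illustration of the intended flow, not an
implementation: the syscall numbers are invented placeholders and the
struct/enum are copied from the proposal further down, since none of
this exists in any tree yet.

/* cacheset.c - sketch only. The syscall numbers are placeholders and
 * the struct/enum definitions mirror the proposal quoted below;
 * nothing here is merged anywhere. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

#define __NR_create_cache_reservation	400	/* placeholder */
#define __NR_attach_cache_reservation	402	/* placeholder */

enum cache_rsvt_type {
	CACHE_RSVT_TYPE_CODE = 0,	/* reservation is for code */
	CACHE_RSVT_TYPE_DATA,		/* reservation is for data */
	CACHE_RSVT_TYPE_BOTH,		/* reservation is for code and data */
};

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

int main(int argc, char **argv)
{
	struct cache_reservation r;
	pid_t pid;

	if (argc != 3) {
		fprintf(stderr, "usage: cacheset <kbytes> <pid>\n");
		return 1;
	}

	memset(&r, 0, sizeof(r));
	r.kbytes = strtoul(argv[1], NULL, 0);
	r.type = CACHE_RSVT_TYPE_BOTH;
	pid = (pid_t)atoi(argv[2]);

	/* Create the TCRE (requires CAP_SYS_CACHE_RESERVATION); the
	 * kernel reports back the actual kbytes reserved and the tcrid. */
	if (syscall(__NR_create_cache_reservation, &r) < 0) {
		perror("create_cache_reservation");
		return 1;
	}

	/* Attach the reservation to the target task. */
	if (syscall(__NR_attach_cache_reservation, pid, r.tcrid) < 0) {
		perror("attach_cache_reservation");
		return 1;
	}

	printf("tcrid %d: %lu kbytes attached to pid %d\n",
	       r.tcrid, r.kbytes, (int)pid);
	return 0;
}

The admin (or the scripts which start the application) would then run
something like "cacheset 2048 <pid>", which builds the same
task <--> CBM mapping the cgroup interface builds, without creating
any cgroup hierarchy.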
> Your initial request was to extend the cgroup interface to include
> rounding off the size of the cache (which can easily be done with a
> bash script on top of the cgroup interface!) and now you are
> proposing a syscall-only interface? This is very confusing and will
> only unnecessarily delay the process without adding any value.

I suppose you are assuming that it is necessary for applications to set
their own cache. This assumption is not correct. Take a look at
Tuna / sched_getaffinity:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html

> However, like I mentioned, the syscall interface or the user/app
> being able to modify the cache allocation could be used to address
> some very specific use cases on top of an existing system managed
> interface. This is not really a common case in cloud or container
> environments, nor is it a feasibly deployable solution.
> Just consider the millions of apps that have to transition to such
> an interface to even use it - if that's the only way to do it, it's
> dead on arrival.

Applications should not rely on interfaces that are not upstream.
Is there an explicit request or comment from users about their
difficulty regarding a change in the interface?

> Also please do not include the kernel automatically adjusting
> resources in your reply, as that is totally irrelevant and again
> more confusing, as we have already exchanged some >100 emails on
> this same patch version without meaning anything so far.
>
> The debate is purely between a syscall-only interface and a system
> manageable interface (like cgroups, where the admin or a central
> entity controls the resources). If not, define what it is first
> before going into details.

See the Tuna / taskset page. The administrator could, for example, use
"cacheset" from within the scripts which initialize the applications.
Then, having control over those scripts, he can view them as a
"unified system control interface".

Problems with the cgroup interface:

1) The global IPI on every CBM <--> task change does not scale.

2) The syscall interface specification is in kbytes, not cache ways
(which is what must be recorded by the OS to allow migration of the OS
between different hardware systems).

3) Compilers are able to configure the cache optimally for given
ranges of code inside applications, easily, if desired.

4) It does not allow proper usage of shared caches between
applications. Think of the following scenario:

* AppA has threads which are created/destroyed, but once initialized,
  want cache reservation.
* How is AppA going to coordinate with the cgroups system to
  initialize/shutdown cgroups?

I started writing the syscall interface on top of your latest patchset
yesterday (it should be relatively easy, given that most of the
low-level code is already there).

Any news on the data/code separation?

> Thanks,
> Vikas
>
> >On fork, the child inherits the TCR list from its parent.
> >
> >Semantics:
> >Once a TCRE is created and assigned to a task, that task has a
> >guaranteed reservation on any CPU where it is scheduled,
> >for the lifetime of the TCRE.
> >
> >A task can have its TCR list modified without notification.
> >
> >FIXME: Add a per-task flag to not copy the TCR list of a task but
> >delete all TCRs on fork.
> >
> >Interface:
> >
> >enum cache_rsvt_flags {
> >	CACHE_RSVT_ROUND_DOWN = (1 << 0),	/* round "kbytes" down */
> >};
> >
> >enum cache_rsvt_type {
> >	CACHE_RSVT_TYPE_CODE = 0,	/* cache reservation is for code */
> >	CACHE_RSVT_TYPE_DATA,		/* cache reservation is for data */
> >	CACHE_RSVT_TYPE_BOTH,		/* cache reservation is for code and data */
> >};
> >
> >struct cache_reservation {
> >	unsigned long kbytes;
> >	int type;
> >	int flags;
> >	int tcrid;
> >};
> >
> >The following syscalls modify the TCR list of a task:
> >
> >* int sys_create_cache_reservation(struct cache_reservation *rsvt);
> >DESCRIPTION: Creates a cache reservation entry and assigns
> >it to the current task.
> >
> >Returns -ENOMEM if there is not enough space, -EPERM if no permission.
> >Returns 0 if the reservation has been successful, copying the actual
> >number of kbytes reserved to "kbytes", the type to "type", and the
> >tcrid.
> >
> >* int sys_delete_cache_reservation(struct cache_reservation *rsvt);
> >DESCRIPTION: Deletes a cache reservation entry, deassigning it
> >from any task.
> >
> >Backward compatibility for processors with no support for code/data
> >differentiation: by default, code and data cache allocation types
> >fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return
> >the information that they have done so via "flags").
> >
> >* int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
> >DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
> >the task identified by "pid".
> >Returns 0 if successful.
> >
> >* int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
> >DESCRIPTION: Detaches the cache reservation identified by "tcrid"
> >from the task identified by "pid".
> >
> >The following syscalls list the TCRs:
> >
> >* int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
> >DESCRIPTION: Returns all cache reservations in the system.
> >"size" should be set to the maximum number of items that can be
> >stored in the buffer pointed to by "list".
> >
> >* int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
> >DESCRIPTION: Returns which pids are associated with "tcrid".
> >
> >* sys_get_pid_cache_reservations(pid_t pid, size_t size,
> >				 struct cache_reservation list[]);
> >DESCRIPTION: Returns all cache reservations associated with "pid".
> >"size" should be set to the maximum number of items that can be
> >stored in the buffer pointed to by "list".
> >
> >* sys_get_cache_reservation_info()
> >DESCRIPTION: ioctl to retrieve hardware info: cache round size, and
> >whether code/data separation is supported.
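To make the AppA scenario from above concrete, here is roughly how a
thread could take a reservation around its cache-hot phase and drop it
afterwards, with no cgroup setup involved. Again, only a sketch: the
syscall numbers are placeholders and the wrappers are hypothetical
until the patches exist.

/* Sketch of a worker reserving cache for its hot phase and releasing
 * it when done. Syscall numbers are placeholders; the struct and enum
 * mirror the proposal quoted above. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_create_cache_reservation	400	/* placeholder */
#define __NR_delete_cache_reservation	401	/* placeholder */

enum cache_rsvt_type {
	CACHE_RSVT_TYPE_CODE = 0,
	CACHE_RSVT_TYPE_DATA,
	CACHE_RSVT_TYPE_BOTH,
};

struct cache_reservation {
	unsigned long kbytes;
	int type;	/* CACHE_RSVT_TYPE_* */
	int flags;	/* CACHE_RSVT_ROUND_DOWN, ... */
	int tcrid;	/* filled in by the kernel */
};

static void cache_hot_work(void)
{
	/* ... the part of the workload that benefits from the reservation ... */
}

int main(void)
{
	struct cache_reservation r;

	memset(&r, 0, sizeof(r));
	r.kbytes = 2048;	/* request 2MB; kernel rounds and reports back */
	r.type = CACHE_RSVT_TYPE_BOTH;

	/* Created and assigned to the calling task; lives until deleted. */
	if (syscall(__NR_create_cache_reservation, &r) < 0) {
		perror("create_cache_reservation");
		return 1;
	}

	cache_hot_work();

	/* Relinquish the cache lines once the hot phase is over. */
	if (syscall(__NR_delete_cache_reservation, &r) < 0)
		perror("delete_cache_reservation");

	return 0;
}

Threads that come and go can each create and delete their own TCRE
this way (or rely on the fork semantics above); there is no cgroup
directory to create or tear down, and no coordination needed beyond
CAP_SYS_CACHE_RESERVATION.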