Subject: Re: RFC rdma cgroup
From: Haggai Eran
To: Parav Pandit
CC: Tejun Heo, Doug Ledford, "Hefty, Sean", "linux-rdma@vger.kernel.org", "cgroups@vger.kernel.org", Liran Liss, "linux-kernel@vger.kernel.org", "lizefan@huawei.com", Johannes Weiner, Jonathan Corbet, "james.l.morris@oracle.com", "serge@hallyn.com", Or Gerlitz, Matan Barak, "raindel@mellanox.com", "akpm@linux-foundation.org", "linux-security-module@vger.kernel.org", Jason Gunthorpe
Date: Mon, 2 Nov 2015 15:43:37 +0200
Message-ID: <56376889.2080908@mellanox.com>
References: <563233D7.90808@mellanox.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 29/10/2015 20:46, Parav Pandit wrote:
> On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran wrote:
>> On 28/10/2015 10:29, Parav Pandit wrote:
>>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>>> subsystem can evolve without the need of rdma cgroup update. A new
>>> resource can be easily added by the RDMA/IB subsystem without touching
>>> rdma cgroup.
>> Resources exposed by the cgroup are basically a UAPI, so we have to be
>> careful to make it stable when it evolves. I understand the need for
>> vendor specific resources, following the discussion on the previous
>> proposal, but could you write on how you plan to allow this set of
>> resources to evolve?
>
> It's fairly simple.
> Here is the code snippet on how resources are defined in my tree.
> It doesn't have the RSS work queues yet, but can be added right after
> this patch.
>
> Resources are defined as index and as match_table_t.
>
> enum rdma_resource_type {
>         RDMA_VERB_RESOURCE_UCTX,
>         RDMA_VERB_RESOURCE_AH,
>         RDMA_VERB_RESOURCE_PD,
>         RDMA_VERB_RESOURCE_CQ,
>         RDMA_VERB_RESOURCE_MR,
>         RDMA_VERB_RESOURCE_MW,
>         RDMA_VERB_RESOURCE_SRQ,
>         RDMA_VERB_RESOURCE_QP,
>         RDMA_VERB_RESOURCE_FLOW,
>         RDMA_VERB_RESOURCE_MAX,
> };
> So UAPI RDMA resources can evolve by just adding more entries here.

Are the names that appear in userspace also controlled by uverbs? What
about the vendor specific resources?

>>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices.
>>> Therefore each cgroup will have 0 to 4 verbs resource pool and
>>> optionally 0 to 4 hw resource pool per such device.
>>> (Nothing stops to have more devices and pools, but design is around
>>> this use case).
>> In what way does the design depend on this assumption?
>
> Current code, when it performs resource charging/uncharging, needs to
> identify which resource pool to charge to.
> This resource pool is maintained as list_head and so it's a linear
> search per device.
> If we are thinking of 100s of RDMA devices per container, then linear
> search will not be a good way and a different data structure needs to
> be deployed.

Okay, sounds fine to me.

>>> (c) When process migrates from one to other cgroup, resource
>>> continues to be owned by the creator cgroup (rather css).
>>> After process migration, whenever new resource is created in new
>>> cgroup, it will be owned by new cgroup.
>> It sounds a little different from how other cgroups behave. I agree that
>> mostly processes will create the resources in their cgroup and won't
>> migrate, but why not move the charge during migration?
>>
> With fork() process doesn't really own the resource (unlike other file
> and socket descriptors).
> Parent process might have died also.
> There is possibly no clear way to transfer resource to right child.
> Child that cgroup picks might not even want to own RDMA resources.
> RDMA resources might be allocated by one process and freed by other
> process (though this might not be the way they use it).
> It's pretty similar to other cgroups with exception in migration area,
> such exception comes from different behavior of how RDMA resources are
> owned, created and used.
> Recent unified hierarchy patch from Tejun equally highlights to not
> frequently migrate processes among cgroups.
>
> So in current implementation, (like other),
> if process created a RDMA resource, forked a child.
> child and parent both can allocate and free more resources.
> child moved to different cgroup. But resource is shared among them.
> child can also free the resource. All crazy combinations are possible
> in theory (without much use cases).
> So at best they are charged to the first cgroup css in which
> parent/child are created and reference is held to CSS.
> cgroup, process can die, but css remains until RDMA resources are freed.
> This is similar to process behavior where task struct is released but
> id is held up for a while.

I guess there aren't a lot of options when the resources can belong to
multiple cgroups. So after migrating, will new resources belong to the
new cgroup or the old one?

>> I finally wanted to ask about other limitations an RDMA cgroup could
>> handle. It would be great to be able to limit a container to be allowed
>> to use only a subset of the MAC/VLAN pairs programmed to a device,
>
> Truly. I agree. That was one of the prime reasons I originally had it
> as part of the device cgroup.
> Where RDMA was just one category.
> But Tejun's opinion was to have rdma's own cgroup.
> Current internal data structure and interface between rdma cgroup and
> uverbs are tied to ib_device structure.
> which I think is easy to overcome by abstracting out as new
> resource_device which can be used beyond RDMA as well.
>
> However my bigger concern is interface to user land.
> We already have two use cases and I am inclined to make it a
> "device resource cgroup" instead of "rdma cgroup".
> I seek Tejun's input here.
> Initial implementation can expose rdma resources under device resource
> cgroup, as it evolves we can add other net resources such as mac, vlan
> as you described.

When I was talking about limiting to MAC/VLAN pairs I only meant
limiting an RDMA device's ability to use that pair (e.g. use a GID that
uses the specific MAC/VLAN pair). I don't understand how that makes the
RDMA cgroup any more generic than it is.

> or
>> only a subset of P_Keys and GIDs it has.
>> Do you see such limitations
>> also as part of this cgroup?
>>
> At present no. Because GID, P_key resources are created from the
> bottom up, either by stack or by network. They are kind of not tied to
> the user processes, unlike mac, vlan, qp which are more application
> driven or administrative driven.

They are created from the network, after the network administrator
configured them this way.

> For applications that don't use RDMA-CM, query_device and query_port
> will filter out the GID entries based on the network namespace in
> which caller process is running.

This could work well for RoCE, as each entry in the GID table is
associated with a net device and a network namespace. However, in
InfiniBand, the GID table isn't directly related to the network
namespace. As for the P_Keys, you could deduce the set of P_Keys of a
namespace from the set of IPoIB netdevs in the network namespace, but
InfiniBand is designed to also work without IPoIB, so I don't think it's
a good idea.

I think it would be better to allow each cgroup to limit the pkeys and
gids its processes can use.

> It was in my TODO list while we were working on RoCEv2 and GID
> movement changes but I never got chance to chase that fix.
>
> One of the ideas I was considering is: to create virtual RDMA device
> mapped to physical device.
> And configure GID count limit via configfs for each such device.

You could probably achieve what you want by creating a virtual RDMA
device and using the device cgroup to limit access to it, but it sounds
to me like overkill.

Regards,
Haggai
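
To make the charging rules discussed in this thread concrete, here is a
minimal user-space sketch in C. It is not taken from the patch set. It
only assumes what the thread states: each cgroup keeps one resource pool
per device, the pool is found by a linear search, and a resource stays
charged to the cgroup that created it even if the process later
migrates. The names struct rpool, struct rdma_cg, MAX_POOLS, find_pool,
add_pool, try_charge and uncharge are hypothetical, and a fixed array
stands in for the list_head mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Resource indices, copied from the enum quoted in this thread. */
enum rdma_resource_type {
	RDMA_VERB_RESOURCE_UCTX,
	RDMA_VERB_RESOURCE_AH,
	RDMA_VERB_RESOURCE_PD,
	RDMA_VERB_RESOURCE_CQ,
	RDMA_VERB_RESOURCE_MR,
	RDMA_VERB_RESOURCE_MW,
	RDMA_VERB_RESOURCE_SRQ,
	RDMA_VERB_RESOURCE_QP,
	RDMA_VERB_RESOURCE_FLOW,
	RDMA_VERB_RESOURCE_MAX,
};

/* Hypothetical bound, mirroring the "0 to 4 devices" use case. */
#define MAX_POOLS 4

/* Hypothetical per-device pool; an array replaces the kernel list_head. */
struct rpool {
	const void *device;	/* stands in for struct ib_device * */
	int usage[RDMA_VERB_RESOURCE_MAX];
	int limit[RDMA_VERB_RESOURCE_MAX];
};

/* Hypothetical stand-in for the cgroup's css. */
struct rdma_cg {
	struct rpool pools[MAX_POOLS];
	size_t npools;
};

/* Linear search per device, as described in the thread. */
struct rpool *find_pool(struct rdma_cg *cg, const void *device)
{
	for (size_t i = 0; i < cg->npools; i++)
		if (cg->pools[i].device == device)
			return &cg->pools[i];
	return NULL;
}

/* Add a pool for a device, with the same limit for every resource type. */
struct rpool *add_pool(struct rdma_cg *cg, const void *device, int limit)
{
	if (cg->npools == MAX_POOLS)
		return NULL;
	struct rpool *p = &cg->pools[cg->npools++];
	memset(p, 0, sizeof(*p));
	p->device = device;
	for (int i = 0; i < RDMA_VERB_RESOURCE_MAX; i++)
		p->limit[i] = limit;
	return p;
}

/*
 * Charge one resource to cg. Returns the cgroup that was charged so the
 * caller can store it as the resource's owner, or NULL if there is no
 * pool for the device or the limit is reached.
 */
struct rdma_cg *try_charge(struct rdma_cg *cg, const void *device,
			   enum rdma_resource_type type)
{
	assert((int)type >= 0 && (int)type < RDMA_VERB_RESOURCE_MAX);
	struct rpool *p = find_pool(cg, device);
	if (!p || p->usage[type] >= p->limit[type])
		return NULL;
	p->usage[type]++;
	return cg;
}

/* Uncharge against the stored owner, not the caller's current cgroup. */
void uncharge(struct rdma_cg *owner, const void *device,
	      enum rdma_resource_type type)
{
	struct rpool *p = find_pool(owner, device);
	if (p && p->usage[type] > 0)
		p->usage[type]--;
}
```

In this sketch a process that migrates simply starts passing its new
cgroup to try_charge, while resources created earlier are released with
uncharge against the owner returned at creation time. That matches
point (c) of the quoted proposal; whether the final implementation
behaves this way is exactly the question asked above.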