Date: Wed, 4 Nov 2015 00:41:08 +0530
Subject: Re: RFC rdma cgroup
From: Parav Pandit
To: Haggai Eran
Cc: Tejun Heo, Doug Ledford, "Hefty, Sean", linux-rdma@vger.kernel.org,
    cgroups@vger.kernel.org, Liran Liss, linux-kernel@vger.kernel.org,
    lizefan@huawei.com, Johannes Weiner, Jonathan Corbet,
    james.l.morris@oracle.com, serge@hallyn.com, Or Gerlitz,
    Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org,
    linux-security-module@vger.kernel.org, Jason Gunthorpe
In-Reply-To: <56376889.2080908@mellanox.com>
References: <563233D7.90808@mellanox.com> <56376889.2080908@mellanox.com>

>> Resources are defined as an index and as a match_table_t.
>>
>> enum rdma_resource_type {
>>         RDMA_VERB_RESOURCE_UCTX,
>>         RDMA_VERB_RESOURCE_AH,
>>         RDMA_VERB_RESOURCE_PD,
>>         RDMA_VERB_RESOURCE_CQ,
>>         RDMA_VERB_RESOURCE_MR,
>>         RDMA_VERB_RESOURCE_MW,
>>         RDMA_VERB_RESOURCE_SRQ,
>>         RDMA_VERB_RESOURCE_QP,
>>         RDMA_VERB_RESOURCE_FLOW,
>>         RDMA_VERB_RESOURCE_MAX,
>> };
>>
>> So UAPI RDMA resources can evolve by just adding more entries here.

> Are the names that appear in userspace also controlled by uverbs? What
> about the vendor specific resources?

I am not sure I followed your question. Basically, any RDMA resource
that is allocated through the uverbs API can be tracked; uverbs makes
the call to charge/uncharge (a rough sketch of that call path appears
below).

There is a file, rdma.resources.verbs.list. It lists the verbs
resource names of all the devices which have registered themselves
with the rdma cgroup. Similarly there is rdma.resources.hw.list,
which lists all hw-specific resource names; those are defined at run
time and are potentially different for each vendor.

So it looks like below.

#cat rdma.resources.verbs.list
Output:
mlx4_0 uctx ah pd cq mr mw srq qp flow
mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq

#cat rdma.resources.hw.list
hfi1 hw_qp hw_mr sw_pd
(This particular one is a hypothetical example; I haven't actually
coded this, unlike uverbs, which is real.)

>>>> (c) When a process migrates from one cgroup to another, a
>>>> resource continues to be owned by the creator cgroup (rather, its
>>>> css). After the migration, whenever a new resource is created in
>>>> the new cgroup, it will be owned by the new cgroup.

>>> It sounds a little different from how other cgroups behave. I agree
>>> that mostly processes will create the resources in their cgroup and
>>> won't migrate, but why not move the charge during migration?

>> With fork(), a process doesn't really own the resource (unlike file
>> and socket descriptors).
>> The parent process might also have died.
>> There is possibly no clear way to transfer the resource to the right
>> child, and the child that the cgroup picks might not even want to
>> own RDMA resources.
>> RDMA resources might be allocated by one process and freed by
>> another process (though this might not be the typical usage).
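(As mentioned above, a rough sketch of the charge path, purely for
illustration: the rdma_cg_try_charge()/rdma_cg_uncharge() helpers, the
struct rdma_cgroup handle and the qp->cg field are made-up names for
this discussion, not the actual patch.)

#include <linux/err.h>
#include <rdma/ib_verbs.h>

/* Sketch: charge the QP against the caller's cgroup at creation time
 * and keep a reference to the owning css, so that the eventual destroy
 * is uncharged against the creator cgroup even if the process has
 * migrated to another cgroup in between.
 */
static struct ib_qp *create_qp_charged(struct ib_pd *pd,
                                       struct ib_qp_init_attr *attr)
{
        struct ib_device *dev = pd->device;
        struct rdma_cgroup *cg;         /* hypothetical css handle */
        struct ib_qp *qp;
        int ret;

        /* takes a reference on the current task's rdma css */
        ret = rdma_cg_try_charge(&cg, dev, RDMA_VERB_RESOURCE_QP);
        if (ret)
                return ERR_PTR(ret);    /* over the cgroup limit */

        qp = ib_create_qp(pd, attr);
        if (IS_ERR(qp)) {
                rdma_cg_uncharge(cg, dev, RDMA_VERB_RESOURCE_QP);
                return qp;
        }

        /* hypothetical field; the css reference is dropped only when
         * the QP is destroyed and uncharged */
        qp->cg = cg;
        return qp;
}

The destroy path does the reverse: ib_destroy_qp() followed by an
uncharge against qp->cg. Holding the css (not the cgroup) is what lets
both the cgroup directory and the process go away while the charge
remains accounted.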
>> It's pretty similar to other cgroups, with an exception in the
>> migration area; that exception comes from the different way RDMA
>> resources are owned, created and used.
>> Tejun's recent unified hierarchy patches equally highlight that
>> processes should not be migrated among cgroups frequently.
>>
>> So, in the current implementation (like the others): a process
>> created an RDMA resource and forked a child. Child and parent can
>> both allocate and free more resources. The child moved to a
>> different cgroup, but the resource is shared among them, and the
>> child can also free the resource. All crazy combinations are
>> possible in theory (without many use cases).
>> So at best, resources are charged to the css of the first cgroup in
>> which parent/child were created, and a reference is held to that
>> css. The cgroup and the process can die, but the css remains until
>> the RDMA resources are freed.
>> This is similar to process behavior, where the task struct is
>> released but the pid is held for a while.

> I guess there aren't a lot of options when the resources can belong to
> multiple cgroups. So after migrating, new resources will belong to the
> new cgroup or the old one?

A resource always belongs to the cgroup in which it was created,
regardless of process migration. Again, it is owned at the css level
rather than the cgroup level. Therefore the original cgroup can even
be deleted; an internal reference to its css is held, and the css is
freed once the last RDMA resource charged against it is freed.

> When I was talking about limiting to MAC/VLAN pairs I only meant
> limiting an RDMA device's ability to use that pair (e.g. use a GID
> that uses the specific MAC VLAN pair). I don't understand how that
> makes the RDMA cgroup any more generic than it is.

Oh, ok. That doesn't. I meant that I wanted to limit how many VLANs a
given container can create. We have only the high-level
capabilities(7) mechanism to enable such creation, but nothing to
limit the count.

>>> ... or only a subset of P_Keys and GIDs it has. Do you see such
>>> limitations also as part of this cgroup?

>> At present, no, because GID and P_Key resources are created from the
>> bottom up, either by the stack or by the network. They are kind of
>> not tied to user processes, unlike MAC, VLAN and QP, which are more
>> application driven or administratively driven.

> They are created from the network, after the network administrator
> configured them this way.

>> For applications that don't use RDMA-CM, query_device and query_port
>> will filter out the GID entries based on the network namespace in
>> which the calling process is running.

> This could work well for RoCE, as each entry in the GID table is
> associated with a net device and a network namespace. However, in
> InfiniBand, the GID table isn't directly related to the network
> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
> namespace by the set of IPoIB netdevs in the network namespace, but
> InfiniBand is designed to also work without IPoIB, so I don't think
> it's a good idea.

Got it. Yeah, this code can live under an "if (device type is RoCE)"
check.

> I think it would be better to allow each cgroup to limit the pkeys and
> gids its processes can use.

OK, so the use case is P_Keys? Then I believe the requirement would be
similar to the device cgroup: a set of GID table entries is configured
as whitelist entries, and when they are queried or used during
create_ah or modify_qp, they are compared against the whitelist (in
other words, an ACL). If they are found in the ACL, they are reported
in query_device or allowed in create_ah and modify_qp; if not, those
calls fail with an appropriate status. Does this look ok? A rough
sketch of what I have in mind follows.
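(Illustrative only: rdma_cg_gid_allowed() and the per-cgroup whitelist
it consults don't exist anywhere; this is just to make the ACL idea
concrete.)

#include <linux/err.h>
#include <linux/sched.h>
#include <rdma/ib_verbs.h>

/* Hypothetical: returns true when the rdma cgroup of @task whitelists
 * GID table entry @gid_index on @dev/@port.
 */
bool rdma_cg_gid_allowed(struct task_struct *task,
                         struct ib_device *dev,
                         u8 port, u8 gid_index);

/* Fail address handle creation when the source GID requested by the
 * caller is not in its cgroup's ACL.
 */
static struct ib_ah *create_ah_checked(struct ib_pd *pd,
                                       struct ib_ah_attr *attr)
{
        if ((attr->ah_flags & IB_AH_GRH) &&
            !rdma_cg_gid_allowed(current, pd->device, attr->port_num,
                                 attr->grh.sgid_index))
                return ERR_PTR(-EACCES);

        return ib_create_ah(pd, attr);
}

modify_qp would get the same check on the sgid_index of the new and
alternate paths, and the query paths would simply skip entries that
are not in the ACL.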
Can we address this requirement as an additional feature just after
the first patch? Tejun had some other ideas on this kind of
requirement, and I need to discuss them with him.

>> It was on my TODO list while we were working on the RoCEv2 and GID
>> movement changes, but I never got a chance to chase that fix.
>>
>> One of the ideas I was considering is to create a virtual RDMA
>> device mapped to the physical device, and to configure a GID count
>> limit via configfs for each such device.

> You could probably achieve what you want by creating a virtual RDMA
> device and use the device cgroup to limit access to it, but it sounds
> to me like an overkill.

Actually, not much. Basically this virtual RDMA device points to the
struct device of the physical device itself, so the only overhead is
linking this structure to the native device structure and passing most
of the calls through to the native ib_device, with a thin filter layer
in the control path. post_send/post_recv/poll_cq will go directly to
the native device, with the same performance. Roughly, the layering
would look like the sketch below.
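(A sketch only: struct vrdma_dev and the wrapper below are
hypothetical, written against the style where struct ib_device carries
the verbs function pointers directly.)

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

/* Hypothetical virtual-device wrapper around a native ib_device. */
struct vrdma_dev {
        struct ib_device ibdev;         /* the virtual device */
        struct ib_device *parent;       /* the native physical device */
        int gid_limit;                  /* configured via configfs */
};

/* Control-path filter: expose only the first gid_limit GID table
 * entries, forwarding allowed queries to the native device.
 */
static int vdev_query_gid(struct ib_device *ibdev, u8 port_num,
                          int index, union ib_gid *gid)
{
        struct vrdma_dev *v = container_of(ibdev, struct vrdma_dev,
                                           ibdev);

        if (index >= v->gid_limit)
                return -EINVAL;         /* hide entries over the limit */
        return v->parent->query_gid(v->parent, port_num, index, gid);
}

static void vrdma_setup(struct vrdma_dev *v)
{
        /* thin filter in the control path ... */
        v->ibdev.query_gid = vdev_query_gid;

        /* ... while the data path points straight at the native
         * driver, so post_send/post_recv/poll_cq keep native
         * performance. */
        v->ibdev.post_send = v->parent->post_send;
        v->ibdev.post_recv = v->parent->post_recv;
        v->ibdev.poll_cq   = v->parent->poll_cq;
}

Only the control-path entries that need filtering are overridden;
everything else stays the native driver's pointer, so there is no
per-operation cost in the hot path.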