2015-11-02 13:43:53

by Haggai Eran

Subject: Re: RFC rdma cgroup

On 29/10/2015 20:46, Parav Pandit wrote:
> On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran <[email protected]> wrote:
>> On 28/10/2015 10:29, Parav Pandit wrote:
>>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>>> subsystem can evolve without the need of rdma cgroup update. A new
>>> resource can be easily added by the RDMA/IB subsystem without touching
>>> rdma cgroup.
>> Resources exposed by the cgroup are basically a UAPI, so we have to be
>> careful to make it stable when it evolves. I understand the need for
>> vendor specific resources, following the discussion on the previous
>> proposal, but could you write on how you plan to allow these set of
>> resources to evolve?
>
> Its fairly simple.
> Here is the code snippet on how resources are defined in my tree.
> It doesn't have the RSS work queues yet, but can be added right after
> this patch.
>
> Resource are defined as index and as match_table_t.
>
> enum rdma_resource_type {
> RDMA_VERB_RESOURCE_UCTX,
> RDMA_VERB_RESOURCE_AH,
> RDMA_VERB_RESOURCE_PD,
> RDMA_VERB_RESOURCE_CQ,
> RDMA_VERB_RESOURCE_MR,
> RDMA_VERB_RESOURCE_MW,
> RDMA_VERB_RESOURCE_SRQ,
> RDMA_VERB_RESOURCE_QP,
> RDMA_VERB_RESOURCE_FLOW,
> RDMA_VERB_RESOURCE_MAX,
> };
> So UAPI RDMA resources can evolve by just adding more entries here.
Are the names that appear in userspace also controlled by uverbs? What
about the vendor specific resources?

>>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>>> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
>>> hw resource pool per such device.
>>> (Nothing stops to have more devices and pools, but design is around
>>> this use case).
>> In what way does the design depend on this assumption?
>
> Current code when performs resource charging/uncharging, it needs to
> identify the resource pool which one to charge to.
> This resource pool is maintained as list_head and so its linear search
> per device.
> If we are thinking of 100 of RDMA devices per container, than liner
> search will not be good way and different data structure needs to be
> deployed.
Okay, sounds fine to me.

>>> (c) When process migrate from one to other cgroup, resource is
>>> continue to be owned by the creator cgroup (rather css).
>>> After process migration, whenever new resource is created in new
>>> cgroup, it will be owned by new cgroup.
>> It sounds a little different from how other cgroups behave. I agree that
>> mostly processes will create the resources in their cgroup and won't
>> migrate, but why not move the charge during migration?
>>
> With fork() process doesn't really own the resource (unlike other file
> and socket descriptors).
> Parent process might have died also.
> There is possibly no clear way to transfer resource to right child.
> Child that cgroup picks might not even want to own RDMA resources.
> RDMA resources might be allocated by one process and freed by other
> process (though this might not be the way they use it).
> Its pretty similar to other cgroups with exception in migration area,
> such exception comes from different behavior of how RDMA resources are
> owned, created and used.
> Recent unified hierarchy patch from Tejun equally highlights to not
> frequently migrate processes among cgroups.
>
> So in current implementation, (like other),
> if process created a RDMA resource, forked a child.
> child and parent both can allocate and free more resources.
> child moved to different cgroup. But resource is shared among them.
> child can free also the resource. All crazy combinations are possible
> in theory (without much use cases).
> So at best they are charged to the first cgroup css in which
> parent/child are created and reference is hold to CSS.
> cgroup, process can die, cut css remains until RDMA resources are freed.
> This is similar to process behavior where task struct is release but
> id is hold up for a while.

I guess there aren't a lot of options when the resources can belong to
multiple cgroups. So after migrating, new resources will belong to the
new cgroup or the old one?

>> I finally wanted to ask about other limitations an RDMA cgroup could
>> handle. It would be great to be able to limit a container to be allowed
>> to use only a subset of the MAC/VLAN pairs programmed to a device,
>
> Truly. I agree. That was one of the prime reason I originally has it
> as part of the device cgroup.
> Where RDMA was just one category.
> But Tejun's opinion was to have rdma's own cgroup.
> Current internal data structure and interface between rdma cgroup and
> uverbs are tied to ib_device structure.
> which I think easy to overcome by abstracting out as new
> resource_device which can be used beyond RDMA as well.
>
> However my bigger concern is interface to user land.
> We already have two use cases and I am inclined to make it as as
> "device resource cgroup" instead of "rdma cgroup".
> I seek Tejun's input here.
> Initial implementation can expose rdma resources under device resource
> cgroup, as it evolves we can add other net resources such as mac, vlan
> as you described.

When I was talking about limiting to MAC/VLAN pairs I only meant
limiting an RDMA device's ability to use that pair (e.g. use a GID that
uses the specific MAC VLAN pair). I don't understand how that makes the
RDMA cgroup any more generic than it is.

> or
>> only a subset of P_Keys and GIDs it has. Do you see such limitations
>> also as part of this cgroup?
>>
> At present no. Because GID, P_key resources are created from the
> bottom up, either by stack or by network. They are kind of not tied to
> the user processes, unlike mac, vlan, qp which are more application
> driven or administrative driven.
They are created from the network, after the network administrator
configured them this way.

> For applications that doesn't use RDMA-CM, query_device and query_port
> will filter out the GID entries based on the network namespace in
> which caller process is running.
This could work well for RoCE, as each entry in the GID table is
associated with a net device and a network namespace. However, in
InfiniBand, the GID table isn't directly related to the network
namespace. As for the P_Keys, you could deduce the set of P_Keys of a
namespace by the set of IPoIB netdevs in the network namespace, but
InfiniBand is designed to also work without IPoIB, so I don't think it's
a good idea.
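
(For reference, and from memory so the exact layout may differ: IIRC each
entry in the new GID table code carries an attribute roughly like the one
below, and the namespace can be derived from the entry's net device.)

#include <linux/netdevice.h>

/* Rough per-entry attribute in the RoCE GID table code (from memory). */
struct ib_gid_attr {
	struct net_device *ndev;	/* dev_net(ndev) gives the namespace */
};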

I think it would be better to allow each cgroup to limit the pkeys and
gids its processes can use.

> It was in my TODO list while we were working on RoCEv2 and GID
> movement changes but I never got chance to chase that fix.
>
> One of the idea I was considering is: to create virtual RDMA device
> mapped to physical device.
> And configure GID count limit via configfs for each such device.
You could probably achieve what you want by creating a virtual RDMA
device and use the device cgroup to limit access to it, but it sounds to
me like an overkill.

Regards,
Haggai


2015-11-03 19:11:13

by Parav Pandit

Subject: Re: RFC rdma cgroup

>> Resource are defined as index and as match_table_t.
>>
>> enum rdma_resource_type {
>> RDMA_VERB_RESOURCE_UCTX,
>> RDMA_VERB_RESOURCE_AH,
>> RDMA_VERB_RESOURCE_PD,
>> RDMA_VERB_RESOURCE_CQ,
>> RDMA_VERB_RESOURCE_MR,
>> RDMA_VERB_RESOURCE_MW,
>> RDMA_VERB_RESOURCE_SRQ,
>> RDMA_VERB_RESOURCE_QP,
>> RDMA_VERB_RESOURCE_FLOW,
>> RDMA_VERB_RESOURCE_MAX,
>> };
>> So UAPI RDMA resources can evolve by just adding more entries here.
> Are the names that appear in userspace also controlled by uverbs? What
> about the vendor specific resources?

I am not sure I followed your question.
Basically, any RDMA resource that is allocated through the uverbs API can
be tracked; uverbs makes the call to charge/uncharge.
There is a file, rdma.resources.verbs.list, which lists the verbs
resource names of all the devices that have registered themselves with
the rdma cgroup.
Similarly there is rdma.resources.hw.list, which lists all the hw-specific
resource names; these are defined at run time and are potentially
different for each vendor.

So it looks like below,
#cat rdma.resources.verbs.list
Output:
mlx4_0 uctx ah pd cq mr mw srq qp flow
mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq

#cat rdma.resources.hw.list
hfi1 hw_qp hw_mr sw_pd
(This particular one is a hypothetical example; I haven't actually coded
this, unlike the uverbs list, which is real.)
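
For illustration, a minimal sketch of how the verbs names tie back to the
enum quoted above (simplified, not the exact code from my tree):

#include <linux/parser.h>

/*
 * Maps each rdma_resource_type index to the token that shows up in
 * rdma.resources.verbs.list.
 */
static match_table_t rdma_verb_resource_table = {
	{RDMA_VERB_RESOURCE_UCTX,	"uctx"},
	{RDMA_VERB_RESOURCE_AH,		"ah"},
	{RDMA_VERB_RESOURCE_PD,		"pd"},
	{RDMA_VERB_RESOURCE_CQ,		"cq"},
	{RDMA_VERB_RESOURCE_MR,		"mr"},
	{RDMA_VERB_RESOURCE_MW,		"mw"},
	{RDMA_VERB_RESOURCE_SRQ,	"srq"},
	{RDMA_VERB_RESOURCE_QP,		"qp"},
	{RDMA_VERB_RESOURCE_FLOW,	"flow"},
	{RDMA_VERB_RESOURCE_MAX,	NULL},
};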

>>>> (c) When process migrate from one to other cgroup, resource is
>>>> continue to be owned by the creator cgroup (rather css).
>>>> After process migration, whenever new resource is created in new
>>>> cgroup, it will be owned by new cgroup.
>>> It sounds a little different from how other cgroups behave. I agree that
>>> mostly processes will create the resources in their cgroup and won't
>>> migrate, but why not move the charge during migration?
>>>
>> With fork() process doesn't really own the resource (unlike other file
>> and socket descriptors).
>> Parent process might have died also.
>> There is possibly no clear way to transfer resource to right child.
>> Child that cgroup picks might not even want to own RDMA resources.
>> RDMA resources might be allocated by one process and freed by other
>> process (though this might not be the way they use it).
>> Its pretty similar to other cgroups with exception in migration area,
>> such exception comes from different behavior of how RDMA resources are
>> owned, created and used.
>> Recent unified hierarchy patch from Tejun equally highlights to not
>> frequently migrate processes among cgroups.
>>
>> So in current implementation, (like other),
>> if process created a RDMA resource, forked a child.
>> child and parent both can allocate and free more resources.
>> child moved to different cgroup. But resource is shared among them.
>> child can free also the resource. All crazy combinations are possible
>> in theory (without much use cases).
>> So at best they are charged to the first cgroup css in which
>> parent/child are created and reference is hold to CSS.
>> cgroup, process can die, cut css remains until RDMA resources are freed.
>> This is similar to process behavior where task struct is release but
>> id is hold up for a while.
>
> I guess there aren't a lot of options when the resources can belong to
> multiple cgroups. So after migrating, new resources will belong to the
> new cgroup or the old one?
A resource always belongs to the cgroup in which it was created,
regardless of process migration.
Again, it is owned at the css level rather than the cgroup. Therefore the
original cgroup can also be deleted; the internal reference to its data
structure remains, and it is freed when the last rdma resource is freed.
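
To illustrate the css reference part, here is a rough sketch (much simpler
than the actual patch; no locking, hierarchy or limit checks shown, and
the structure names are made up):

#include <linux/cgroup.h>

/* Sketch: per-device pool of verbs resource usage, owned by one css. */
struct rdmacg_resource_pool {
	struct cgroup_subsys_state *owner_css;	/* css that created the resource */
	int usage[RDMA_VERB_RESOURCE_MAX];	/* indices from the enum above */
};

static void rdmacg_charge(struct rdmacg_resource_pool *pool, int type)
{
	css_get(pool->owner_css);	/* pin the css beyond cgroup removal */
	pool->usage[type]++;
}

static void rdmacg_uncharge(struct rdmacg_resource_pool *pool, int type)
{
	pool->usage[type]--;
	css_put(pool->owner_css);	/* css can go away after the last put */
}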

>
> When I was talking about limiting to MAC/VLAN pairs I only meant
> limiting an RDMA device's ability to use that pair (e.g. use a GID that
> uses the specific MAC VLAN pair). I don't understand how that makes the
> RDMA cgroup any more generic than it is.
>
Oh ok. That doesn't. I meant that I wanted to limit how many vlans a
given container can create.
We only have the high-level capabilities (7) to enable such creation,
but nothing to limit the count.

>> or
>>> only a subset of P_Keys and GIDs it has. Do you see such limitations
>>> also as part of this cgroup?
>>>
>> At present no. Because GID, P_key resources are created from the
>> bottom up, either by stack or by network. They are kind of not tied to
>> the user processes, unlike mac, vlan, qp which are more application
>> driven or administrative driven.
> They are created from the network, after the network administrator
> configured them this way.
>
>> For applications that doesn't use RDMA-CM, query_device and query_port
>> will filter out the GID entries based on the network namespace in
>> which caller process is running.
> This could work well for RoCE, as each entry in the GID table is
> associated with a net device and a network namespace. However, in
> InfiniBand, the GID table isn't directly related to the network
> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
> namespace by the set of IPoIB netdevs in the network namespace, but
> InfiniBand is designed to also work without IPoIB, so I don't think it's
> a good idea.
Got it. Yeah, this code can be under if(device_type RoCE).

>
> I think it would be better to allow each cgroup to limit the pkeys and
> gids its processes can use.

O.k. So the use case is P_Key? I believe the requirement would be similar
to the device cgroup: a set of GID table entries is configured as white
list entries, and when they are queried or used during create_ah or
modify_qp, they are compared against the white list (in other words, an
ACL). If they are found in the ACL, they are reported in query_device or
accepted in create_ah, modify_qp; if not, those calls fail with an
appropriate status.
Does this look ok? Can we address this requirement as an additional
feature just after the first patch?
Tejun had some other idea on this kind of requirement, and I need to
discuss with him.
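
Something along these lines, purely as a sketch of the idea (the structure
and helper names below are made up for illustration):

#include <linux/bitops.h>
#include <linux/types.h>

/* Hypothetical per-cgroup, per-port white list of allowed GID indices. */
struct rdmacg_gid_acl {
	unsigned long allowed[4];	/* one bit per GID table index */
};

static bool rdmacg_gid_allowed(const struct rdmacg_gid_acl *acl, u32 gid_index)
{
	return test_bit(gid_index, acl->allowed);
}

create_ah/modify_qp would then fail (say with -EACCES) when the requested
GID index is not allowed, and the query calls would simply skip such
entries.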

>
>> It was in my TODO list while we were working on RoCEv2 and GID
>> movement changes but I never got chance to chase that fix.
>>
>> One of the idea I was considering is: to create virtual RDMA device
>> mapped to physical device.
>> And configure GID count limit via configfs for each such device.
> You could probably achieve what you want by creating a virtual RDMA
> device and use the device cgroup to limit access to it, but it sounds to
> me like an overkill.

Actually not much. Basically this virtual RDMA device points to the
struct device of the physical device itself.
So the only overhead is linking this structure to the native device
structure and passing most of the calls to the native ib_device through a
thin filter layer in the control path.
post_send/recv/poll_cq will go directly to the native device, with the
same performance.
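
Roughly like this, as a sketch only (the real ops wiring, refcounting and
error handling are omitted, and the callback signature is simplified):

#include <rdma/ib_verbs.h>

/* Sketch: a virtual ib_device that forwards to its physical parent. */
struct virt_rdma_dev {
	struct ib_device ibdev;		/* the device applications see */
	struct ib_device *parent;	/* native hardware device */
	int gid_limit;			/* configured via configfs */
};

/* Control-path filter: ask the parent, then clamp what we expose. */
static int virt_query_port(struct virt_rdma_dev *vdev, u8 port_num,
			   struct ib_port_attr *attr)
{
	int ret = vdev->parent->query_port(vdev->parent, port_num, attr);

	if (!ret && attr->gid_tbl_len > vdev->gid_limit)
		attr->gid_tbl_len = vdev->gid_limit;
	return ret;
}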


>
> Regards,
> Haggai

2015-11-04 11:59:53

by Haggai Eran

Subject: Re: RFC rdma cgroup

On 03/11/2015 21:11, Parav Pandit wrote:
> So it looks like below,
> #cat rdma.resources.verbs.list
> Output:
> mlx4_0 uctx ah pd cq mr mw srq qp flow
> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
What happens if you set a limit of rss_wq to mlx4_0 in this example?
Would it fail? I think it would be simpler for administrators if they
can configure every resource supported by uverbs. If a resource is not
supported by a specific device, you can never go over the limit anyway.

> #cat rdma.resources.hw.list
> hfi1 hw_qp hw_mr sw_pd
> (This particular one is a hypothetical example; I haven't actually coded
> this, unlike the uverbs list, which is real.)
Sounds fine to me. We will need to be careful to make sure that driver
maintainers don't break backward compatibility with this interface.

>> I guess there aren't a lot of options when the resources can belong to
>> multiple cgroups. So after migrating, new resources will belong to the
>> new cgroup or the old one?
> A resource always belongs to the cgroup in which it was created,
> regardless of process migration.
> Again, it is owned at the css level rather than the cgroup. Therefore the
> original cgroup can also be deleted; the internal reference to its data
> structure remains, and it is freed when the last rdma resource is freed.
Okay.

>>> For applications that doesn't use RDMA-CM, query_device and query_port
>>> will filter out the GID entries based on the network namespace in
>>> which caller process is running.
>> This could work well for RoCE, as each entry in the GID table is
>> associated with a net device and a network namespace. However, in
>> InfiniBand, the GID table isn't directly related to the network
>> namespace. As for the P_Keys, you could deduce the set of P_Keys of a
>> namespace by the set of IPoIB netdevs in the network namespace, but
>> InfiniBand is designed to also work without IPoIB, so I don't think it's
>> a good idea.
> Got it. Yeah, this code can be under if(device_type RoCE).
IIRC there's a core capability for the new GID table code that contains
namespace, so you can use that.

>> I think it would be better to allow each cgroup to limit the pkeys and
>> gids its processes can use.
>
> O.k. So the use case is P_Key? I believe the requirement would be similar
> to the device cgroup: a set of GID table entries is configured as white
> list entries, and when they are queried or used during create_ah or
> modify_qp, they are compared against the white list (in other words, an
> ACL). If they are found in the ACL, they are reported in query_device or
> accepted in create_ah, modify_qp; if not, those calls fail with an
> appropriate status.
> Does this look ok?
Yes, that sounds good to me.

> Can we address this requirement as an additional feature just after the first patch?
> Tejun had some other idea on this kind of requirement, and I need to
> discuss with him.
Of course. I think there's use for the RDMA cgroup even without a pkey
or GID ACL, just to make sure one application doesn't hog hardware
resources.

>>> One of the idea I was considering is: to create virtual RDMA device
>>> mapped to physical device.
>>> And configure GID count limit via configfs for each such device.
>> You could probably achieve what you want by creating a virtual RDMA
>> device and use the device cgroup to limit access to it, but it sounds to
>> me like an overkill.
>
> Actually not much. Basically this virtual RDMA device points to the
> struct device of the physical device itself.
> So the only overhead is linking this structure to the native device
> structure and passing most of the calls to the native ib_device through a
> thin filter layer in the control path.
> post_send/recv/poll_cq will go directly to the native device, with the
> same performance.
Still, I think we already have code that wraps ib_device calls for
userspace, which is the ib_uverbs module. There's no need for an extra
layer.

Regards,
Haggai

2015-11-04 17:23:40

by Parav Pandit

Subject: Re: RFC rdma cgroup

On Wed, Nov 4, 2015 at 5:28 PM, Haggai Eran <[email protected]> wrote:
> On 03/11/2015 21:11, Parav Pandit wrote:
>> So it looks like below,
>> #cat rdma.resources.verbs.list
>> Output:
>> mlx4_0 uctx ah pd cq mr mw srq qp flow
>> mlx4_1 uctx ah pd cq mr mw srq qp flow rss_wq
> What happens if you set a limit of rss_wq to mlx4_0 in this example?
> Would it fail?
Yes. In the above example, the mlx4_0 device didn't have support for
rss_wq, so it didn't advertise rss_wq in the list file.

> I think it would be simpler for administrators if they
> can configure every resource supported by uverbs. If a resource is not
> supported by a specific device, you can never go over the limit anyway.
>
Exactly. That's the implementation today.