Date: Wed, 2 Mar 2016 12:39:49 -0500
From: Tejun Heo
To: Parav Pandit
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, hannes@cmpxchg.org, dledford@redhat.com,
	liranl@mellanox.com, sean.hefty@intel.com,
	jgunthorpe@obsidianresearch.com, haggaie@mellanox.com,
	corbet@lwn.net, james.l.morris@oracle.com, serge@hallyn.com,
	ogerlitz@mellanox.com, matanb@mellanox.com, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org
Subject: Re: [PATCHv9 1/3] rdmacg: Added rdma cgroup controller
Message-ID: <20160302173949.GG29826@mtj.duckdns.org>
References: <1456859137-13646-1-git-send-email-pandit.parav@gmail.com>
	<1456859137-13646-2-git-send-email-pandit.parav@gmail.com>
In-Reply-To: <1456859137-13646-2-git-send-email-pandit.parav@gmail.com>

Hello,

On Wed, Mar 02, 2016 at 12:35:35AM +0530, Parav Pandit wrote:
> diff --git a/include/linux/cgroup_rdma.h b/include/linux/cgroup_rdma.h
> new file mode 100644
> index 0000000..2da3d6c
> --- /dev/null
> +++ b/include/linux/cgroup_rdma.h
> @@ -0,0 +1,50 @@
> +#ifndef _CGROUP_RDMA_H
> +#define _CGROUP_RDMA_H
> +
> +#include
> +
> +#ifdef CONFIG_CGROUP_RDMA
> +
> +struct rdma_cgroup {
> +	struct cgroup_subsys_state css;
> +
> +	spinlock_t rpool_list_lock;	/* protects resource pool list */
> +	struct list_head rpool_head;	/* head to keep track of all resource
> +					 * pools that belongs to this cgroup.
> +					 */

I think we usually don't tail-wing these comments.

> +};
> +
> +struct rdmacg_pool_info {
> +	const char **resource_name_table;
> +	int table_len;

Align the fields consistently? I've said this multiple times now, but
please make the available resources constant and document them in
Documentation/cgroup-v2.txt.

> +};
> +
> +struct rdmacg_device {
> +	struct rdmacg_pool_info pool_info;
> +	struct list_head rdmacg_list;
> +	struct list_head rpool_head;
> +	/* protects resource pool list */
> +	spinlock_t rpool_lock;
> +	char *name;
> +};
> +
> +/* APIs for RDMA/IB stack to publish when a device wants to
> + * participate in resource accounting
> + */

Please follow CodingStyle. Wasn't this pointed out a couple of times
already?

> +enum rdmacg_file_type {
> +	RDMACG_RESOURCE_MAX,
> +	RDMACG_RESOURCE_STAT,
> +};

Heh, usually not a good sign. If you're using this to call into a
common function and then switch out on the file type, just switch out
at the method level and factor out the common part into helpers.
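
IOW, something like the following - completely untested, and
css_rdmacg() and the other names here are made up just to illustrate
the shape:

  static struct rdma_cgroup *css_rdmacg(struct cgroup_subsys_state *css)
  {
	return container_of(css, struct rdma_cgroup, css);
  }

  /* the common part lives in one helper... */
  static int rdmacg_print_resources(struct seq_file *sf,
				    struct rdma_cgroup *cg, bool show_usage)
  {
	/* walk @cg's rpools and print ->usage or ->max per resource */
	return 0;
  }

  /* ...and each file gets its own method, no file-type switch needed */
  static int rdmacg_max_show(struct seq_file *sf, void *v)
  {
	return rdmacg_print_resources(sf, css_rdmacg(seq_css(sf)), false);
  }

  static int rdmacg_stat_show(struct seq_file *sf, void *v)
  {
	return rdmacg_print_resources(sf, css_rdmacg(seq_css(sf)), true);
  }

  static struct cftype rdmacg_files[] = {
	{ .name = "max", .seq_show = rdmacg_max_show },
	{ .name = "current", .seq_show = rdmacg_stat_show },
	{ }	/* terminate */
  };

That way the enum and the switch disappear entirely.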

> +/* resource tracker per resource for rdma cgroup */
> +struct rdmacg_resource {
> +	int max;
> +	int usage;
> +};

Align the fields?

> +/**

The above indicates a kerneldoc comment, which this isn't.

> + * resource pool object which represents, per cgroup, per device
> + * resources. There are multiple instance
> + * of this object per cgroup, therefore it cannot be embedded within
> + * rdma_cgroup structure. It is maintained as list.

Consistent paragraph fill?

> + */
> +struct rdmacg_resource_pool {
> +	struct list_head cg_list;
> +	struct list_head dev_list;
> +
> +	struct rdmacg_device *device;
> +	struct rdmacg_resource *resources;
> +	struct rdma_cgroup *cg;	/* owner cg used during device cleanup */
> +
> +	int refcnt;		/* count active user tasks of this pool */
> +	int num_max_cnt;	/* total number counts which are set to max */
> +};

Formatting.

> +static inline void set_resource_limit(struct rdmacg_resource_pool *rpool,
> +				      int index, int new_max)

Is inline necessary? The compiler can figure these out on its own.

> +static struct rdmacg_resource_pool*
                                      ^ space
> +find_cg_rpool_locked(struct rdma_cgroup *cg,
> +		     struct rdmacg_device *device)

...

> +static void uncharge_cg_resource(struct rdma_cgroup *cg,
> +				 struct rdmacg_device *device,
> +				 int index, int num)
> +{
...
> +	rpool->refcnt--;
> +	if (rpool->refcnt == 0 && rpool->num_max_cnt == pool_info->table_len) {

If the caller charges 2 and then uncharges 1 two times, the refcnt
underflows, no? Why not just track how many usages are zero instead?
(Sketch of what I mean below the sig.)

...

> +void rdmacg_uncharge(struct rdma_cgroup *cg,
> +		     struct rdmacg_device *device,
> +		     int index, int num)
> +{
> +	struct rdma_cgroup *p;
> +
> +	for (p = cg; p; p = parent_rdmacg(p))
> +		uncharge_cg_resource(p, device, index, num);
> +
> +	css_put(&cg->css);
> +}
> +EXPORT_SYMBOL(rdmacg_uncharge);

So, I suppose the code is trying to save on lock contention with
fine-grained locking; however, as the code is currently structured,
it's just increasing the number of lock operations and it'd very
likely be cheaper to simply use a single lock. If you're walking up
the tree grabbing a lock at each level, per-node locking doesn't save
you anything. Also, it introduces conditions where charges are
spuriously denied when there are racing requestors. If scalability
becomes an issue, the right way to address it is adding a percpu
front cache.

> +void rdmacg_query_limit(struct rdmacg_device *device,
> +			int *limits)
> +{
> +	struct rdma_cgroup *cg, *p;
> +	struct rdmacg_resource_pool *rpool;
> +	struct rdmacg_pool_info *pool_info;
> +	int i;
> +
> +	cg = task_rdmacg(current);
> +	pool_info = &device->pool_info;

Nothing seems to prevent @cg from going away if this races with
@current being migrated to a different cgroup. Have you run this with
lockdep and RCU debugging enabled? This should have triggered a
warning.

...

> +	for (p = cg; p; p = parent_rdmacg(p)) {
> +		spin_lock(&cg->rpool_list_lock);
> +		rpool = find_cg_rpool_locked(cg, device);
> +		if (rpool) {
> +			for (i = 0; i < pool_info->table_len; i++)
> +				limits[i] = min_t(int, limits[i],
> +					rpool->resources[i].max);

So, complexity-wise, things like the above are pretty bad, and the
pattern is used in hot paths. I suppose there can only be a handful
of devices per system?

Thanks.

-- 
tejun
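
Here's roughly what I mean by tracking how many usages are zero -
again completely untested and the names are made up:

  /* in struct rdmacg_resource_pool, instead of refcnt: */
	int num_used;	/* resources whose ->usage is non-zero */

  static void rpool_inc_usage(struct rdmacg_resource_pool *rpool,
			      int index, int num)
  {
	struct rdmacg_resource *res = &rpool->resources[index];

	if (!res->usage)
		rpool->num_used++;	/* 0 -> non-zero transition */
	res->usage += num;
  }

  static void rpool_dec_usage(struct rdmacg_resource_pool *rpool,
			      int index, int num)
  {
	struct rdmacg_resource *res = &rpool->resources[index];

	WARN_ON_ONCE(res->usage < num);
	res->usage -= num;
	if (!res->usage)
		rpool->num_used--;	/* non-zero -> 0 transition */
  }

The pool is then a candidate for destruction when ->num_used is zero
and all limits are back at the default, and partial uncharges can't
underflow anything.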