Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Thu, 1 Aug 2019 02:45:38 +0200 (CEST)
From:   Thomas Gleixner <tglx@linutronix.de>
To:     Gabriel Krisman Bertazi <krisman@collabora.com>
cc:     mingo@redhat.com, peterz@infradead.org, dvhart@infradead.org,
        linux-kernel@vger.kernel.org, kernel@collabora.com,
        Zebediah Figura <z.figura12@gmail.com>,
        Steven Noonan <steven@valvesoftware.com>,
        "Pierre-Loup A . Griffais" <pgriffais@valvesoftware.com>
Subject: Re: [PATCH RFC 2/2] futex: Implement mechanism to wait on any of
 several futexes
In-Reply-To: <20190730220602.28781-2-krisman@collabora.com>
Message-ID: <alpine.DEB.2.21.1908010039470.1788@nanos.tec.linutronix.de>
References: <20190730220602.28781-1-krisman@collabora.com> <20190730220602.28781-2-krisman@collabora.com>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Tue, 30 Jul 2019, Gabriel Krisman Bertazi wrote:
> + retry:
> +	for (i = 0; i < count; i++) {
> +		qs[i].key = FUTEX_KEY_INIT;
> +		qs[i].bitset = wb[i].bitset;
> +
> +		ret = get_futex_key(wb[i].uaddr, flags & FLAGS_SHARED,
> +				    &qs[i].key, FUTEX_READ);
> +		if (unlikely(ret != 0)) {
> +			for (--i; i >= 0; i--)
> +				put_futex_key(&qs[i].key);
> +			goto out;
> +		}
> +	}
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	for (i = 0; i < count; i++) {
> +		ret = __futex_wait_setup(wb[i].uaddr, wb[i].val,
> +					 flags, &qs[i], &hb);
> +		if (ret) {
> +			/* Drop the failed key directly.  keys 0..(i-1)
> +			 * will be put by unqueue_me.
> +			 */
> +			put_futex_key(&qs[i].key);
> +
> +			/* Undo the partial work we did. */
> +			for (--i; i >= 0; i--)
> +				unqueue_me(&qs[i]);

That lacks a comment why this does not evaluate the return value of
unqueue_me(). If one of the already queued futexes got woken, then it's
debatable whether that counts as success or not. Whatever the decision is
it needs to be documented instead of documenting the obvious.

> +			__set_current_state(TASK_RUNNING);
> +			if (ret > 0)
> +				goto retry;
> +			goto out;
> +		}
> +
> +		/* We can't hold to the bucket lock when dealing with
> +		 * the next futex. Queue ourselves now so we can unlock
> +		 * it before moving on.
> +		 */
> +		queue_me(&qs[i], hb);
> +	}

Why  does this use two loops and two cleanup loops instead of doing
it in a single one?

> +
> +	/* There is no easy to way to check if we are wake already on
> +	 * multiple futexes without waking through each one of them.  So

waking?

> +	ret = -ERESTARTSYS;
> +	if (!abs_time)
> +		goto out;
> +
> +	ret = -ERESTART_RESTARTBLOCK;
> + out:

There is surely no more convoluted way to write this? That goto above is
just making my eyes bleed. Yes I know where you copied that from and then
removed everthing between the goto and the assignement.

     	ret = abs_time ? -ERESTART_RESTARTBLOCK : -ERESTARTSYS;

would be too simple to read.

Also this is a new interface and there is absolutely no reason to make it
misfeature compatible to FUTEX_WAIT even for the case that this might share
code. Supporting relative timeouts is wrong to begin with and for absolute
timeouts there is no reason to use restartblock.

> +static int futex_wait_multiple(u32 __user *uaddr, unsigned int flags,
> +			       u32 count, ktime_t *abs_time)
> +{
> +	struct futex_wait_block *wb;
> +	struct restart_block *restart;
> +	int ret;
> +
> +	if (!count)
> +		return -EINVAL;
> +
> +
> +	wb = kcalloc(count, sizeof(struct futex_wait_block), GFP_KERNEL);

There are a couple of wrongs here:

  - Lacks a sane upper limit for count. Relying on kcalloc() to fail is
    just wrong.
  
  - kcalloc() does a pointless zeroing for a buffer which is going to be
    overwritten anyway

  - sizeof(struct foo) instead of sizeof(*wb)

count * sizeof(*wb) must be calculated anyway because copy_from_user()
needs it. So using kcalloc() is pointless.

> +	if (!wb)
> +		return -ENOMEM;
> +
> +	if (copy_from_user(wb, uaddr,
> +			   count * sizeof(struct futex_wait_block))) {

And doing the size calculation only once spares that fugly line break.

But that's cosmetic. The way more serious issue is:

How is that supposed to work with compat tasks? 32bit and 64 bit apps do
not share the same pointer size and your UAPI struct is:

> +struct futex_wait_block {
> +	__u32 __user *uaddr;
> +	__u32 val;
> +	__u32 bitset;
> +};

A 64 bit kernel will happily read beyond the user buffer when called from a
32bit task and all struct members turn into garbage. That can be solved
with padding so pointer conversion can be avoided, but you have to be
careful about endianness.

Coming back to to allocation itself:

do_futex_wait_multiple() does another allocation depending on count. So you
call into the memory allocator twice in a row and on the way out you do the
same thing again with free.

Looking at the size constraints.

If I assume a maximum of 65 futexes which got mentioned in one of the
replies then this will allocate 7280 bytes alone for the futex_q array with
a stock debian config which has no debug options enabled which would bloat
the struct. Adding the futex_wait_block array into the same allocation
becomes larger than 8K which already exceeds thelimit of SLUB kmem
caches and forces the whole thing into the page allocator directly.

This sucks.

Also I'm confused about the 64 maximum resulting in 65 futexes comment in
one of the mails.

Can you please explain what you are trying to do exatly on the user space
side?

Thanks,

	tglx