2013-05-14 01:19:07

by Kent Overstreet

Subject: AIO refactoring/performance improvements/cancellation

This is a respin of the AIO patches that were deferred until 3.11, along
with some other stuff I had queued up.

Changes:

* Took the dynamic allocation stuff out of the percpu refcounting
patch, as Tejun wanted. I split the dynamic bits out into a separate
patch, which I may resend later.

* Changed batch completion to use a singly linked list instead of an rb
tree; it now calls batch_complete_aio() early if it has to look too
far down the list.

* Some batch completion performance improvements, to avoid doing nested
irqsave/restore (which was the source of a performance regression) and
to avoid freeing the kiocbs with irqs disabled.

* There's also some more assorted refactoring/minor performance
improvements that had been sitting in my tree for a while but weren't
in the patch series that was queued up for 3.10.

* And, the last few patches add cancellation for direct IO; these
patches are still preliminary but they do work and are useful for
some simple use cases.


2013-05-14 01:19:28

by Kent Overstreet

Subject: [PATCH 12/21] aio: convert the ioctx list to radix tree

From: Octavian Purdila <[email protected]>

When a large number of threads perform AIO operations, the ioctx list
can accumulate a large number of entries, which causes significant
lookup overhead. For example, when running this fio script:

rw=randrw; size=256k; directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=512; thread; loops=100

on an EXT2 filesystem mounted on top of a ramdisk, we can observe up to
30% of CPU time spent in lookup_ioctx:

    32.51%  [guest.kernel]  [g] lookup_ioctx
     9.19%  [guest.kernel]  [g] __lock_acquire.isra.28
     4.40%  [guest.kernel]  [g] lock_release
     4.19%  [guest.kernel]  [g] sched_clock_local
     3.86%  [guest.kernel]  [g] local_clock
     3.68%  [guest.kernel]  [g] native_sched_clock
     3.08%  [guest.kernel]  [g] sched_clock_cpu
     2.64%  [guest.kernel]  [g] lock_release_holdtime.part.11
     2.60%  [guest.kernel]  [g] memcpy
     2.33%  [guest.kernel]  [g] lock_acquired
     2.25%  [guest.kernel]  [g] lock_acquire
     1.84%  [guest.kernel]  [g] do_io_submit

This patch converts the ioctx list to a radix tree. For a performance
comparison, the above fio script was run on a 2-socket, 8-core machine.
These are the results (average and %rsd of 10 runs) for the original
list-based implementation and for the radix-tree-based implementation:

cores                        1          2          4          8         16         32
list                 109376 ms   69119 ms   35682 ms   22671 ms   19724 ms   16408 ms
%rsd                     0.69%      1.15%      1.17%      1.21%      1.71%      1.43%
radix                 73651 ms   41748 ms   23028 ms   16766 ms   15232 ms   13787 ms
%rsd                     1.19%      0.98%      0.69%      1.13%      0.72%      0.75%
radix as % of list      66.12%     65.59%     66.63%     72.31%     77.26%     83.66%

To assess the impact of the patch on the typical case of having only
one ctx per process, the following fio script was run:

rw=randrw; size=100m; directory=/mnt/fio; ioengine=libaio; iodepth=1
blocksize=1024; numjobs=1; thread; loops=100

on the same system and the results are the following:

list                  58892 ms
%rsd                     0.91%
radix                 59404 ms
%rsd                     0.81%
radix as % of list     100.87%
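
For clarity, the core of the change is that lookup_ioctx() goes from a
linear hlist walk to a single radix tree descent; a minimal sketch of
the new lookup (simplified from the diff below, with error handling
elided):

static struct kioctx *lookup_ioctx_sketch(struct mm_struct *mm,
					  unsigned long ctx_id)
{
	struct kioctx *ctx;

	rcu_read_lock();
	/* single descent keyed by the context id userspace passed in */
	ctx = radix_tree_lookup(&mm->ioctx_rtree, ctx_id);
	if (ctx)
		percpu_ref_get(&ctx->users);	/* take a ref before returning */
	rcu_read_unlock();

	return ctx;
}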

Signed-off-by: Octavian Purdila <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Josh Boyer <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
arch/s390/mm/pgtable.c | 4 +--
fs/aio.c | 76 ++++++++++++++++++++++++++++++------------------
include/linux/mm_types.h | 3 +-
kernel/fork.c | 2 +-
4 files changed, 52 insertions(+), 33 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 7805ddc..500426d 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1029,7 +1029,7 @@ int s390_enable_sie(void)
task_lock(tsk);
if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
#ifdef CONFIG_AIO
- !hlist_empty(&tsk->mm->ioctx_list) ||
+ tsk->mm->ioctx_rtree.rnode ||
#endif
tsk->mm != tsk->active_mm) {
task_unlock(tsk);
@@ -1056,7 +1056,7 @@ int s390_enable_sie(void)
task_lock(tsk);
if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
#ifdef CONFIG_AIO
- !hlist_empty(&tsk->mm->ioctx_list) ||
+ tsk->mm->ioctx_rtree.rnode ||
#endif
tsk->mm != tsk->active_mm) {
mmput(mm);
diff --git a/fs/aio.c b/fs/aio.c
index 7ce3cd8..a127e5a 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -37,6 +37,7 @@
#include <linux/blkdev.h>
#include <linux/compat.h>
#include <linux/percpu-refcount.h>
+#include <linux/radix-tree.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -68,9 +69,7 @@ struct kioctx_cpu {
struct kioctx {
struct percpu_ref users;

- /* This needs improving */
unsigned long user_id;
- struct hlist_node list;

struct __percpu kioctx_cpu *cpu;

@@ -437,10 +436,18 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
aio_nr += ctx->max_reqs;
spin_unlock(&aio_nr_lock);

- /* now link into global list. */
+ /* now insert into the radix tree */
+ err = radix_tree_preload(GFP_KERNEL);
+ if (err)
+ goto out_cleanup;
spin_lock(&mm->ioctx_lock);
- hlist_add_head_rcu(&ctx->list, &mm->ioctx_list);
+ err = radix_tree_insert(&mm->ioctx_rtree, ctx->user_id, ctx);
spin_unlock(&mm->ioctx_lock);
+ radix_tree_preload_end();
+ if (err) {
+ WARN_ONCE(1, "aio: insert into ioctx tree failed: %d", err);
+ goto out_cleanup;
+ }

pr_debug("allocated ioctx %p[%ld]: mm=%p mask=0x%x\n",
ctx, ctx->user_id, mm, ctx->nr_events);
@@ -483,8 +490,8 @@ static void kill_ioctx_rcu(struct rcu_head *head)
static void kill_ioctx(struct kioctx *ctx)
{
if (percpu_ref_kill(&ctx->users)) {
- hlist_del_rcu(&ctx->list);
- /* Between hlist_del_rcu() and dropping the initial ref */
+ radix_tree_delete(&current->mm->ioctx_rtree, ctx->user_id);
+ /* Between radix_tree_delete() and dropping the initial ref */
synchronize_rcu();

/*
@@ -524,25 +531,38 @@ EXPORT_SYMBOL(wait_on_sync_kiocb);
*/
void exit_aio(struct mm_struct *mm)
{
- struct kioctx *ctx;
- struct hlist_node *n;
-
- hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
- /*
- * We don't need to bother with munmap() here -
- * exit_mmap(mm) is coming and it'll unmap everything.
- * Since aio_free_ring() uses non-zero ->mmap_size
- * as indicator that it needs to unmap the area,
- * just set it to 0; aio_free_ring() is the only
- * place that uses ->mmap_size, so it's safe.
- */
- ctx->mmap_size = 0;
+ struct kioctx *ctx[16];
+ unsigned long idx = 0;
+ int count;

- if (percpu_ref_kill(&ctx->users)) {
- hlist_del_rcu(&ctx->list);
- call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
+ do {
+ int i;
+
+ count = radix_tree_gang_lookup(&mm->ioctx_rtree, (void **)ctx,
+ idx, sizeof(ctx)/sizeof(void *));
+ for (i = 0; i < count; i++) {
+ void *ret;
+
+ BUG_ON(ctx[i]->user_id < idx);
+ idx = ctx[i]->user_id;
+
+ /*
+ * We don't need to bother with munmap() here -
+ * exit_mmap(mm) is coming and it'll unmap everything.
+ * Since aio_free_ring() uses non-zero ->mmap_size
+ * as indicator that it needs to unmap the area,
+ * just set it to 0; aio_free_ring() is the only
+ * place that uses ->mmap_size, so it's safe.
+ */
+ ctx[i]->mmap_size = 0;
+
+ if (percpu_ref_kill(&ctx[i]->users)) {
+ ret = radix_tree_delete(&mm->ioctx_rtree, idx);
+ BUG_ON(!ret || ret != ctx[i]);
+ call_rcu(&ctx[i]->rcu_head, kill_ioctx_rcu);
+ }
}
- }
+ } while (count);
}

static void put_reqs_available(struct kioctx *ctx, unsigned nr)
@@ -629,12 +649,10 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

rcu_read_lock();

- hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
- if (ctx->user_id == ctx_id) {
- percpu_ref_get(&ctx->users);
- ret = ctx;
- break;
- }
+ ctx = radix_tree_lookup(&mm->ioctx_rtree, ctx_id);
+ if (ctx) {
+ percpu_ref_get(&ctx->users);
+ ret = ctx;
}

rcu_read_unlock();
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..758ad98 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -5,6 +5,7 @@
#include <linux/types.h>
#include <linux/threads.h>
#include <linux/list.h>
+#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/rbtree.h>
#include <linux/rwsem.h>
@@ -386,7 +387,7 @@ struct mm_struct {
struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
spinlock_t ioctx_lock;
- struct hlist_head ioctx_list;
+ struct radix_tree_root ioctx_rtree;
#endif
#ifdef CONFIG_MM_OWNER
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 987b28a..05d232f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -524,7 +524,7 @@ static void mm_init_aio(struct mm_struct *mm)
{
#ifdef CONFIG_AIO
spin_lock_init(&mm->ioctx_lock);
- INIT_HLIST_HEAD(&mm->ioctx_list);
+ INIT_RADIX_TREE(&mm->ioctx_rtree, GFP_KERNEL);
#endif
}

--
1.8.2.1

2013-05-14 01:19:21

by Kent Overstreet

Subject: [PATCH 08/21] aio: Kill aio_rw_vect_retry()

This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.

This change should be safe because aio_read/write are also used for
synchronous IO and are called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
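
For reference, the synchronous path makes a single call into
->aio_read()/->aio_write() and then just waits; roughly (a simplified
sketch along the lines of do_sync_read(), not verbatim kernel code):

static ssize_t sync_read_sketch(struct file *filp, char __user *buf,
				size_t len, loff_t *ppos)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct kiocb kiocb;
	ssize_t ret;

	init_sync_kiocb(&kiocb, filp);
	kiocb.ki_pos = *ppos;
	kiocb.ki_nbytes = len;

	/* one shot - there is no retry loop in the sync case */
	ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
	if (ret == -EIOCBQUEUED)
		ret = wait_on_sync_kiocb(&kiocb);

	*ppos = kiocb.ki_pos;
	return ret;
}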

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/staging/android/logger.c | 2 +-
drivers/usb/gadget/inode.c | 6 +--
fs/aio.c | 91 ++++++++--------------------------------
fs/block_dev.c | 2 +-
fs/nfs/direct.c | 1 -
fs/ocfs2/file.c | 6 +--
fs/read_write.c | 3 --
fs/udf/file.c | 2 +-
include/linux/aio.h | 2 -
mm/page_io.c | 1 -
net/socket.c | 2 +-
11 files changed, 28 insertions(+), 90 deletions(-)

diff --git a/drivers/staging/android/logger.c b/drivers/staging/android/logger.c
index b040200..f9dbebe 100644
--- a/drivers/staging/android/logger.c
+++ b/drivers/staging/android/logger.c
@@ -481,7 +481,7 @@ static ssize_t logger_aio_write(struct kiocb *iocb, const struct iovec *iov,
header.sec = now.tv_sec;
header.nsec = now.tv_nsec;
header.euid = current_euid();
- header.len = min_t(size_t, iocb->ki_left, LOGGER_ENTRY_MAX_PAYLOAD);
+ header.len = min_t(size_t, iocb->ki_nbytes, LOGGER_ENTRY_MAX_PAYLOAD);
header.hdr_size = sizeof(struct logger_entry);

/* null writes succeed, return zero */
diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index e02c1e0..f255ad7 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -708,11 +708,11 @@ ep_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;

- buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+ buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;

- return ep_aio_rwtail(iocb, buf, iocb->ki_left, epdata, iov, nr_segs);
+ return ep_aio_rwtail(iocb, buf, iocb->ki_nbytes, epdata, iov, nr_segs);
}

static ssize_t
@@ -727,7 +727,7 @@ ep_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (unlikely(!usb_endpoint_dir_in(&epdata->desc)))
return -EINVAL;

- buf = kmalloc(iocb->ki_left, GFP_KERNEL);
+ buf = kmalloc(iocb->ki_nbytes, GFP_KERNEL);
if (unlikely(!buf))
return -ENOMEM;

diff --git a/fs/aio.c b/fs/aio.c
index 2c9a5ac..73ec062 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -621,7 +621,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;

- atomic_set(&req->ki_users, 2);
+ atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
return req;
out_put:
@@ -965,75 +965,9 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
return -EINVAL;
}

-static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
-{
- struct iovec *iov = &iocb->ki_iovec[iocb->ki_cur_seg];
-
- BUG_ON(ret <= 0);
-
- while (iocb->ki_cur_seg < iocb->ki_nr_segs && ret > 0) {
- ssize_t this = min((ssize_t)iov->iov_len, ret);
- iov->iov_base += this;
- iov->iov_len -= this;
- iocb->ki_left -= this;
- ret -= this;
- if (iov->iov_len == 0) {
- iocb->ki_cur_seg++;
- iov++;
- }
- }
-
- /* the caller should not have done more io than what fit in
- * the remaining iovecs */
- BUG_ON(ret > 0 && iocb->ki_left == 0);
-}
-
typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);

-static ssize_t aio_rw_vect_retry(struct kiocb *iocb, int rw, aio_rw_op *rw_op)
-{
- struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- ssize_t ret = 0;
-
- /* This matches the pread()/pwrite() logic */
- if (iocb->ki_pos < 0)
- return -EINVAL;
-
- if (rw == WRITE)
- file_start_write(file);
- do {
- ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
- iocb->ki_nr_segs - iocb->ki_cur_seg,
- iocb->ki_pos);
- if (ret > 0)
- aio_advance_iovec(iocb, ret);
-
- /* retry all partial writes. retry partial reads as long as its a
- * regular file. */
- } while (ret > 0 && iocb->ki_left > 0 &&
- (rw == WRITE ||
- (!S_ISFIFO(inode->i_mode) && !S_ISSOCK(inode->i_mode))));
- if (rw == WRITE)
- file_end_write(file);
-
- /* This means we must have transferred all that we could */
- /* No need to retry anymore */
- if ((ret == 0) || (iocb->ki_left == 0))
- ret = iocb->ki_nbytes - iocb->ki_left;
-
- /* If we managed to write some out we return that, rather than
- * the eventual error. */
- if (rw == WRITE
- && ret < 0 && ret != -EIOCBQUEUED
- && iocb->ki_nbytes - iocb->ki_left)
- ret = iocb->ki_nbytes - iocb->ki_left;
-
- return ret;
-}
-
static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
{
ssize_t ret;
@@ -1118,9 +1052,22 @@ rw_common:
return ret;

req->ki_nbytes = ret;
- req->ki_left = ret;

- ret = aio_rw_vect_retry(req, rw, rw_op);
+ /* XXX: move/kill - rw_verify_area()? */
+ /* This matches the pread()/pwrite() logic */
+ if (req->ki_pos < 0) {
+ ret = -EINVAL;
+ break;
+ }
+
+ if (rw == WRITE)
+ file_start_write(file);
+
+ ret = rw_op(req, req->ki_iovec,
+ req->ki_nr_segs, req->ki_pos);
+
+ if (rw == WRITE)
+ file_end_write(file);
break;

case IOCB_CMD_FDSYNC:
@@ -1215,19 +1162,17 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_pos = iocb->aio_offset;

req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
- req->ki_left = req->ki_nbytes = iocb->aio_nbytes;
+ req->ki_nbytes = iocb->aio_nbytes;
req->ki_opcode = iocb->aio_lio_opcode;

ret = aio_run_iocb(req, compat);
if (ret)
goto out_put_req;

- aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
put_reqs_available(ctx, 1);
- aio_put_req(req); /* drop extra ref to req */
- aio_put_req(req); /* drop i/o ref to req */
+ aio_put_req(req);
return ret;
}

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..2964b15 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1556,7 +1556,7 @@ static ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
return 0;

size -= pos;
- if (size < iocb->ki_left)
+ if (size < iocb->ki_nbytes)
nr_segs = iov_shorten((struct iovec *)iov, nr_segs, size);
return generic_file_aio_read(iocb, iov, nr_segs, pos);
}
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 0bd7a55..91ff089 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -130,7 +130,6 @@ ssize_t nfs_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov, loff_

return -EINVAL;
#else
- VM_BUG_ON(iocb->ki_left != PAGE_SIZE);
VM_BUG_ON(iocb->ki_nbytes != PAGE_SIZE);

if (rw == READ || rw == KERNEL_READ)
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 8a7509f..c85ad15 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -2245,7 +2245,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
file->f_path.dentry->d_name.name,
(unsigned int)nr_segs);

- if (iocb->ki_left == 0)
+ if (iocb->ki_nbytes == 0)
return 0;

appending = file->f_flags & O_APPEND ? 1 : 0;
@@ -2296,7 +2296,7 @@ relock:

can_do_direct = direct_io;
ret = ocfs2_prepare_inode_for_write(file, ppos,
- iocb->ki_left, appending,
+ iocb->ki_nbytes, appending,
&can_do_direct, &has_refcount);
if (ret < 0) {
mlog_errno(ret);
@@ -2304,7 +2304,7 @@ relock:
}

if (direct_io && !is_sync_kiocb(iocb))
- unaligned_dio = ocfs2_is_io_unaligned(inode, iocb->ki_left,
+ unaligned_dio = ocfs2_is_io_unaligned(inode, iocb->ki_nbytes,
*ppos);

/*
diff --git a/fs/read_write.c b/fs/read_write.c
index 0343000..421cee4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -338,7 +338,6 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
@@ -388,7 +387,6 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos);
@@ -568,7 +566,6 @@ static ssize_t do_sync_readv_writev(struct file *filp, const struct iovec *iov,

init_sync_kiocb(&kiocb, filp);
kiocb.ki_pos = *ppos;
- kiocb.ki_left = len;
kiocb.ki_nbytes = len;

ret = fn(&kiocb, iov, nr_segs, kiocb.ki_pos);
diff --git a/fs/udf/file.c b/fs/udf/file.c
index 29569dd..c02a27a 100644
--- a/fs/udf/file.c
+++ b/fs/udf/file.c
@@ -141,7 +141,7 @@ static ssize_t udf_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
int err, pos;
- size_t count = iocb->ki_left;
+ size_t count = iocb->ki_nbytes;
struct udf_inode_info *iinfo = UDF_I(inode);

down_write(&iinfo->i_data_sem);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 8c8dd1d..7bb766e 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -50,11 +50,9 @@ struct kiocb {
unsigned short ki_opcode;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */
char __user *ki_buf; /* remaining iocb->aio_buf */
- size_t ki_left; /* remaining bytes */
struct iovec ki_inline_vec; /* inline vector */
struct iovec *ki_iovec;
unsigned long ki_nr_segs;
- unsigned long ki_cur_seg;

struct list_head ki_list; /* the aio core uses this
* for cancellation */
diff --git a/mm/page_io.c b/mm/page_io.c
index a8a3ef4..3db0f5f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -220,7 +220,6 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,

init_sync_kiocb(&kiocb, swap_file);
kiocb.ki_pos = page_file_offset(page);
- kiocb.ki_left = PAGE_SIZE;
kiocb.ki_nbytes = PAGE_SIZE;

set_page_writeback(page);
diff --git a/net/socket.c b/net/socket.c
index 6b94633..bfe9fab 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -925,7 +925,7 @@ static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (pos != 0)
return -ESPIPE;

- if (iocb->ki_left == 0) /* Match SYS5 behaviour */
+ if (iocb->ki_nbytes == 0) /* Match SYS5 behaviour */
return 0;


--
1.8.2.1

2013-05-14 01:19:38

by Kent Overstreet

Subject: [PATCH 20/21] direct-io: Set dio->io_error directly

The way IO errors are returned in the dio code was rather convoluted,
and it also meant that the specific error code was lost. We need to
return the actual error so that, for cancellation, we can pass up
-ECANCELED.
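
Schematically, the change is just to record the real error where the
bio completes, instead of collapsing everything to -EIO later (a
simplified sketch, not the exact dio code):

/* bio end_io callbacks now capture the specific error code directly */
static void dio_end_io_sketch(struct bio *bio, int error)
{
	struct dio *dio = bio->bi_private;

	if (error)		/* e.g. -EIO, or -ECANCELED on cancellation */
		dio->io_error = error;

	/* ... hand the bio to the rest of dio completion as before ... */
}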

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/direct-io.c | 38 +++++++++++++++++---------------------
1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index b4dd97c..9ac3011 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -271,7 +271,7 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
return ret;
}

-static int dio_bio_complete(struct dio *dio, struct bio *bio);
+static void dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
*/
@@ -282,6 +282,9 @@ static void dio_bio_end_aio(struct bio *bio, int error,
unsigned long remaining;
unsigned long flags;

+ if (error)
+ dio->io_error = error;
+
/* cleanup the bio */
dio_bio_complete(dio, bio);

@@ -309,6 +312,9 @@ static void dio_bio_end_io(struct bio *bio, int error)
struct dio *dio = bio->bi_private;
unsigned long flags;

+ if (error)
+ dio->io_error = error;
+
spin_lock_irqsave(&dio->bio_lock, flags);
bio->bi_private = dio->bio_list;
dio->bio_list = bio;
@@ -438,15 +444,11 @@ static struct bio *dio_await_one(struct dio *dio)
/*
* Process one completed BIO. No locks are held.
*/
-static int dio_bio_complete(struct dio *dio, struct bio *bio)
+static void dio_bio_complete(struct dio *dio, struct bio *bio)
{
- const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec;
unsigned i;

- if (!uptodate)
- dio->io_error = -EIO;
-
if (dio->is_async && dio->rw == READ) {
bio_check_pages_dirty(bio); /* transfers ownership */
} else {
@@ -459,7 +461,6 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio)
}
bio_put(bio);
}
- return uptodate ? 0 : -EIO;
}

/*
@@ -486,27 +487,21 @@ static void dio_await_completion(struct dio *dio)
*
* This also helps to limit the peak amount of pinned userspace memory.
*/
-static inline int dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
+static inline void dio_bio_reap(struct dio *dio, struct dio_submit *sdio)
{
- int ret = 0;
-
if (sdio->reap_counter++ >= 64) {
while (dio->bio_list) {
unsigned long flags;
struct bio *bio;
- int ret2;

spin_lock_irqsave(&dio->bio_lock, flags);
bio = dio->bio_list;
dio->bio_list = bio->bi_private;
spin_unlock_irqrestore(&dio->bio_lock, flags);
- ret2 = dio_bio_complete(dio, bio);
- if (ret == 0)
- ret = ret2;
+ dio_bio_complete(dio, bio);
}
sdio->reap_counter = 0;
}
- return ret;
}

/*
@@ -591,19 +586,20 @@ static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
sector_t start_sector, struct buffer_head *map_bh)
{
sector_t sector;
- int ret, nr_pages;
+ int nr_pages;
+
+ dio_bio_reap(dio, sdio);
+
+ if (dio->io_error)
+ return dio->io_error;

- ret = dio_bio_reap(dio, sdio);
- if (ret)
- goto out;
sector = start_sector << (sdio->blkbits - 9);
nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
nr_pages = min(nr_pages, BIO_MAX_PAGES);
BUG_ON(nr_pages <= 0);
dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
sdio->boundary = 0;
-out:
- return ret;
+ return 0;
}

/*
--
1.8.2.1

2013-05-14 01:19:43

by Kent Overstreet

Subject: [PATCH 18/21] aio: Allow cancellation without a cancel callback, new kiocb lookup

This patch does a couple things:

* Allows cancellation of any kiocb, even if the driver doesn't
implement a ki_cancel callback function. This will be used for block
layer cancellation - there, implementing a callback is problematic,
but we can implement useful cancellation by just checking whether the
kiocb has been marked as cancelled when we go to dequeue the
request.

* Implements a new lookup mechanism for cancellation.

Previously, to cancel a kiocb we had to look it up in a linked list,
and kiocbs were added to the linked list lazily. But if any kiocb is
cancellable, the lazy list adding no longer works, so we need a new
mechanism.

This is done by allocating kiocbs out of a (lazily allocated) array
of pages, which means we can refer to the kiocbs (and iterate over
them) with small integers - we use the percpu tag allocation code for
allocating individual kiocbs. (A simplified sketch of the id-to-kiocb
mapping follows this list.)
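
A simplified sketch of the id-to-kiocb mapping this sets up (the
constant and helper mirror the fs/aio.c diff below):

#define KIOCBS_PER_PAGE	(PAGE_SIZE / sizeof(struct kiocb))

/*
 * kiocb ids are small integers handed out by the percpu tag allocator;
 * each id indexes into a lazily allocated array of pages of kiocbs.
 */
static inline struct kiocb *kiocb_from_id(struct kioctx *ctx, unsigned id)
{
	struct page *p = ctx->kiocb_pages[id / KIOCBS_PER_PAGE];

	/* NULL means the page backing this id hasn't been allocated yet */
	return p
		? ((struct kiocb *) page_address(p)) + (id % KIOCBS_PER_PAGE)
		: NULL;
}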

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/aio.c | 207 +++++++++++++++++++++++++++++++++-------------------
include/linux/aio.h | 92 ++++++++++++++++-------
2 files changed, 197 insertions(+), 102 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index aa39194..f4ea8d5 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -39,6 +39,7 @@
#include <linux/compat.h>
#include <linux/percpu-refcount.h>
#include <linux/radix-tree.h>
+#include <linux/tags.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -74,6 +75,9 @@ struct kioctx {

struct __percpu kioctx_cpu *cpu;

+ struct tag_pool kiocb_tags;
+ struct page **kiocb_pages;
+
/*
* For percpu reqs_available, number of slots we move to/from global
* counter at a time:
@@ -113,11 +117,6 @@ struct kioctx {
} ____cacheline_aligned_in_smp;

struct {
- spinlock_t ctx_lock;
- struct list_head active_reqs; /* used for cancellation */
- } ____cacheline_aligned_in_smp;
-
- struct {
struct mutex ring_lock;
wait_queue_head_t wait;
} ____cacheline_aligned_in_smp;
@@ -136,16 +135,25 @@ unsigned long aio_nr; /* current system wide number of aio requests */
unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
/*----end sysctl variables---*/

-static struct kmem_cache *kiocb_cachep;
static struct kmem_cache *kioctx_cachep;

+#define KIOCBS_PER_PAGE (PAGE_SIZE / sizeof(struct kiocb))
+
+static inline struct kiocb *kiocb_from_id(struct kioctx *ctx, unsigned id)
+{
+ struct page *p = ctx->kiocb_pages[id / KIOCBS_PER_PAGE];
+
+ return p
+ ? ((struct kiocb *) page_address(p)) + (id % KIOCBS_PER_PAGE)
+ : NULL;
+}
+
/* aio_setup
* Creates the slab caches used by the aio routines, panic on
* failure as this is done early during the boot sequence.
*/
static int __init aio_setup(void)
{
- kiocb_cachep = KMEM_CACHE(kiocb, SLAB_HWCACHE_ALIGN|SLAB_PANIC);
kioctx_cachep = KMEM_CACHE(kioctx,SLAB_HWCACHE_ALIGN|SLAB_PANIC);

pr_debug("sizeof(struct page) = %zu\n", sizeof(struct page));
@@ -245,45 +253,58 @@ static int aio_setup_ring(struct kioctx *ctx)

void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
{
- struct kioctx *ctx = req->ki_ctx;
- unsigned long flags;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
+ kiocb_cancel_fn *p, *old = req->ki_cancel;

- if (!req->ki_list.next)
- list_add(&req->ki_list, &ctx->active_reqs);
-
- req->ki_cancel = cancel;
+ do {
+ if (old == KIOCB_CANCELLED) {
+ cancel(req);
+ return;
+ }

- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ p = old;
+ old = cmpxchg(&req->ki_cancel, old, cancel);
+ } while (old != p);
}
EXPORT_SYMBOL(kiocb_set_cancel_fn);

-static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
+static void kiocb_cancel(struct kioctx *ctx, struct kiocb *req)
{
- kiocb_cancel_fn *old, *cancel;
+ kiocb_cancel_fn *old, *new, *cancel = req->ki_cancel;

- /*
- * Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
- * actually has a cancel function, hence the cmpxchg()
- */
+ local_irq_disable();

- cancel = ACCESS_ONCE(kiocb->ki_cancel);
do {
- if (!cancel || cancel == KIOCB_CANCELLED)
- return -EINVAL;
+ if (cancel == KIOCB_CANCELLING ||
+ cancel == KIOCB_CANCELLED)
+ goto out;

old = cancel;
- cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
- } while (cancel != old);
+ new = cancel ? KIOCB_CANCELLING : KIOCB_CANCELLED;
+
+ cancel = cmpxchg(&req->ki_cancel, old, KIOCB_CANCELLING);
+ } while (old != cancel);

- return cancel(kiocb);
+ if (cancel) {
+ cancel(req);
+ smp_wmb();
+ req->ki_cancel = KIOCB_CANCELLED;
+ }
+out:
+ local_irq_enable();
}

static void free_ioctx_rcu(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+ unsigned i;
+
+ for (i = 0; i < DIV_ROUND_UP(ctx->nr_events, KIOCBS_PER_PAGE); i++)
+ if (ctx->kiocb_pages[i])
+ __free_page(ctx->kiocb_pages[i]);

+ kfree(ctx->kiocb_pages);
+
+ tag_pool_free(&ctx->kiocb_tags);
free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
}
@@ -296,21 +317,16 @@ static void free_ioctx_rcu(struct rcu_head *head)
static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
- struct kiocb *req;
- unsigned cpu, avail;
+ unsigned i, cpu, avail;
DEFINE_WAIT(wait);

- spin_lock_irq(&ctx->ctx_lock);
+ for (i = 0; i < ctx->nr_events; i++) {
+ struct kiocb *req = kiocb_from_id(ctx, i);

- while (!list_empty(&ctx->active_reqs)) {
- req = list_first_entry(&ctx->active_reqs,
- struct kiocb, ki_list);
-
- list_del_init(&req->ki_list);
- kiocb_cancel(ctx, req);
+ if (req)
+ kiocb_cancel(ctx, req);
}

- spin_unlock_irq(&ctx->ctx_lock);

for_each_possible_cpu(cpu) {
struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
@@ -409,13 +425,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
percpu_ref_get(&ctx->users);
rcu_read_unlock();

- spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
init_waitqueue_head(&ctx->wait);

- INIT_LIST_HEAD(&ctx->active_reqs);
-
ctx->cpu = alloc_percpu(struct kioctx_cpu);
if (!ctx->cpu)
goto out_freeref;
@@ -427,6 +440,15 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
BUG_ON(!ctx->req_batch);

+ if (tag_pool_init(&ctx->kiocb_tags, ctx->nr_events))
+ goto out_freering;
+
+ ctx->kiocb_pages =
+ kzalloc(DIV_ROUND_UP(ctx->nr_events, KIOCBS_PER_PAGE) *
+ sizeof(struct page *), GFP_KERNEL);
+ if (!ctx->kiocb_pages)
+ goto out_freetags;
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -456,6 +478,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

out_cleanup:
err = -EAGAIN;
+ kfree(ctx->kiocb_pages);
+out_freetags:
+ tag_pool_free(&ctx->kiocb_tags);
+out_freering:
aio_free_ring(ctx);
out_freepcpu:
free_percpu(ctx->cpu);
@@ -619,17 +645,46 @@ out:
static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;
+ unsigned id;

if (!get_reqs_available(ctx))
return NULL;

- req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
- if (unlikely(!req))
- goto out_put;
+ id = tag_alloc(&ctx->kiocb_tags, false);
+ if (!id)
+ goto err;
+
+ req = kiocb_from_id(ctx, id);
+ if (!req) {
+ unsigned i, page_nr = id / KIOCBS_PER_PAGE;
+ struct page *p = alloc_page(GFP_KERNEL);
+ if (!p)
+ goto err;

+ req = page_address(p);
+
+ for (i = 0; i < KIOCBS_PER_PAGE; i++) {
+ req[i].ki_cancel = KIOCB_CANCELLED;
+ req[i].ki_id = page_nr * KIOCBS_PER_PAGE + i;
+ }
+
+ smp_wmb();
+
+ if (cmpxchg(&ctx->kiocb_pages[page_nr], NULL, p) != NULL)
+ __free_page(p);
+ }
+
+ req = kiocb_from_id(ctx, id);
+
+ /*
+ * Can't set ki_cancel to NULL until we're ready for it to be
+ * cancellable - leave it as KIOCB_CANCELLED until then
+ */
+ memset(req, 0, offsetof(struct kiocb, ki_cancel));
req->ki_ctx = ctx;
+
return req;
-out_put:
+err:
put_reqs_available(ctx, 1);
return NULL;
}
@@ -640,7 +695,7 @@ static void kiocb_free(struct kiocb *req)
fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
- kmem_cache_free(kiocb_cachep, req);
+ tag_free(&req->ki_ctx->kiocb_tags, req->ki_id);
}

static struct kioctx *lookup_ioctx(unsigned long ctx_id)
@@ -770,17 +825,21 @@ EXPORT_SYMBOL(batch_complete_aio);
void aio_complete_batch(struct kiocb *req, long res, long res2,
struct batch_complete *batch)
{
- req->ki_res = res;
- req->ki_res2 = res2;
+ kiocb_cancel_fn *old = NULL, *cancel = req->ki_cancel;
+
+ do {
+ if (cancel == KIOCB_CANCELLING) {
+ cpu_relax();
+ cancel = req->ki_cancel;
+ continue;
+ }

- if (req->ki_list.next) {
- struct kioctx *ctx = req->ki_ctx;
- unsigned long flags;
+ old = cancel;
+ cancel = cmpxchg(&req->ki_cancel, old, KIOCB_CANCELLED);
+ } while (old != cancel);

- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_del(&req->ki_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- }
+ req->ki_res = res;
+ req->ki_res2 = res2;

/*
* Special case handling for sync iocbs:
@@ -1204,7 +1263,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
}
}

- ret = put_user(KIOCB_KEY, &user_iocb->aio_key);
+ ret = put_user(req->ki_id, &user_iocb->aio_key);
if (unlikely(ret)) {
pr_debug("EFAULT: aio_key\n");
goto out_put_req;
@@ -1215,6 +1274,13 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_pos = iocb->aio_offset;
req->ki_nbytes = iocb->aio_nbytes;

+ /*
+ * ki_obj.user must point to the right iocb before making the kiocb
+ * cancellable by setting ki_cancel = NULL:
+ */
+ smp_wmb();
+ req->ki_cancel = NULL;
+
ret = aio_run_iocb(req, iocb->aio_lio_opcode,
(char __user *)(unsigned long)iocb->aio_buf,
compat);
@@ -1305,19 +1371,16 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,
u32 key)
{
- struct list_head *pos;
-
- assert_spin_locked(&ctx->ctx_lock);
+ struct kiocb *req;

- if (key != KIOCB_KEY)
+ if (key > ctx->nr_events)
return NULL;

- /* TODO: use a hash or array, this sucks. */
- list_for_each(pos, &ctx->active_reqs) {
- struct kiocb *kiocb = list_kiocb(pos);
- if (kiocb->ki_obj.user == iocb)
- return kiocb;
- }
+ req = kiocb_from_id(ctx, key);
+
+ if (req && req->ki_obj.user == iocb)
+ return req;
+
return NULL;
}

@@ -1347,17 +1410,9 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
if (unlikely(!ctx))
return -EINVAL;

- spin_lock_irq(&ctx->ctx_lock);
-
kiocb = lookup_kiocb(ctx, iocb, key);
- if (kiocb)
- ret = kiocb_cancel(ctx, kiocb);
- else
- ret = -EINVAL;
-
- spin_unlock_irq(&ctx->ctx_lock);
-
- if (!ret) {
+ if (kiocb) {
+ kiocb_cancel(ctx, kiocb);
/*
* The result argument is no longer used - the io_event is
* always delivered via the ring buffer. -EINPROGRESS indicates
diff --git a/include/linux/aio.h b/include/linux/aio.h
index a6fe048..985e664 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -13,31 +13,80 @@ struct kioctx;
struct kiocb;
struct batch_complete;

-#define KIOCB_KEY 0
-
/*
- * We use ki_cancel == KIOCB_CANCELLED to indicate that a kiocb has been either
- * cancelled or completed (this makes a certain amount of sense because
- * successful cancellation - io_cancel() - does deliver the completion to
- * userspace).
+ * CANCELLATION
+ *
+ * SEMANTICS:
+ *
+ * Userspace may indicate (via io_cancel()) that they wish an iocb to be
+ * cancelled. io_cancel() does nothing more than indicate that the iocb should
+ * be cancelled if possible; it does not indicate whether it succeeded (nor will
+ * it block).
+ *
+ * If cancellation does succeed, userspace should be informed by passing
+ * -ECANCELLED to aio_complete(); userspace retrieves the io_event in the usual
+ * manner.
+ *
+ * DRIVERS:
+ *
+ * A driver that wishes to support cancellation may (but does not have to)
+ * implement a ki_cancel callback. If it doesn't implement a callback, it can
+ * check if the kiocb has been marked as cancelled (with kiocb_cancelled()).
+ * This is what the block layer does - when dequeuing requests it checks to see
+ * if it's for a bio that's been marked as cancelled, and if so doesn't send it
+ * to the device.
+ *
+ * Some drivers are going to need to kick something to notice that kiocb has
+ * been cancelled - those will want to implement a ki_cancel function. The
+ * callback could, say, issue a wakeup so that the thread processing the kiocb
+ * can notice the cancellation - or it might do something else entirely.
+ * kiocb->private is owned by the driver, so that ki_cancel can find the
+ * driver's state.
+ *
+ * A driver must guarantee that a kiocb completes in bounded time if it's been
+ * cancelled - this means that ki_cancel may have to guarantee forward progress.
+ *
+ * ki_cancel() may not call aio_complete().
*
- * And since most things don't implement kiocb cancellation and we'd really like
- * kiocb completion to be lockless when possible, we use ki_cancel to
- * synchronize cancellation and completion - we only set it to KIOCB_CANCELLED
- * with xchg() or cmpxchg(), see batch_complete_aio() and kiocb_cancel().
+ * SYNCHRONIZATION:
+ *
+ * The aio code ensures that after aio_complete() returns, no ki_cancel function
+ * can be called or still be executing. Thus, the driver should free whatever
+ * kiocb->private points to after calling aio_complete().
+ *
+ * Drivers must not set kiocb->ki_cancel directly; they should use
+ * kiocb_set_cancel_fn(), which guards against races with kiocb_cancel(). It
+ * might be the case that userspace cancelled the iocb before the driver called
+ * kiocb_set_cancel_fn() - in that case, kiocb_set_cancel_fn() will immediately
+ * call the cancel function you passed it, and leave ki_cancel set to
+ * KIOCB_CANCELLED.
+ */
+
+/*
+ * Special values for kiocb->ki_cancel - these indicate that a kiocb has either
+ * been cancelled, or has a ki_cancel function currently running.
*/
-#define KIOCB_CANCELLED ((void *) (~0ULL))
+#define KIOCB_CANCELLED ((void *) (-1LL))
+#define KIOCB_CANCELLING ((void *) (-2LL))

typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
struct kiocb *ki_next; /* batch completion */

+ /*
+ * If the aio_resfd field of the userspace iocb is not zero,
+ * this is the underlying eventfd context to deliver events to.
+ */
+ struct eventfd_ctx *ki_eventfd;
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
- kiocb_cancel_fn *ki_cancel;
void *private;

+ /* Only zero up to here in aio_get_req() */
+ kiocb_cancel_fn *ki_cancel;
+ unsigned ki_id;
+
union {
void __user *user;
struct task_struct *tsk;
@@ -49,17 +98,13 @@ struct kiocb {

loff_t ki_pos;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */
-
- struct list_head ki_list; /* the aio core uses this
- * for cancellation */
-
- /*
- * If the aio_resfd field of the userspace iocb is not zero,
- * this is the underlying eventfd context to deliver events to.
- */
- struct eventfd_ctx *ki_eventfd;
};

+static inline bool kiocb_cancelled(struct kiocb *kiocb)
+{
+ return kiocb->ki_cancel == KIOCB_CANCELLED;
+}
+
static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
return kiocb->ki_ctx == NULL;
@@ -107,11 +152,6 @@ static inline void aio_complete(struct kiocb *iocb, long res, long res2)
aio_complete_batch(iocb, res, res2, NULL);
}

-static inline struct kiocb *list_kiocb(struct list_head *h)
-{
- return list_entry(h, struct kiocb, ki_list);
-}
-
/* for sysctl: */
extern unsigned long aio_nr;
extern unsigned long aio_max_nr;
--
1.8.2.1

2013-05-14 01:19:40

by Kent Overstreet

Subject: [PATCH 19/21] aio/usb: Update cancellation for new synchronization

The previous patch got rid of kiocb->ki_users; this was done by having
kiocb_cancel()/aio_complete() explicitly synchronize with each other.

The new rule is that after a driver's call to aio_complete() returns,
ki_cancel cannot still be running and it's safe to dispose of whatever
kiocb->private points to. But this means ki_cancel() won't be able to
call aio_complete() itself, or aio_complete() will deadlock.

So, update the driver accordingly.
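
A minimal sketch of the ordering this rule imposes on a driver's
completion path (illustrative only - the struct and function names here
are hypothetical; the real conversion is the inode.c diff below):

static void driver_complete_sketch(struct kiocb *iocb, long res)
{
	/* hypothetical per-request driver state hung off iocb->private */
	struct my_req *priv = iocb->private;

	/*
	 * After aio_complete() returns, no ki_cancel callback can still
	 * be running, so per-request state may be torn down afterwards.
	 * ki_cancel itself must not call aio_complete(), or it would
	 * deadlock.
	 */
	aio_complete(iocb, res, 0);
	kfree(priv);
}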

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/usb/gadget/inode.c | 61 +++++++++++++++++++++-------------------------
1 file changed, 28 insertions(+), 33 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index f255ad7..69adb87 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -522,6 +522,7 @@ struct kiocb_priv {
const struct iovec *iv;
unsigned long nr_segs;
unsigned actual;
+ int status;
};

static int ep_aio_cancel(struct kiocb *iocb)
@@ -577,14 +578,26 @@ static void ep_user_copy_worker(struct work_struct *work)
struct kiocb_priv *priv = container_of(work, struct kiocb_priv, work);
struct mm_struct *mm = priv->mm;
struct kiocb *iocb = priv->iocb;
- size_t ret;

- use_mm(mm);
- ret = ep_copy_to_user(priv);
- unuse_mm(mm);
+ if (priv->iv && priv->actual) {
+ size_t ret;
+
+ use_mm(mm);
+ ret = ep_copy_to_user(priv);
+ unuse_mm(mm);
+
+ if (!priv->status)
+ priv->status = ret;
+ /*
+ * completing the iocb can drop the ctx and mm, don't touch mm
+ * after
+ */
+ }

- /* completing the iocb can drop the ctx and mm, don't touch mm after */
- aio_complete(iocb, ret, ret);
+
+ /* aio_complete() reports bytes-transferred _and_ faults */
+ aio_complete(iocb, priv->actual ? priv->actual : priv->status,
+ priv->status);

kfree(priv->buf);
kfree(priv);
@@ -596,36 +609,18 @@ static void ep_aio_complete(struct usb_ep *ep, struct usb_request *req)
struct kiocb_priv *priv = iocb->private;
struct ep_data *epdata = priv->epdata;

- /* lock against disconnect (and ideally, cancel) */
- spin_lock(&epdata->dev->lock);
- priv->req = NULL;
- priv->epdata = NULL;
-
- /* if this was a write or a read returning no data then we
- * don't need to copy anything to userspace, so we can
- * complete the aio request immediately.
- */
- if (priv->iv == NULL || unlikely(req->actual == 0)) {
- kfree(req->buf);
- kfree(priv);
- iocb->private = NULL;
- /* aio_complete() reports bytes-transferred _and_ faults */
- aio_complete(iocb, req->actual ? req->actual : req->status,
- req->status);
- } else {
- /* ep_copy_to_user() won't report both; we hide some faults */
- if (unlikely(0 != req->status))
- DBG(epdata->dev, "%s fault %d len %d\n",
- ep->name, req->status, req->actual);
-
- priv->buf = req->buf;
- priv->actual = req->actual;
- schedule_work(&priv->work);
- }
- spin_unlock(&epdata->dev->lock);
+ priv->buf = req->buf;
+ priv->actual = req->actual;
+ priv->status = req->status;

usb_ep_free_request(ep, req);
put_ep(epdata);
+
+ if ((priv->iv && priv->actual) ||
+ iocb->ki_cancel == KIOCB_CANCELLING)
+ schedule_work(&priv->work);
+ else
+ ep_user_copy_worker(&priv->work);
}

static ssize_t
--
1.8.2.1

2013-05-14 01:20:52

by Kent Overstreet

Subject: [PATCH 21/21] block: Bio cancellation

If a bio is associated with a kiocb, allow it to be cancelled.

This is accomplished by adding a pointer to a kiocb in struct bio, and
when we go to dequeue a request we check if its bio has been cancelled -
if so, we end the request with -ECANCELED.

We don't currently try to cancel bios if IO has already been started -
that'd require a per bio callback function, and a way to find all the
outstanding bios for a given kiocb. Such a mechanism may or may not be
added in the future but this patch tries to start simple.

Currently this can only be triggered with aio and io_cancel(), but the
mechanism can be used for sync io too.

It can also be used for bios created by stacking drivers, and bio
clones in general - when cloning a bio, if the bi_iocb pointer is
copied as well, the clone will then be cancellable. bio_clone() could
be modified to do this, but hasn't been in this patch, because all the
bio_clone() users would need to be audited to make sure that it's safe.
We can't blindly make e.g. raid5 writes cancellable without the md
code's knowledge.
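
For illustration only, a hypothetical clone helper that opts a stacked
bio into cancellation might look like the sketch below; it is not part
of this patch, and whether copying bi_iocb is safe is exactly the
per-driver audit question raised above:

/*
 * Hypothetical: only for drivers audited to cope with the clone
 * completing early with -ECANCELED.
 */
static struct bio *bio_clone_cancellable(struct bio *bio, gfp_t gfp)
{
	struct bio *clone = bio_clone(bio, gfp);

	if (clone)
		clone->bi_iocb = bio->bi_iocb;	/* propagate the kiocb */
	return clone;
}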

Initial patch by Anatol Pomazau ([email protected]).

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
block/blk-core.c | 15 +++++++++++++++
fs/direct-io.c | 1 +
include/linux/aio.h | 6 ++++++
include/linux/blk_types.h | 1 +
4 files changed, 23 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 94aa4e7..6bb99b6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -31,6 +31,7 @@
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
+#include <linux/aio.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>
@@ -1744,6 +1745,11 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}

+ if (bio_cancelled(bio)) {
+ err = -ECANCELED;
+ goto end_io;
+ }
+
/*
* Various block parts want %current->io_context and lazy ioc
* allocation ends up trading a lot of pain for a small amount of
@@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
trace_block_rq_issue(q, rq);
}

+ if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
+ blk_start_request(rq);
+ __blk_end_request_all(rq, -ECANCELED);
+ continue;
+ }
+
if (!q->boundary_rq || q->boundary_rq == rq) {
q->end_sector = rq_end_sector(rq);
q->boundary_rq = NULL;
@@ -2308,6 +2320,8 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
char *error_type;

switch (error) {
+ case -ECANCELED:
+ goto noerr;
case -ENOLINK:
error_type = "recoverable transport";
break;
@@ -2328,6 +2342,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
(unsigned long long)blk_rq_pos(req));

}
+noerr:

blk_account_io_completion(req, nr_bytes);

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9ac3011..3ae5121 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -377,6 +377,7 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
unsigned long flags;

bio->bi_private = dio;
+ bio->bi_iocb = dio->iocb;

spin_lock_irqsave(&dio->bio_lock, flags);
dio->refcount++;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 985e664..4893b8b 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -8,6 +8,7 @@
#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/batch_complete.h>
+#include <linux/blk_types.h>

struct kioctx;
struct kiocb;
@@ -105,6 +106,11 @@ static inline bool kiocb_cancelled(struct kiocb *kiocb)
return kiocb->ki_cancel == KIOCB_CANCELLED;
}

+static inline bool bio_cancelled(struct bio *bio)
+{
+ return bio->bi_iocb && kiocb_cancelled(bio->bi_iocb);
+}
+
static inline bool is_sync_kiocb(struct kiocb *kiocb)
{
return kiocb->ki_ctx == NULL;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 9d3cafa..7252484 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -43,6 +43,7 @@ struct bio {
* top bits priority
*/

+ struct kiocb *bi_iocb;
short bi_error;
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */
--
1.8.2.1

2013-05-14 01:20:54

by Kent Overstreet

Subject: [PATCH 17/21] Percpu tag allocator

Allocates integers out of a predefined range - for use by e.g. a driver
to allocate tags for communicating with the device.
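
A minimal usage sketch of the API this adds (the function names and
semantics are as in the patch below; the pool size, error codes and
caller are illustrative):

#include <linux/tags.h>

static struct tag_pool pool;

static int setup_tags(void)
{
	/* tag 0 is reserved as the "no tag" value, so usable tags are 1..127 */
	return tag_pool_init(&pool, 128);
}

static int do_one_request(void)
{
	unsigned tag = tag_alloc(&pool, false);	/* false: don't sleep */

	if (!tag)			/* 0 means nothing free right now */
		return -EBUSY;

	/* ... index per-request state by 'tag', talk to the device ... */

	tag_free(&pool, tag);
	return 0;
}

static void teardown_tags(void)
{
	tag_pool_free(&pool);
}

With wait == true, tag_alloc() instead sleeps until another CPU frees a
tag, so it never returns 0 in that mode.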

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
include/linux/tags.h | 38 ++++++++++++
lib/Kconfig | 3 +
lib/Makefile | 2 +-
lib/tags.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 209 insertions(+), 1 deletion(-)
create mode 100644 include/linux/tags.h
create mode 100644 lib/tags.c

diff --git a/include/linux/tags.h b/include/linux/tags.h
new file mode 100644
index 0000000..1b8cfca
--- /dev/null
+++ b/include/linux/tags.h
@@ -0,0 +1,38 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#ifndef _LINUX_TAGS_H
+#define _LINUX_TAGS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct tag_cpu_freelist;
+
+struct tag_pool {
+ unsigned watermark;
+ unsigned nr_tags;
+
+ struct tag_cpu_freelist *tag_cpu;
+
+ struct {
+ /* Global freelist */
+ unsigned nr_free;
+ unsigned *free;
+ spinlock_t lock;
+ struct list_head wait;
+ } ____cacheline_aligned;
+};
+
+unsigned tag_alloc(struct tag_pool *pool, bool wait);
+void tag_free(struct tag_pool *pool, unsigned tag);
+
+void tag_pool_free(struct tag_pool *pool);
+int tag_pool_init(struct tag_pool *pool, unsigned long nr_tags);
+
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index fe01d41..fa77e31 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -407,4 +407,7 @@ config OID_REGISTRY
config UCS2_STRING
tristate

+config PERCPU_TAG
+ bool
+
endmenu
diff --git a/lib/Makefile b/lib/Makefile
index 25a0ce1..c622107 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o percpu-refcount.o
+ earlycpio.o percpu-refcount.o tags.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/tags.c b/lib/tags.c
new file mode 100644
index 0000000..5c3de28
--- /dev/null
+++ b/lib/tags.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/tags.h>
+
+struct tag_cpu_freelist {
+ unsigned nr_free;
+ unsigned free[];
+};
+
+struct tag_waiter {
+ struct list_head list;
+ struct task_struct *task;
+};
+
+static inline void move_tags(unsigned *dst, unsigned *dst_nr,
+ unsigned *src, unsigned *src_nr,
+ unsigned nr)
+{
+ *src_nr -= nr;
+ memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
+ *dst_nr += nr;
+}
+
+unsigned tag_alloc(struct tag_pool *pool, bool wait)
+{
+ struct tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned ret;
+retry:
+ preempt_disable();
+ local_irq_save(flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+
+ while (!tags->nr_free) {
+ spin_lock(&pool->lock);
+
+ if (pool->nr_free)
+ move_tags(tags->free, &tags->nr_free,
+ pool->free, &pool->nr_free,
+ min(pool->nr_free, pool->watermark));
+ else if (wait) {
+ struct tag_waiter wait = { .task = current };
+
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ list_add(&wait.list, &pool->wait);
+
+ spin_unlock(&pool->lock);
+ local_irq_restore(flags);
+ preempt_enable();
+
+ schedule();
+ __set_current_state(TASK_RUNNING);
+
+ if (!list_empty_careful(&wait.list)) {
+ spin_lock_irqsave(&pool->lock, flags);
+ list_del_init(&wait.list);
+ spin_unlock_irqrestore(&pool->lock, flags);
+ }
+
+ goto retry;
+ } else
+ goto fail;
+
+ spin_unlock(&pool->lock);
+ }
+
+ ret = tags->free[--tags->nr_free];
+
+ local_irq_restore(flags);
+ preempt_enable();
+
+ return ret;
+fail:
+ local_irq_restore(flags);
+ preempt_enable();
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tag_alloc);
+
+void tag_free(struct tag_pool *pool, unsigned tag)
+{
+ struct tag_cpu_freelist *tags;
+ unsigned long flags;
+
+ preempt_disable();
+ local_irq_save(flags);
+ tags = this_cpu_ptr(pool->tag_cpu);
+
+ tags->free[tags->nr_free++] = tag;
+
+ if (tags->nr_free == pool->watermark * 2) {
+ spin_lock(&pool->lock);
+
+ move_tags(pool->free, &pool->nr_free,
+ tags->free, &tags->nr_free,
+ pool->watermark);
+
+ while (!list_empty(&pool->wait)) {
+ struct tag_waiter *wait;
+ wait = list_first_entry(&pool->wait,
+ struct tag_waiter, list);
+ list_del_init(&wait->list);
+ wake_up_process(wait->task);
+ }
+
+ spin_unlock(&pool->lock);
+ }
+
+ local_irq_restore(flags);
+ preempt_enable();
+}
+EXPORT_SYMBOL_GPL(tag_free);
+
+void tag_pool_free(struct tag_pool *pool)
+{
+ free_percpu(pool->tag_cpu);
+
+ free_pages((unsigned long) pool->free,
+ get_order(pool->nr_tags * sizeof(unsigned)));
+}
+EXPORT_SYMBOL_GPL(tag_pool_free);
+
+int tag_pool_init(struct tag_pool *pool, unsigned long nr_tags)
+{
+ unsigned i, order;
+
+ spin_lock_init(&pool->lock);
+ INIT_LIST_HEAD(&pool->wait);
+ pool->nr_tags = nr_tags;
+
+ /* Guard against overflow */
+ if (nr_tags > UINT_MAX)
+ return -ENOMEM;
+
+ order = get_order(nr_tags * sizeof(unsigned));
+ pool->free = (void *) __get_free_pages(GFP_KERNEL, order);
+ if (!pool->free)
+ return -ENOMEM;
+
+ for (i = 1; i < nr_tags; i++)
+ pool->free[pool->nr_free++] = i;
+
+ /* nr_possible_cpus would be more correct */
+ pool->watermark = nr_tags / (num_possible_cpus() * 4);
+
+ pool->watermark = min(pool->watermark, 128);
+
+ if (pool->watermark > 64)
+ pool->watermark = round_down(pool->watermark, 32);
+
+ pool->tag_cpu = __alloc_percpu(sizeof(struct tag_cpu_freelist) +
+ pool->watermark * 2 * sizeof(unsigned),
+ sizeof(unsigned));
+ if (!pool->tag_cpu)
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(tag_pool_init);
--
1.8.2.1

2013-05-14 01:21:47

by Kent Overstreet

Subject: [PATCH 16/21] mtip32xx: convert to batch completion

[[email protected]:
* changes for conversion to bio batch completion from Kent
* fix to apply the above changes cleanly on latest mtip32xx code
* batch bio completion changes in
* mtip_command_cleanup()
* mtip_timeout_function()
* mtip_handle_tfe()]
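
The conversion follows a pattern visible throughout the diff:
completions are accumulated into a batch and flushed once at the end,
instead of calling the old per-command async callback. A sketch of the
pattern, pulled out of the diff below:

static void complete_failed_bio_sketch(struct bio *bio, int err)
{
	struct batch_complete batch;

	batch_complete_init(&batch);

	/* any number of bios can be completed into the same batch */
	bio_endio_batch(bio, err, &batch);

	/* deliver all accumulated completions (and aio events) in one go */
	batch_complete(&batch);
}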

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Asai Thambi S P <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/mtip32xx/mtip32xx.c | 86 ++++++++++++++++++++++-----------------
drivers/block/mtip32xx/mtip32xx.h | 8 ++--
2 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 847107e..1262321 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -151,6 +151,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
struct mtip_cmd *command;
struct mtip_port *port = dd->port;
static int in_progress;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (in_progress)
return;
@@ -166,11 +169,9 @@ static void mtip_command_cleanup(struct driver_data *dd)
command = &port->commands[commandindex];

if (atomic_read(&command->active)
- && (command->async_callback)) {
- command->async_callback(command->async_data,
- -ENODEV);
- command->async_callback = NULL;
- command->async_data = NULL;
+ && (command->bio)) {
+ bio_endio_batch(command->bio, -ENODEV, &batch);
+ command->bio = NULL;
}

dma_unmap_sg(&port->dd->pdev->dev,
@@ -178,9 +179,10 @@ static void mtip_command_cleanup(struct driver_data *dd)
command->scatter_ents,
command->direction);
}
+ up(&port->cmd_slot);
}

- up(&port->cmd_slot);
+ batch_complete(&batch);

set_bit(MTIP_DDF_CLEANUP_BIT, &dd->dd_flag);
in_progress = 0;
@@ -580,6 +582,9 @@ static void mtip_timeout_function(unsigned long int data)
unsigned int bit, group;
unsigned int num_command_slots;
unsigned long to, tagaccum[SLOTBITS_IN_LONGS];
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (unlikely(!port))
return;
@@ -622,11 +627,9 @@ static void mtip_timeout_function(unsigned long int data)
writel(1 << bit, port->completed[group]);

/* Call the async completion callback. */
- if (likely(command->async_callback))
- command->async_callback(command->async_data,
- -EIO);
- command->async_callback = NULL;
- command->comp_func = NULL;
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, -EIO, &batch);
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&port->dd->pdev->dev,
@@ -645,6 +648,8 @@ static void mtip_timeout_function(unsigned long int data)
}
}

+ batch_complete(&batch);
+
if (cmdto_cnt) {
print_tags(port->dd, "timed out", tagaccum, cmdto_cnt);
if (!test_bit(MTIP_PF_IC_ACTIVE_BIT, &port->flags)) {
@@ -695,7 +700,8 @@ static void mtip_timeout_function(unsigned long int data)
static void mtip_async_complete(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
struct mtip_cmd *command;
struct driver_data *dd = data;
@@ -712,11 +718,10 @@ static void mtip_async_complete(struct mtip_port *port,
}

/* Upper layer callback */
- if (likely(command->async_callback))
- command->async_callback(command->async_data, cb_status);
+ if (likely(command->bio))
+ bio_endio_batch(command->bio, cb_status, batch);

- command->async_callback = NULL;
- command->comp_func = NULL;
+ command->bio = NULL;

/* Unmap the DMA scatter list entries */
dma_unmap_sg(&dd->pdev->dev,
@@ -752,24 +757,22 @@ static void mtip_async_complete(struct mtip_port *port,
static void mtip_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
- struct mtip_cmd *command = &port->commands[tag];
struct completion *waiting = data;
if (unlikely(status == PORT_IRQ_TF_ERR))
dev_warn(&port->dd->pdev->dev,
"Internal command %d completed with TFE\n", tag);

- command->async_callback = NULL;
- command->comp_func = NULL;
-
complete(waiting);
}

static void mtip_null_completion(struct mtip_port *port,
int tag,
void *data,
- int status)
+ int status,
+ struct batch_complete *batch)
{
return;
}
@@ -798,6 +801,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
unsigned char *buf;
char *fail_reason = NULL;
int fail_all_ncq_write = 0, fail_all_ncq_cmds = 0;
+ struct batch_complete batch;

dev_warn(&dd->pdev->dev, "Taskfile error\n");

@@ -815,13 +819,14 @@ static void mtip_handle_tfe(struct driver_data *dd)
atomic_inc(&cmd->active); /* active > 1 indicates error */
if (cmd->comp_data && cmd->comp_func) {
cmd->comp_func(port, MTIP_TAG_INTERNAL,
- cmd->comp_data, PORT_IRQ_TF_ERR);
+ cmd->comp_data, PORT_IRQ_TF_ERR, NULL);
}
goto handle_tfe_exit;
}

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -848,7 +853,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
cmd->comp_func(port,
tag,
cmd->comp_data,
- 0);
+ 0, &batch);
} else {
dev_err(&port->dd->pdev->dev,
"Missing completion func for tag %d",
@@ -861,6 +866,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
}
}
}
+ batch_complete(&batch);

print_tags(dd, "completed (TFE)", tagaccum, cmd_cnt);

@@ -902,6 +908,7 @@ static void mtip_handle_tfe(struct driver_data *dd)

/* clear the tag accumulator */
memset(tagaccum, 0, SLOTBITS_IN_LONGS * sizeof(long));
+ batch_complete_init(&batch);

/* Loop through all the groups */
for (group = 0; group < dd->slot_groups; group++) {
@@ -935,7 +942,7 @@ static void mtip_handle_tfe(struct driver_data *dd)
if (cmd->comp_func) {
cmd->comp_func(port, tag,
cmd->comp_data,
- -ENODATA);
+ -ENODATA, &batch);
}
continue;
}
@@ -965,13 +972,15 @@ static void mtip_handle_tfe(struct driver_data *dd)
port,
tag,
cmd->comp_data,
- PORT_IRQ_TF_ERR);
+ PORT_IRQ_TF_ERR, &batch);
else
dev_warn(&port->dd->pdev->dev,
"Bad completion for tag %d\n",
tag);
}
}
+
+ batch_complete(&batch);
print_tags(dd, "reissued (TFE)", tagaccum, cmd_cnt);

handle_tfe_exit:
@@ -992,6 +1001,9 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
struct driver_data *dd = port->dd;
int tag, bit;
struct mtip_cmd *command;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

if (!completed) {
WARN_ON_ONCE(!completed);
@@ -1016,7 +1028,8 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
port,
tag,
command->comp_data,
- 0);
+ 0,
+ &batch);
} else {
dev_warn(&dd->pdev->dev,
"Null completion "
@@ -1026,13 +1039,16 @@ static inline void mtip_workq_sdbfx(struct mtip_port *port, int group,
if (mtip_check_surprise_removal(
dd->pdev)) {
mtip_command_cleanup(dd);
- return;
+ goto out;
}
}
}
completed >>= 1;
}

+out:
+ batch_complete(&batch);
+
/* If last, re-enable interrupts */
if (atomic_dec_return(&dd->irq_workers_active) == 0)
writel(0xffffffff, dd->mmio + HOST_IRQ_STAT);
@@ -1053,7 +1069,7 @@ static inline void mtip_process_legacy(struct driver_data *dd, u32 port_stat)
cmd->comp_func(port,
MTIP_TAG_INTERNAL,
cmd->comp_data,
- 0);
+ 0, NULL);
return;
}
}
@@ -2561,8 +2577,8 @@ static int mtip_hw_ioctl(struct driver_data *dd, unsigned int cmd,
* None
*/
static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
- int nsect, int nents, int tag, void *callback,
- void *data, int dir, int unaligned)
+ int nsect, int nents, int tag,
+ struct bio *bio, int dir, int unaligned)
{
struct host_to_dev_fis *fis;
struct mtip_port *port = dd->port;
@@ -2621,12 +2637,7 @@ static void mtip_hw_submit_io(struct driver_data *dd, sector_t sector,
command->comp_func = mtip_async_complete;
command->direction = dma_dir;

- /*
- * Set the completion function and data for the command passed
- * from the upper layer.
- */
- command->async_data = data;
- command->async_callback = callback;
+ command->bio = bio;

/*
* To prevent this command from being issued
@@ -3934,7 +3945,6 @@ static void mtip_make_request(struct request_queue *queue, struct bio *bio)
bio_sectors(bio),
nents,
tag,
- bio_endio,
bio,
bio_data_dir(bio),
unaligned);
diff --git a/drivers/block/mtip32xx/mtip32xx.h b/drivers/block/mtip32xx/mtip32xx.h
index 3bb8a29..7a2ddfd 100644
--- a/drivers/block/mtip32xx/mtip32xx.h
+++ b/drivers/block/mtip32xx/mtip32xx.h
@@ -328,11 +328,9 @@ struct mtip_cmd {
void (*comp_func)(struct mtip_port *port,
int tag,
void *data,
- int status);
- /* Additional callback function that may be called by comp_func() */
- void (*async_callback)(void *data, int status);
-
- void *async_data; /* Addl. data passed to async_callback() */
+ int status,
+ struct batch_complete *batch);
+ struct bio *bio;

int scatter_ents; /* Number of scatter list entries used */

--
1.8.2.1

2013-05-14 01:22:08

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 13/21] block: prep work for batch completion

Add a struct batch_complete * argument to bi_end_io; infrastructure to
make use of it comes in the next patch.
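
For illustration only (not part of the patch): a converted handler just
grows the extra parameter, and handlers that do not batch can ignore it,
since plain bio_endio() passes NULL for it.  The my_end_io/my_io names
below are made up; everything else comes from this series (struct
batch_complete itself is introduced by these patches):

#include <linux/bio.h>
#include <linux/completion.h>

/* hypothetical per-request state, for the example only */
struct my_io {
	struct completion done;
};

/*
 * bi_end_io callbacks now look like:
 *	void (*bi_end_io)(struct bio *, int, struct batch_complete *);
 */
static void my_end_io(struct bio *bio, int error,
		      struct batch_complete *batch)
{
	struct my_io *io = bio->bi_private;

	/* @batch is NULL when the bio is completed via plain bio_endio() */
	if (error)
		clear_bit(BIO_UPTODATE, &bio->bi_flags);

	complete(&io->done);
	bio_put(bio);
}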

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
block/blk-flush.c | 3 ++-
block/blk-lib.c | 3 ++-
drivers/block/drbd/drbd_bitmap.c | 3 ++-
drivers/block/drbd/drbd_worker.c | 9 ++++++---
drivers/block/drbd/drbd_wrappers.h | 9 ++++++---
drivers/block/floppy.c | 3 ++-
drivers/block/pktcdvd.c | 9 ++++++---
drivers/block/xen-blkback/blkback.c | 3 ++-
drivers/md/bcache/alloc.c | 3 ++-
drivers/md/bcache/btree.c | 3 ++-
drivers/md/bcache/debug.c | 3 ++-
drivers/md/bcache/io.c | 6 ++++--
drivers/md/bcache/journal.c | 9 ++++++---
drivers/md/bcache/movinggc.c | 3 ++-
drivers/md/bcache/request.c | 9 ++++++---
drivers/md/bcache/request.h | 3 +--
drivers/md/bcache/super.c | 11 +++++++----
drivers/md/bcache/writeback.c | 8 +++++---
drivers/md/dm-bufio.c | 9 +++++----
drivers/md/dm-cache-target.c | 3 ++-
drivers/md/dm-crypt.c | 3 ++-
drivers/md/dm-io.c | 2 +-
drivers/md/dm-snap.c | 3 ++-
drivers/md/dm-thin.c | 3 ++-
drivers/md/dm-verity.c | 3 ++-
drivers/md/dm.c | 6 ++++--
drivers/md/faulty.c | 3 ++-
drivers/md/md.c | 9 ++++++---
drivers/md/multipath.c | 3 ++-
drivers/md/raid1.c | 12 ++++++++----
drivers/md/raid10.c | 18 ++++++++++++------
drivers/md/raid5.c | 15 ++++++++++-----
drivers/target/target_core_iblock.c | 6 ++++--
drivers/target/target_core_pscsi.c | 3 ++-
fs/bio-integrity.c | 3 ++-
fs/bio.c | 17 +++++++++++------
fs/btrfs/check-integrity.c | 14 +++++++++-----
fs/btrfs/compression.c | 6 ++++--
fs/btrfs/disk-io.c | 6 ++++--
fs/btrfs/extent_io.c | 12 ++++++++----
fs/btrfs/inode.c | 13 ++++++++-----
fs/btrfs/raid56.c | 9 ++++++---
fs/btrfs/scrub.c | 18 ++++++++++++------
fs/btrfs/volumes.c | 5 +++--
fs/buffer.c | 3 ++-
fs/direct-io.c | 9 +++------
fs/ext4/page-io.c | 3 ++-
fs/f2fs/data.c | 2 +-
fs/f2fs/segment.c | 3 ++-
fs/gfs2/lops.c | 3 ++-
fs/gfs2/ops_fstype.c | 3 ++-
fs/hfsplus/wrapper.c | 3 ++-
fs/jfs/jfs_logmgr.c | 4 ++--
fs/jfs/jfs_metapage.c | 6 ++++--
fs/logfs/dev_bdev.c | 8 +++++---
fs/mpage.c | 2 +-
fs/nfs/blocklayout/blocklayout.c | 17 ++++++++++-------
fs/nilfs2/segbuf.c | 3 ++-
fs/ocfs2/cluster/heartbeat.c | 4 ++--
fs/xfs/xfs_aops.c | 3 ++-
fs/xfs/xfs_buf.c | 3 ++-
include/linux/bio.h | 2 +-
include/linux/blk_types.h | 3 ++-
include/linux/fs.h | 2 +-
include/linux/swap.h | 9 ++++++---
mm/bounce.c | 12 ++++++++----
mm/page_io.c | 8 +++++---
67 files changed, 267 insertions(+), 152 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index cc2b827..762cfca 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -384,7 +384,8 @@ void blk_abort_flushes(struct request_queue *q)
}
}

-static void bio_end_flush(struct bio *bio, int err)
+static void bio_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/block/blk-lib.c b/block/blk-lib.c
index d6f50d5..279f9de 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -15,7 +15,8 @@ struct bio_batch {
struct completion *wait;
};

-static void bio_batch_end_io(struct bio *bio, int err)
+static void bio_batch_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_batch *bb = bio->bi_private;

diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 64fbb83..046aa17 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -948,7 +948,8 @@ static void bm_aio_ctx_destroy(struct kref *kref)
}

/* bv_page may be a copy, or may be the original */
-static void bm_async_io_complete(struct bio *bio, int error)
+static void bm_async_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bm_aio_ctx *ctx = bio->bi_private;
struct drbd_conf *mdev = ctx->mdev;
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 891c0ec..04a80af 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -64,7 +64,8 @@ rwlock_t global_state_lock;
/* used for synchronous meta data and bitmap IO
* submitted by drbd_md_sync_page_io()
*/
-void drbd_md_io_complete(struct bio *bio, int error)
+void drbd_md_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct drbd_md_io *md_io;
struct drbd_conf *mdev;
@@ -167,7 +168,8 @@ static void drbd_endio_write_sec_final(struct drbd_peer_request *peer_req) __rel
/* writes on behalf of the partner, or resync writes,
* "submitted" by the receiver.
*/
-void drbd_peer_request_endio(struct bio *bio, int error)
+void drbd_peer_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct drbd_peer_request *peer_req = bio->bi_private;
struct drbd_conf *mdev = peer_req->w.mdev;
@@ -203,7 +205,8 @@ void drbd_peer_request_endio(struct bio *bio, int error)

/* read, readA or write requests on R_PRIMARY coming from drbd_make_request
*/
-void drbd_request_endio(struct bio *bio, int error)
+void drbd_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
unsigned long flags;
struct drbd_request *req = bio->bi_private;
diff --git a/drivers/block/drbd/drbd_wrappers.h b/drivers/block/drbd/drbd_wrappers.h
index 328f18e..d443dc0 100644
--- a/drivers/block/drbd/drbd_wrappers.h
+++ b/drivers/block/drbd/drbd_wrappers.h
@@ -20,9 +20,12 @@ static inline void drbd_set_my_capacity(struct drbd_conf *mdev,
#define drbd_bio_uptodate(bio) bio_flagged(bio, BIO_UPTODATE)

/* bi_end_io handlers */
-extern void drbd_md_io_complete(struct bio *bio, int error);
-extern void drbd_peer_request_endio(struct bio *bio, int error);
-extern void drbd_request_endio(struct bio *bio, int error);
+extern void drbd_md_io_complete(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_peer_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);
+extern void drbd_request_endio(struct bio *bio, int error,
+ struct batch_complete *batch);

/*
* used to submit our private bio
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index 04ceb7e..d528753 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -3746,7 +3746,8 @@ static unsigned int floppy_check_events(struct gendisk *disk,
* a disk in the drive, and whether that disk is writable.
*/

-static void floppy_rb0_complete(struct bio *bio, int err)
+static void floppy_rb0_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 3c08983..898fa74 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -980,7 +980,8 @@ static void pkt_make_local_copy(struct packet_data *pkt, struct bio_vec *bvec)
}
}

-static void pkt_end_io_read(struct bio *bio, int err)
+static void pkt_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -998,7 +999,8 @@ static void pkt_end_io_read(struct bio *bio, int err)
pkt_bio_finished(pd);
}

-static void pkt_end_io_packet_write(struct bio *bio, int err)
+static void pkt_end_io_packet_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_data *pkt = bio->bi_private;
struct pktcdvd_device *pd = pkt->pd;
@@ -2337,7 +2339,8 @@ static void pkt_close(struct gendisk *disk, fmode_t mode)
}


-static void pkt_end_io_read_cloned(struct bio *bio, int err)
+static void pkt_end_io_read_cloned(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct packet_stacked_data *psd = bio->bi_private;
struct pktcdvd_device *pd = psd->pd;
diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
index dd5b2fe..990c1d8 100644
--- a/drivers/block/xen-blkback/blkback.c
+++ b/drivers/block/xen-blkback/blkback.c
@@ -741,7 +741,8 @@ static void __end_block_io_op(struct pending_req *pending_req, int error)
/*
* bio callback.
*/
-static void end_block_io_op(struct bio *bio, int error)
+static void end_block_io_op(struct bio *bio, int error,
+ struct batch_complete *batch)
{
__end_block_io_op(bio->bi_private, error);
bio_put(bio);
diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
index 048f294..1f75edd 100644
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -156,7 +156,8 @@ static void discard_finish(struct work_struct *w)
closure_put(&ca->set->cl);
}

-static void discard_endio(struct bio *bio, int error)
+static void discard_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct discard *d = container_of(bio, struct discard, bio);
schedule_work(&d->work);
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 7a5658f..36688d6 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -134,7 +134,8 @@ static uint64_t btree_csum_set(struct btree *b, struct bset *i)
return crc ^ 0xffffffffffffffffULL;
}

-static void btree_bio_endio(struct bio *bio, int error)
+static void btree_bio_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct btree *b = container_of(cl, struct btree, io.cl);
diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 89fd520..3a32b06 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -177,7 +177,8 @@ void bch_btree_verify(struct btree *b, struct bset *new)
mutex_unlock(&b->c->verify_lock);
}

-static void data_verify_endio(struct bio *bio, int error)
+static void data_verify_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
closure_put(cl);
diff --git a/drivers/md/bcache/io.c b/drivers/md/bcache/io.c
index 48efd4d..29f344b 100644
--- a/drivers/md/bcache/io.c
+++ b/drivers/md/bcache/io.c
@@ -9,7 +9,8 @@
#include "bset.h"
#include "debug.h"

-static void bch_bi_idx_hack_endio(struct bio *bio, int error)
+static void bch_bi_idx_hack_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio *p = bio->bi_private;

@@ -206,7 +207,8 @@ static void bch_bio_submit_split_done(struct closure *cl)
mempool_free(s, s->p->bio_split_hook);
}

-static void bch_bio_submit_split_endio(struct bio *bio, int error)
+static void bch_bio_submit_split_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct bio_split_hook *s = container_of(cl, struct bio_split_hook, cl);
diff --git a/drivers/md/bcache/journal.c b/drivers/md/bcache/journal.c
index 8c8dfdc..bff194b 100644
--- a/drivers/md/bcache/journal.c
+++ b/drivers/md/bcache/journal.c
@@ -22,7 +22,8 @@
* bit.
*/

-static void journal_read_endio(struct bio *bio, int error)
+static void journal_read_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
closure_put(cl);
@@ -390,7 +391,8 @@ found:

#define last_seq(j) ((j)->seq - fifo_used(&(j)->pin) + 1)

-static void journal_discard_endio(struct bio *bio, int error)
+static void journal_discard_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct journal_device *ja =
container_of(bio, struct journal_device, discard_bio);
@@ -535,7 +537,8 @@ void bch_journal_next(struct journal *j)
pr_debug("journal_pin full (%zu)", fifo_used(&j->pin));
}

-static void journal_write_endio(struct bio *bio, int error)
+static void journal_write_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct journal_write *w = bio->bi_private;

diff --git a/drivers/md/bcache/movinggc.c b/drivers/md/bcache/movinggc.c
index 8589512..8bf7ae1 100644
--- a/drivers/md/bcache/movinggc.c
+++ b/drivers/md/bcache/movinggc.c
@@ -61,7 +61,8 @@ static void write_moving_finish(struct closure *cl)
closure_return_with_destructor(cl, moving_io_destructor);
}

-static void read_moving_endio(struct bio *bio, int error)
+static void read_moving_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct moving_io *io = container_of(bio->bi_private,
struct moving_io, s.cl);
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index e5ff12e..bc837ed 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -456,7 +456,8 @@ static void bch_insert_data_error(struct closure *cl)
bch_journal(cl);
}

-static void bch_insert_data_endio(struct bio *bio, int error)
+static void bch_insert_data_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct btree_op *op = container_of(cl, struct btree_op, cl);
@@ -621,7 +622,8 @@ void bch_btree_insert_async(struct closure *cl)

/* Common code for the make_request functions */

-static void request_endio(struct bio *bio, int error)
+static void request_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;

@@ -636,7 +638,8 @@ static void request_endio(struct bio *bio, int error)
closure_put(cl);
}

-void bch_cache_read_endio(struct bio *bio, int error)
+void bch_cache_read_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bbio *b = container_of(bio, struct bbio, bio);
struct closure *cl = bio->bi_private;
diff --git a/drivers/md/bcache/request.h b/drivers/md/bcache/request.h
index 254d9ab..3b79462 100644
--- a/drivers/md/bcache/request.h
+++ b/drivers/md/bcache/request.h
@@ -29,11 +29,10 @@ struct search {
struct btree_op op;
};

-void bch_cache_read_endio(struct bio *, int);
+void bch_cache_read_endio(struct bio *, int, struct batch_complete *batch);
int bch_get_congested(struct cache_set *);
void bch_insert_data(struct closure *cl);
void bch_btree_insert_async(struct closure *);
-void bch_cache_read_endio(struct bio *, int);

void bch_open_buckets_free(struct cache_set *);
int bch_open_buckets_alloc(struct cache_set *);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index c8046bc..76c7f6c 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -224,7 +224,8 @@ err:
return err;
}

-static void write_bdev_super_endio(struct bio *bio, int error)
+static void write_bdev_super_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cached_dev *dc = bio->bi_private;
/* XXX: error checking */
@@ -285,7 +286,8 @@ void bch_write_bdev_super(struct cached_dev *dc, struct closure *parent)
closure_return(cl);
}

-static void write_super_endio(struct bio *bio, int error)
+static void write_super_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cache *ca = bio->bi_private;

@@ -326,7 +328,7 @@ void bcache_write_super(struct cache_set *c)

/* UUID io */

-static void uuid_endio(struct bio *bio, int error)
+static void uuid_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct closure *cl = bio->bi_private;
struct cache_set *c = container_of(cl, struct cache_set, uuid_write.cl);
@@ -490,7 +492,8 @@ static struct uuid_entry *uuid_find_empty(struct cache_set *c)
* disk.
*/

-static void prio_endio(struct bio *bio, int error)
+static void prio_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct cache *ca = bio->bi_private;

diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
index 93e7e31..daf9347 100644
--- a/drivers/md/bcache/writeback.c
+++ b/drivers/md/bcache/writeback.c
@@ -253,7 +253,8 @@ static void write_dirty_finish(struct closure *cl)
closure_return_with_destructor(cl, dirty_io_destructor);
}

-static void dirty_endio(struct bio *bio, int error)
+static void dirty_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
@@ -281,7 +282,8 @@ static void write_dirty(struct closure *cl)
continue_at(cl, write_dirty_finish, dirty_wq);
}

-static void read_dirty_endio(struct bio *bio, int error)
+static void read_dirty_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct keybuf_key *w = bio->bi_private;
struct dirty_io *io = w->private;
@@ -289,7 +291,7 @@ static void read_dirty_endio(struct bio *bio, int error)
bch_count_io_errors(PTR_CACHE(io->dc->disk.c, &w->key, 0),
error, "reading dirty data from cache");

- dirty_endio(bio, error);
+ dirty_endio(bio, error, NULL);
}

static void read_dirty_submit(struct closure *cl)
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 0387e05..d489dfd 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -494,7 +494,7 @@ static void dmio_complete(unsigned long error, void *context)
{
struct dm_buffer *b = context;

- b->bio.bi_end_io(&b->bio, error ? -EIO : 0);
+ b->bio.bi_end_io(&b->bio, error ? -EIO : 0, NULL);
}

static void use_dmio(struct dm_buffer *b, int rw, sector_t block,
@@ -525,7 +525,7 @@ static void use_dmio(struct dm_buffer *b, int rw, sector_t block,

r = dm_io(&io_req, 1, &region, NULL);
if (r)
- end_io(&b->bio, r);
+ end_io(&b->bio, r, NULL);
}

static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
@@ -592,7 +592,8 @@ static void submit_io(struct dm_buffer *b, int rw, sector_t block,
* Set the error, clear B_WRITING bit and wake anyone who was waiting on
* it.
*/
-static void write_endio(struct bio *bio, int error)
+static void write_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

@@ -965,7 +966,7 @@ found_buffer:
* The endio routine for reading: set the error, clear the bit and wake up
* anyone waiting on the buffer.
*/
-static void read_endio(struct bio *bio, int error)
+static void read_endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);

diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index df44b60..53fb7b2 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -653,7 +653,8 @@ static void defer_writethrough_bio(struct cache *cache, struct bio *bio)
wake_worker(cache);
}

-static void writethrough_endio(struct bio *bio, int err)
+static void writethrough_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct per_bio_data *pb = get_per_bio_data(bio, PB_DATA_SIZE_WT);
bio->bi_end_io = pb->saved_bi_end_io;
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 6d2d41a..ec0e3c0 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -929,7 +929,8 @@ static void crypt_dec_pending(struct dm_crypt_io *io)
* The work is done per CPU global for all dm-crypt instances.
* They should not depend on each other and do not block.
*/
-static void crypt_endio(struct bio *clone, int error)
+static void crypt_endio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_crypt_io *io = clone->bi_private;
struct crypt_config *cc = io->cc;
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index ea49834..a727b26 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -136,7 +136,7 @@ static void dec_count(struct io *io, unsigned int region, int error)
}
}

-static void endio(struct bio *bio, int error)
+static void endio(struct bio *bio, int error, struct batch_complete *batch)
{
struct io *io;
unsigned region;
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index c434e5a..fb3ea3c 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -1486,7 +1486,8 @@ static void start_copy(struct dm_snap_pending_exception *pe)
dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0, copy_callback, pe);
}

-static void full_bio_end_io(struct bio *bio, int error)
+static void full_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
void *callback_data = bio->bi_private;

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 759cffc..0390a03 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -553,7 +553,8 @@ static void copy_complete(int read_err, unsigned long write_err, void *context)
spin_unlock_irqrestore(&pool->lock, flags);
}

-static void overwrite_endio(struct bio *bio, int err)
+static void overwrite_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
unsigned long flags;
struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
diff --git a/drivers/md/dm-verity.c b/drivers/md/dm-verity.c
index b948fd8..b373bb7 100644
--- a/drivers/md/dm-verity.c
+++ b/drivers/md/dm-verity.c
@@ -413,7 +413,8 @@ static void verity_work(struct work_struct *w)
verity_finish_io(io, verity_verify_io(io));
}

-static void verity_end_io(struct bio *bio, int error)
+static void verity_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dm_verity_io *io = bio->bi_private;

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d5370a9..9101124 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -615,7 +615,8 @@ static void dec_pending(struct dm_io *io, int error)
}
}

-static void clone_endio(struct bio *bio, int error)
+static void clone_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int r = 0;
struct dm_target_io *tio = bio->bi_private;
@@ -650,7 +651,8 @@ static void clone_endio(struct bio *bio, int error)
/*
* Partial completion handling for request-based dm
*/
-static void end_clone_bio(struct bio *clone, int error)
+static void end_clone_bio(struct bio *clone, int error,
+ struct batch_complete *batch)
{
struct dm_rq_clone_bio_info *info = clone->bi_private;
struct dm_rq_target_io *tio = info->tio;
diff --git a/drivers/md/faulty.c b/drivers/md/faulty.c
index 3193aef..ac8af52 100644
--- a/drivers/md/faulty.c
+++ b/drivers/md/faulty.c
@@ -70,7 +70,8 @@
#include <linux/seq_file.h>


-static void faulty_fail(struct bio *bio, int error)
+static void faulty_fail(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio *b = bio->bi_private;

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 681d109..9a02686 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -379,7 +379,8 @@ EXPORT_SYMBOL(mddev_congested);
* Generic flush handling for md
*/

-static void md_end_flush(struct bio *bio, int err)
+static void md_end_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -756,7 +757,8 @@ void md_rdev_clear(struct md_rdev *rdev)
}
EXPORT_SYMBOL_GPL(md_rdev_clear);

-static void super_written(struct bio *bio, int error)
+static void super_written(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct md_rdev *rdev = bio->bi_private;
struct mddev *mddev = rdev->mddev;
@@ -807,7 +809,8 @@ void md_super_wait(struct mddev *mddev)
finish_wait(&mddev->sb_wait, &wq);
}

-static void bi_complete(struct bio *bio, int error)
+static void bi_complete(struct bio *bio, int error,
+ struct batch_complete *batch)
{
complete((struct completion*)bio->bi_private);
}
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 1642eae..fecad70 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -83,7 +83,8 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
mempool_free(mp_bh, conf->pool);
}

-static void multipath_end_request(struct bio *bio, int error)
+static void multipath_end_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct multipath_bh *mp_bh = bio->bi_private;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5595118..d55d0d9 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -294,7 +294,8 @@ static int find_bio_disk(struct r1bio *r1_bio, struct bio *bio)
return mirror;
}

-static void raid1_end_read_request(struct bio *bio, int error)
+static void raid1_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -379,7 +380,8 @@ static void r1_bio_write_done(struct r1bio *r1_bio)
}
}

-static void raid1_end_write_request(struct bio *bio, int error)
+static void raid1_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
@@ -1612,7 +1614,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r1bio *r1_bio = bio->bi_private;

@@ -1630,7 +1633,8 @@ static void end_sync_read(struct bio *bio, int error)
reschedule_retry(r1_bio);
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r1bio *r1_bio = bio->bi_private;
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 59d4daa..6c63406 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -101,7 +101,8 @@ static int enough(struct r10conf *conf, int ignore);
static sector_t reshape_request(struct mddev *mddev, sector_t sector_nr,
int *skipped);
static void reshape_request_write(struct mddev *mddev, struct r10bio *r10_bio);
-static void end_reshape_write(struct bio *bio, int error);
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch);
static void end_reshape(struct r10conf *conf);

static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data)
@@ -358,7 +359,8 @@ static int find_bio_disk(struct r10conf *conf, struct r10bio *r10_bio,
return r10_bio->devs[slot].devnum;
}

-static void raid10_end_read_request(struct bio *bio, int error)
+static void raid10_end_read_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -441,7 +443,8 @@ static void one_write_done(struct r10bio *r10_bio)
}
}

-static void raid10_end_write_request(struct bio *bio, int error)
+static void raid10_end_write_request(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -1912,7 +1915,8 @@ abort:
}


-static void end_sync_read(struct bio *bio, int error)
+static void end_sync_read(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct r10bio *r10_bio = bio->bi_private;
struct r10conf *conf = r10_bio->mddev->private;
@@ -1973,7 +1977,8 @@ static void end_sync_request(struct r10bio *r10_bio)
}
}

-static void end_sync_write(struct bio *bio, int error)
+static void end_sync_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
@@ -4598,7 +4603,8 @@ static int handle_reshape_read_error(struct mddev *mddev,
return 0;
}

-static void end_reshape_write(struct bio *bio, int error)
+static void end_reshape_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct r10bio *r10_bio = bio->bi_private;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9359828..d014b66 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -532,9 +532,11 @@ static int use_new_offset(struct r5conf *conf, struct stripe_head *sh)
}

static void
-raid5_end_read_request(struct bio *bi, int error);
+raid5_end_read_request(struct bio *bi, int error,
+ struct batch_complete *batch);
static void
-raid5_end_write_request(struct bio *bi, int error);
+raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch);

static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
{
@@ -1713,7 +1715,8 @@ static void shrink_stripes(struct r5conf *conf)
conf->slab_cache = NULL;
}

-static void raid5_end_read_request(struct bio * bi, int error)
+static void raid5_end_read_request(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -1833,7 +1836,8 @@ static void raid5_end_read_request(struct bio * bi, int error)
release_stripe(sh);
}

-static void raid5_end_write_request(struct bio *bi, int error)
+static void raid5_end_write_request(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct stripe_head *sh = bi->bi_private;
struct r5conf *conf = sh->raid_conf;
@@ -3904,7 +3908,8 @@ static struct bio *remove_bio_from_retry(struct r5conf *conf)
* first).
* If the read failed..
*/
-static void raid5_align_endio(struct bio *bi, int error)
+static void raid5_align_endio(struct bio *bi, int error,
+ struct batch_complete *batch)
{
struct bio* raid_bi = bi->bi_private;
struct mddev *mddev;
diff --git a/drivers/target/target_core_iblock.c b/drivers/target/target_core_iblock.c
index 07f5f94..4e842cb 100644
--- a/drivers/target/target_core_iblock.c
+++ b/drivers/target/target_core_iblock.c
@@ -271,7 +271,8 @@ static void iblock_complete_cmd(struct se_cmd *cmd)
kfree(ibr);
}

-static void iblock_bio_done(struct bio *bio, int err)
+static void iblock_bio_done(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;
struct iblock_req *ibr = cmd->priv;
@@ -335,7 +336,8 @@ static void iblock_submit_bios(struct bio_list *list, int rw)
blk_finish_plug(&plug);
}

-static void iblock_end_io_flush(struct bio *bio, int err)
+static void iblock_end_io_flush(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct se_cmd *cmd = bio->bi_private;

diff --git a/drivers/target/target_core_pscsi.c b/drivers/target/target_core_pscsi.c
index e992b27..1e98731 100644
--- a/drivers/target/target_core_pscsi.c
+++ b/drivers/target/target_core_pscsi.c
@@ -835,7 +835,8 @@ static ssize_t pscsi_show_configfs_dev_params(struct se_device *dev, char *b)
return bl;
}

-static void pscsi_bi_endio(struct bio *bio, int error)
+static void pscsi_bi_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
bio_put(bio);
}
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 8fb4291..69f6f80 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -510,7 +510,8 @@ static void bio_integrity_verify_fn(struct work_struct *work)
* in process context. This function postpones completion
* accordingly.
*/
-void bio_integrity_endio(struct bio *bio, int error)
+void bio_integrity_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct bio_integrity_payload *bip = bio->bi_integrity;

diff --git a/fs/bio.c b/fs/bio.c
index 94bbc04..e082907 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -760,7 +760,8 @@ struct submit_bio_ret {
int error;
};

-static void submit_bio_wait_endio(struct bio *bio, int error)
+static void submit_bio_wait_endio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct submit_bio_ret *ret = bio->bi_private;

@@ -1414,7 +1415,8 @@ void bio_unmap_user(struct bio *bio)
}
EXPORT_SYMBOL(bio_unmap_user);

-static void bio_map_kern_endio(struct bio *bio, int err)
+static void bio_map_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bio_put(bio);
}
@@ -1486,7 +1488,8 @@ struct bio *bio_map_kern(struct request_queue *q, void *data, unsigned int len,
}
EXPORT_SYMBOL(bio_map_kern);

-static void bio_copy_kern_endio(struct bio *bio, int err)
+static void bio_copy_kern_endio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec;
const int read = bio_data_dir(bio) == READ;
@@ -1707,7 +1710,7 @@ void bio_endio(struct bio *bio, int error)
error = -EIO;

if (bio->bi_end_io)
- bio->bi_end_io(bio, error);
+ bio->bi_end_io(bio, error, NULL);
}
EXPORT_SYMBOL(bio_endio);

@@ -1722,7 +1725,8 @@ void bio_pair_release(struct bio_pair *bp)
}
EXPORT_SYMBOL(bio_pair_release);

-static void bio_pair_end_1(struct bio *bi, int err)
+static void bio_pair_end_1(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio1);

@@ -1732,7 +1736,8 @@ static void bio_pair_end_1(struct bio *bi, int err)
bio_pair_release(bp);
}

-static void bio_pair_end_2(struct bio *bi, int err)
+static void bio_pair_end_2(struct bio *bi, int err,
+ struct batch_complete *batch)
{
struct bio_pair *bp = container_of(bi, struct bio_pair, bio2);

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 18af6f4..3c617b3 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -323,7 +323,8 @@ static void btrfsic_release_block_ctx(struct btrfsic_block_data_ctx *block_ctx);
static int btrfsic_read_block(struct btrfsic_state *state,
struct btrfsic_block_data_ctx *block_ctx);
static void btrfsic_dump_database(struct btrfsic_state *state);
-static void btrfsic_complete_bio_end_io(struct bio *bio, int err);
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int btrfsic_test_for_metadata(struct btrfsic_state *state,
char **datav, unsigned int num_pages);
static void btrfsic_process_written_block(struct btrfsic_dev_state *dev_state,
@@ -336,7 +337,8 @@ static int btrfsic_process_written_superblock(
struct btrfsic_state *state,
struct btrfsic_block *const block,
struct btrfs_super_block *const super_hdr);
-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status);
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch);
static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate);
static int btrfsic_is_block_ref_by_superblock(const struct btrfsic_state *state,
const struct btrfsic_block *block,
@@ -1751,7 +1753,8 @@ static int btrfsic_read_block(struct btrfsic_state *state,
return block_ctx->len;
}

-static void btrfsic_complete_bio_end_io(struct bio *bio, int err)
+static void btrfsic_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -2294,7 +2297,8 @@ continue_loop:
goto again;
}

-static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
+static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status,
+ struct batch_complete *batch)
{
struct btrfsic_block *block = (struct btrfsic_block *)bp->bi_private;
int iodone_w_error;
@@ -2342,7 +2346,7 @@ static void btrfsic_bio_end_io(struct bio *bp, int bio_error_status)
block = next_block;
} while (NULL != block);

- bp->bi_end_io(bp, bio_error_status);
+ bp->bi_end_io(bp, bio_error_status, batch);
}

static void btrfsic_bh_end_io(struct buffer_head *bh, int uptodate)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b189bd1..2298567 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -156,7 +156,8 @@ fail:
* The compressed pages are freed here, and it must be run
* in process context
*/
-static void end_compressed_bio_read(struct bio *bio, int err)
+static void end_compressed_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct compressed_bio *cb = bio->bi_private;
struct inode *inode;
@@ -266,7 +267,8 @@ static noinline void end_compressed_writeback(struct inode *inode, u64 start,
* This also calls the writeback end hooks for the file pages so that
* metadata and checksums can be updated in the file.
*/
-static void end_compressed_bio_write(struct bio *bio, int err)
+static void end_compressed_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct extent_io_tree *tree;
struct compressed_bio *cb = bio->bi_private;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4e9ebe1..4166099 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -685,7 +685,8 @@ static int btree_io_failed_hook(struct page *page, int failed_mirror)
return -EIO; /* we fixed nothing */
}

-static void end_workqueue_bio(struct bio *bio, int err)
+static void end_workqueue_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct end_io_wq *end_io_wq = bio->bi_private;
struct btrfs_fs_info *fs_info;
@@ -3072,7 +3073,8 @@ static int write_dev_supers(struct btrfs_device *device,
* endio for the write_dev_flush, this will wake anyone waiting
* for the barrier when it is done
*/
-static void btrfs_end_empty_barrier(struct bio *bio, int err)
+static void btrfs_end_empty_barrier(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err) {
if (err == -EOPNOTSUPP)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 32d67a8..84d8b4d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2012,7 +2012,8 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
return err;
}

-static void repair_io_failure_callback(struct bio *bio, int err)
+static void repair_io_failure_callback(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete(bio->bi_private);
}
@@ -2392,7 +2393,8 @@ int end_extent_writepage(struct page *page, int err, u64 start, u64 end)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_writepage(struct bio *bio, int err)
+static void end_bio_extent_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct extent_io_tree *tree;
@@ -2438,7 +2440,8 @@ static void end_bio_extent_writepage(struct bio *bio, int err)
* Scheduling is not allowed, so the extent state tree is expected
* to have one and only one object corresponding to this IO.
*/
-static void end_bio_extent_readpage(struct bio *bio, int err)
+static void end_bio_extent_readpage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -3262,7 +3265,8 @@ static void end_extent_buffer_writeback(struct extent_buffer *eb)
wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
}

-static void end_bio_extent_buffer_writepage(struct bio *bio, int err)
+static void end_bio_extent_buffer_writepage(struct bio *bio, int err,
+ struct batch_complete *batch)
{
int uptodate = err == 0;
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9b31b3b..551c8bd 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6931,7 +6931,8 @@ struct btrfs_dio_private {
struct bio *orig_bio;
};

-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static void btrfs_endio_direct_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -6984,10 +6985,11 @@ failed:
/* If we had a csum failure make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

-static void btrfs_endio_direct_write(struct bio *bio, int err)
+static void btrfs_endio_direct_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;
struct inode *inode = dip->inode;
@@ -7029,7 +7031,7 @@ out_done:
/* If we had an error make sure to clear the uptodate flag */
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
- dio_end_io(bio, err);
+ dio_end_io(bio, err, batch);
}

static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
@@ -7043,7 +7045,8 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
return 0;
}

-static void btrfs_end_dio_bio(struct bio *bio, int err)
+static void btrfs_end_dio_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_dio_private *dip = bio->bi_private;

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0740621..6927575 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -850,7 +850,8 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, int err, int uptodate)
* end io function used by finish_rmw. When we finally
* get here, we've written a full stripe
*/
-static void raid_write_end_io(struct bio *bio, int err)
+static void raid_write_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

@@ -1384,7 +1385,8 @@ static void set_bio_pages_uptodate(struct bio *bio)
* This will usually kick off finish_rmw once all the bios are read in, but it
* may trigger parity reconstruction if we had any errors along the way
*/
-static void raid_rmw_end_io(struct bio *bio, int err)
+static void raid_rmw_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

@@ -1905,7 +1907,8 @@ cleanup_io:
* This is called only for stripes we've read from disk to
* reconstruct the parity.
*/
-static void raid_recover_end_io(struct bio *bio, int err)
+static void raid_recover_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_raid_bio *rbio = bio->bi_private;

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f489e24..ac4a48f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -200,7 +200,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
int is_metadata, int have_csum,
const u8 *csum, u64 generation,
u16 csum_size);
-static void scrub_complete_bio_end_io(struct bio *bio, int err);
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static int scrub_repair_block_from_good_copy(struct scrub_block *sblock_bad,
struct scrub_block *sblock_good,
int force_write);
@@ -223,7 +224,8 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
u64 physical, struct btrfs_device *dev, u64 flags,
u64 gen, int mirror_num, u8 *csum, int force,
u64 physical_for_dev_replace);
-static void scrub_bio_end_io(struct bio *bio, int err);
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_bio_end_io_worker(struct btrfs_work *work);
static void scrub_block_complete(struct scrub_block *sblock);
static void scrub_remap_extent(struct btrfs_fs_info *fs_info,
@@ -240,7 +242,8 @@ static void scrub_free_wr_ctx(struct scrub_wr_ctx *wr_ctx);
static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
struct scrub_page *spage);
static void scrub_wr_submit(struct scrub_ctx *sctx);
-static void scrub_wr_bio_end_io(struct bio *bio, int err);
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch);
static void scrub_wr_bio_end_io_worker(struct btrfs_work *work);
static int write_page_nocow(struct scrub_ctx *sctx,
u64 physical_for_dev_replace, struct page *page);
@@ -1384,7 +1387,8 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
sblock->checksum_error = 1;
}

-static void scrub_complete_bio_end_io(struct bio *bio, int err)
+static void scrub_complete_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -1584,7 +1588,8 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
btrfsic_submit_bio(WRITE, sbio->bio);
}

-static void scrub_wr_bio_end_io(struct bio *bio, int err)
+static void scrub_wr_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
@@ -2053,7 +2058,8 @@ leave_nomem:
return 0;
}

-static void scrub_bio_end_io(struct bio *bio, int err)
+static void scrub_bio_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct scrub_bio *sbio = bio->bi_private;
struct btrfs_fs_info *fs_info = sbio->dev->dev_root->fs_info;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e925ce..7299b55 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5044,7 +5044,8 @@ static unsigned int extract_stripe_index_from_bio_private(void *bi_private)
return (unsigned int)((uintptr_t)bi_private) & 3;
}

-static void btrfs_end_bio(struct bio *bio, int err)
+static void btrfs_end_bio(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct btrfs_bio *bbio = extract_bbio_from_bio_private(bio->bi_private);
int is_orig_bio = 0;
@@ -5101,7 +5102,7 @@ static void btrfs_end_bio(struct bio *bio, int err)
}
kfree(bbio);

- bio_endio(bio, err);
+ bio_endio_batch(bio, err, batch);
} else if (!is_orig_bio) {
bio_put(bio);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index d2a4d1b..c410422 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2882,7 +2882,8 @@ sector_t generic_block_bmap(struct address_space *mapping, sector_t block,
}
EXPORT_SYMBOL(generic_block_bmap);

-static void end_bio_bh_io_sync(struct bio *bio, int err)
+static void end_bio_bh_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct buffer_head *bh = bio->bi_private;

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 7ab90f5..331fd5c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -324,12 +324,12 @@ static void dio_bio_end_io(struct bio *bio, int error)
* so that the DIO specific endio actions are dealt with after the filesystem
* has done it's completion work.
*/
-void dio_end_io(struct bio *bio, int error)
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;

if (dio->is_async)
- dio_bio_end_aio(bio, error);
+ dio_bio_end_aio(bio, error, batch);
else
dio_bio_end_io(bio, error);
}
@@ -350,10 +350,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,

bio->bi_bdev = bdev;
bio->bi_sector = first_sector;
- if (dio->is_async)
- bio->bi_end_io = dio_bio_end_aio;
- else
- bio->bi_end_io = dio_bio_end_io;
+ bio->bi_end_io = dio_end_io;

sdio->bio = bio;
sdio->logical_offset_in_bio = sdio->cur_page_fs_offset;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 19599bd..0f56709 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -258,7 +258,8 @@ static void buffer_io_error(struct buffer_head *bh)
(unsigned long long)bh->b_blocknr);
}

-static void ext4_end_bio(struct bio *bio, int error)
+static void ext4_end_bio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
ext4_io_end_t *io_end = bio->bi_private;
struct inode *inode;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 91ff93b..454fca9 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -330,7 +330,7 @@ repeat:
return page;
}

-static void read_end_io(struct bio *bio, int err)
+static void read_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index d8e84e4..e36793c 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -633,7 +633,8 @@ static const struct segment_allocation default_salloc_ops = {
.allocate_segment = allocate_segment_by_default,
};

-static void f2fs_end_io_write(struct bio *bio, int err)
+static void f2fs_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index c5fa758..91a5ebb 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -200,7 +200,8 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, struct bio_vec *bvec,
*
*/

-static void gfs2_end_log_write(struct bio *bio, int error)
+static void gfs2_end_log_write(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct gfs2_sbd *sdp = bio->bi_private;
struct bio_vec *bvec;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 60ede2a..86eb657 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -155,7 +155,8 @@ static int gfs2_check_sb(struct gfs2_sbd *sdp, int silent)
return -EINVAL;
}

-static void end_bio_io_page(struct bio *bio, int error)
+static void end_bio_io_page(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/hfsplus/wrapper.c b/fs/hfsplus/wrapper.c
index b51a607..96375a5 100644
--- a/fs/hfsplus/wrapper.c
+++ b/fs/hfsplus/wrapper.c
@@ -24,7 +24,8 @@ struct hfsplus_wd {
u16 embed_count;
};

-static void hfsplus_end_io_sync(struct bio *bio, int err)
+static void hfsplus_end_io_sync(struct bio *bio, int err,
+ struct batch_complete *batch)
{
if (err)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c
index c57499d..b02926c 100644
--- a/fs/jfs/jfs_logmgr.c
+++ b/fs/jfs/jfs_logmgr.c
@@ -2153,7 +2153,7 @@ static void lbmStartIO(struct lbuf * bp)
/* check if journaling to disk has been disabled */
if (log->no_integrity) {
bio->bi_size = 0;
- lbmIODone(bio, 0);
+ lbmIODone(bio, 0, NULL);
} else {
submit_bio(WRITE_SYNC, bio);
INCREMENT(lmStat.submitted);
@@ -2191,7 +2191,7 @@ static int lbmIOWait(struct lbuf * bp, int flag)
*
* executed at INTIODONE level
*/
-static void lbmIODone(struct bio *bio, int error)
+static void lbmIODone(struct bio *bio, int error, struct batch_complete *batch)
{
struct lbuf *bp = bio->bi_private;
struct lbuf *nextbp, *tail;
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index 6740d34..6ba6757 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -283,7 +283,8 @@ static void last_read_complete(struct page *page)
unlock_page(page);
}

-static void metapage_read_end_io(struct bio *bio, int err)
+static void metapage_read_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

@@ -338,7 +339,8 @@ static void last_write_complete(struct page *page)
end_page_writeback(page);
}

-static void metapage_write_end_io(struct bio *bio, int err)
+static void metapage_write_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct page *page = bio->bi_private;

diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c
index 550475c..0ae2254 100644
--- a/fs/logfs/dev_bdev.c
+++ b/fs/logfs/dev_bdev.c
@@ -14,7 +14,8 @@

#define PAGE_OFS(ofs) ((ofs) & (PAGE_SIZE-1))

-static void request_complete(struct bio *bio, int err)
+static void request_complete(struct bio *bio, int err,
+ struct batch_complete *batch)
{
complete((struct completion *)bio->bi_private);
}
@@ -64,7 +65,8 @@ static int bdev_readpage(void *_sb, struct page *page)

static DECLARE_WAIT_QUEUE_HEAD(wq);

-static void writeseg_end_io(struct bio *bio, int err)
+static void writeseg_end_io(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
@@ -168,7 +170,7 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len)
}


-static void erase_end_io(struct bio *bio, int err)
+static void erase_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct super_block *sb = bio->bi_private;
diff --git a/fs/mpage.c b/fs/mpage.c
index 0face1c..a4089bb 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -41,7 +41,7 @@
* status of that page is hard. See end_buffer_async_read() for the details.
* There is no point in duplicating all that complexity.
*/
-static void mpage_end_io(struct bio *bio, int err)
+static void mpage_end_io(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 434b93e..76cf695 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -143,7 +143,7 @@ bl_submit_bio(int rw, struct bio *bio)

static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
struct bio *bio;
@@ -167,7 +167,7 @@ static struct bio *bl_alloc_init_bio(int npg, sector_t isect,
static struct bio *do_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par,
unsigned int offset, int len)
{
@@ -190,7 +190,7 @@ retry:
static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
sector_t isect, struct page *page,
struct pnfs_block_extent *be,
- void (*end_io)(struct bio *, int err),
+ bio_end_io_t *end_io,
struct parallel_io *par)
{
return do_add_page_to_bio(bio, npg, rw, isect, page, be,
@@ -198,7 +198,8 @@ static struct bio *bl_add_page_to_bio(struct bio *bio, int npg, int rw,
}

/* This is basically copied from mpage_end_io_read */
-static void bl_end_io_read(struct bio *bio, int err)
+static void bl_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -380,7 +381,8 @@ static void mark_extents_written(struct pnfs_block_layout *bl,
}
}

-static void bl_end_io_write_zero(struct bio *bio, int err)
+static void bl_end_io_write_zero(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -408,7 +410,8 @@ static void bl_end_io_write_zero(struct bio *bio, int err)
put_parallel(par);
}

-static void bl_end_io_write(struct bio *bio, int err)
+static void bl_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
struct parallel_io *par = bio->bi_private;
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -487,7 +490,7 @@ map_block(struct buffer_head *bh, sector_t isect, struct pnfs_block_extent *be)
}

static void
-bl_read_single_end_io(struct bio *bio, int error)
+bl_read_single_end_io(struct bio *bio, int error, struct batch_complete *batch)
{
struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1;
struct page *page = bvec->bv_page;
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc9a913..680b65b 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -338,7 +338,8 @@ void nilfs_add_checksums_on_logs(struct list_head *logs, u32 seed)
/*
* BIO operations
*/
-static void nilfs_end_bio_write(struct bio *bio, int err)
+static void nilfs_end_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct nilfs_segment_buffer *segbuf = bio->bi_private;
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 42252bf..73ed9d6 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -380,8 +380,8 @@ static void o2hb_wait_on_io(struct o2hb_region *reg,
wait_for_completion(&wc->wc_io_complete);
}

-static void o2hb_bio_end_io(struct bio *bio,
- int error)
+static void o2hb_bio_end_io(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct o2hb_bio_wait_ctxt *wc = bio->bi_private;

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2b2691b..f64ee71 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -380,7 +380,8 @@ xfs_imap_valid(
STATIC void
xfs_end_bio(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_ioend_t *ioend = bio->bi_private;

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 82b70bd..cee0e42 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1224,7 +1224,8 @@ _xfs_buf_ioend(
STATIC void
xfs_buf_bio_end_io(
struct bio *bio,
- int error)
+ int error,
+ struct batch_complete *batch)
{
xfs_buf_t *bp = (xfs_buf_t *)bio->bi_private;

diff --git a/include/linux/bio.h b/include/linux/bio.h
index ef24466..7f3089f 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -580,7 +580,7 @@ extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_prep(struct bio *);
-extern void bio_integrity_endio(struct bio *, int);
+extern void bio_integrity_endio(struct bio *, int, struct batch_complete *);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
extern void bio_integrity_split(struct bio *, struct bio_pair *, int);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index fa1abeb..b3195e3 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -16,7 +16,8 @@ struct page;
struct block_device;
struct io_context;
struct cgroup_subsys_state;
-typedef void (bio_end_io_t) (struct bio *, int);
+struct batch_complete;
+typedef void (bio_end_io_t) (struct bio *, int, struct batch_complete *);
typedef void (bio_destructor_t) (struct bio *);

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..0a9a6766 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2453,7 +2453,7 @@ enum {
DIO_SKIP_HOLES = 0x02,
};

-void dio_end_io(struct bio *bio, int error);
+void dio_end_io(struct bio *bio, int error, struct batch_complete *batch);

ssize_t __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
struct block_device *bdev, const struct iovec *iov, loff_t offset,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..ca031f7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -330,11 +330,14 @@ static inline void mem_cgroup_uncharge_swap(swp_entry_t ent)
/* linux/mm/page_io.c */
extern int swap_readpage(struct page *);
extern int swap_writepage(struct page *page, struct writeback_control *wbc);
-extern void end_swap_bio_write(struct bio *bio, int err);
+extern void end_swap_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch);
extern int __swap_writepage(struct page *page, struct writeback_control *wbc,
- void (*end_write_func)(struct bio *, int));
+ void (*end_write_func)(struct bio *bio, int err,
+ struct batch_complete *batch));
extern int swap_set_page_dirty(struct page *page);
-extern void end_swap_bio_read(struct bio *bio, int err);
+extern void end_swap_bio_read(struct bio *bio, int err,
+ struct batch_complete *batch);

int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
diff --git a/mm/bounce.c b/mm/bounce.c
index c9f0a43..708c1e9 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -147,12 +147,14 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err)
bio_put(bio);
}

-static void bounce_end_io_write(struct bio *bio, int err)
+static void bounce_end_io_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
bounce_end_io(bio, page_pool, err);
}

-static void bounce_end_io_write_isa(struct bio *bio, int err)
+static void bounce_end_io_write_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{

bounce_end_io(bio, isa_page_pool, err);
@@ -168,12 +170,14 @@ static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err)
bounce_end_io(bio, pool, err);
}

-static void bounce_end_io_read(struct bio *bio, int err)
+static void bounce_end_io_read(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, page_pool, err);
}

-static void bounce_end_io_read_isa(struct bio *bio, int err)
+static void bounce_end_io_read_isa(struct bio *bio, int err,
+ struct batch_complete *batch)
{
__bounce_end_io_read(bio, isa_page_pool, err);
}
diff --git a/mm/page_io.c b/mm/page_io.c
index 3db0f5f..e39237d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -42,7 +42,8 @@ static struct bio *get_swap_bio(gfp_t gfp_flags,
return bio;
}

-void end_swap_bio_write(struct bio *bio, int err)
+void end_swap_bio_write(struct bio *bio, int err,
+ struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
@@ -68,7 +69,7 @@ void end_swap_bio_write(struct bio *bio, int err)
bio_put(bio);
}

-void end_swap_bio_read(struct bio *bio, int err)
+void end_swap_bio_read(struct bio *bio, int err, struct batch_complete *batch)
{
const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
struct page *page = bio->bi_io_vec[0].bv_page;
@@ -203,7 +204,8 @@ out:
}

int __swap_writepage(struct page *page, struct writeback_control *wbc,
- void (*end_write_func)(struct bio *, int))
+ void (*end_write_func)(struct bio *bio, int err,
+ struct batch_complete *batch))
{
struct bio *bio;
int ret = 0, rw = WRITE;
--
1.8.2.1

2013-05-14 01:22:06

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 14/21] block, aio: batch completion for bios/kiocbs

When completing a kiocb, there's some fixed overhead from touching the
kioctx's ring buffer the kiocb belongs to. Some newer high-end block
devices can complete multiple IOs per interrupt, much as many network
interfaces have been doing for some time.

This plumbs through infrastructure so we can take advantage of multiple
completions at the interrupt level, and complete multiple kiocbs at the
same time.

Drivers have to be converted to take advantage of this, but it's a simple
change and the next patches will convert a few drivers.

To use it, an interrupt handler (or any code that completes bios or
requests) declares and initializes a struct batch_complete:

struct batch_complete batch;
batch_complete_init(&batch);

Then, instead of calling bio_endio(), it calls
bio_endio_batch(bio, err, &batch). This just adds the bio to a list in
the batch_complete.

At the end, it calls

batch_complete(&batch);

This completes all the bios in one pass, building up a list of kiocbs;
then that list of kiocbs is completed all at once.
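
Put together, a converted completion path ends up looking roughly like
this sketch (the mydrv_* names are hypothetical, only there to show the
shape of the API; the real conversions follow in the next patches):

    static void mydrv_complete_irq(struct mydrv_queue *q)
    {
        struct batch_complete batch;
        struct bio *bio;
        int err;

        batch_complete_init(&batch);

        /* defer each completion onto the batch's bio list */
        while (mydrv_next_completion(q, &bio, &err))
            bio_endio_batch(bio, err, &batch);

        /* run every bi_end_io, then complete all the kiocbs at once */
        batch_complete(&batch);
    }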

[[email protected]: fix warning]
[[email protected]: fs/aio.c needs bio.h, move bio_endio_batch() declaration somewhere rational]
[[email protected]: fix warnings]
[[email protected]: fix build error due to bio_endio_batch]
[[email protected]: fix tracepoint in batch_complete()]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
block/blk-core.c | 35 +++++---
block/blk-flush.c | 2 +-
block/blk.h | 3 +-
drivers/block/swim3.c | 2 +-
drivers/md/dm.c | 2 +-
fs/aio.c | 196 +++++++++++++++++++++++++----------------
fs/bio.c | 49 +++++++----
fs/direct-io.c | 12 +--
include/linux/aio.h | 24 ++++-
include/linux/batch_complete.h | 22 +++++
include/linux/bio.h | 36 ++++++--
include/linux/blk_types.h | 1 +
include/linux/blkdev.h | 12 ++-
13 files changed, 270 insertions(+), 126 deletions(-)
create mode 100644 include/linux/batch_complete.h

diff --git a/block/blk-core.c b/block/blk-core.c
index 33c33bc..94aa4e7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -153,7 +153,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
EXPORT_SYMBOL(blk_rq_init);

static void req_bio_endio(struct request *rq, struct bio *bio,
- unsigned int nbytes, int error)
+ unsigned int nbytes, int error,
+ struct batch_complete *batch)
{
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
@@ -167,7 +168,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,

/* don't actually finish bio if it's part of flush sequence */
if (bio->bi_size == 0 && !(rq->cmd_flags & REQ_FLUSH_SEQ))
- bio_endio(bio, error);
+ bio_endio_batch(bio, error, batch);
}

void blk_dump_rq_flags(struct request *rq, char *msg)
@@ -2281,7 +2282,8 @@ EXPORT_SYMBOL(blk_fetch_request);
* %false - this request doesn't have any more data
* %true - this request has more data
**/
-bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
+bool blk_update_request(struct request *req, int error, unsigned int nr_bytes,
+ struct batch_complete *batch)
{
int total_bytes;

@@ -2337,7 +2339,7 @@ bool blk_update_request(struct request *req, int error, unsigned int nr_bytes)
if (bio_bytes == bio->bi_size)
req->bio = bio->bi_next;

- req_bio_endio(req, bio, bio_bytes, error);
+ req_bio_endio(req, bio, bio_bytes, error, batch);

total_bytes += bio_bytes;
nr_bytes -= bio_bytes;
@@ -2390,14 +2392,15 @@ EXPORT_SYMBOL_GPL(blk_update_request);

static bool blk_update_bidi_request(struct request *rq, int error,
unsigned int nr_bytes,
- unsigned int bidi_bytes)
+ unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_request(rq, error, nr_bytes))
+ if (blk_update_request(rq, error, nr_bytes, batch))
return true;

/* Bidi request must be completed as a whole */
if (unlikely(blk_bidi_rq(rq)) &&
- blk_update_request(rq->next_rq, error, bidi_bytes))
+ blk_update_request(rq->next_rq, error, bidi_bytes, batch))
return true;

if (blk_queue_add_random(rq->q))
@@ -2480,7 +2483,7 @@ static bool blk_end_bidi_request(struct request *rq, int error,
struct request_queue *q = rq->q;
unsigned long flags;

- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, NULL))
return true;

spin_lock_irqsave(q->queue_lock, flags);
@@ -2506,9 +2509,11 @@ static bool blk_end_bidi_request(struct request *rq, int error,
* %true - still buffers pending for this request
**/
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes)
+ unsigned int nr_bytes,
+ unsigned int bidi_bytes,
+ struct batch_complete *batch)
{
- if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes))
+ if (blk_update_bidi_request(rq, error, nr_bytes, bidi_bytes, batch))
return true;

blk_finish_request(rq, error);
@@ -2609,7 +2614,7 @@ EXPORT_SYMBOL_GPL(blk_end_request_err);
**/
bool __blk_end_request(struct request *rq, int error, unsigned int nr_bytes)
{
- return __blk_end_bidi_request(rq, error, nr_bytes, 0);
+ return __blk_end_bidi_request(rq, error, nr_bytes, 0, NULL);
}
EXPORT_SYMBOL(__blk_end_request);

@@ -2621,7 +2626,8 @@ EXPORT_SYMBOL(__blk_end_request);
* Description:
* Completely finish @rq. Must be called with queue lock held.
*/
-void __blk_end_request_all(struct request *rq, int error)
+void blk_end_request_all_batch(struct request *rq, int error,
+ struct batch_complete *batch)
{
bool pending;
unsigned int bidi_bytes = 0;
@@ -2629,10 +2635,11 @@ void __blk_end_request_all(struct request *rq, int error)
if (unlikely(blk_bidi_rq(rq)))
bidi_bytes = blk_rq_bytes(rq->next_rq);

- pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq), bidi_bytes);
+ pending = __blk_end_bidi_request(rq, error, blk_rq_bytes(rq),
+ bidi_bytes, batch);
BUG_ON(pending);
}
-EXPORT_SYMBOL(__blk_end_request_all);
+EXPORT_SYMBOL(blk_end_request_all_batch);

/**
* __blk_end_request_cur - Helper function to finish the current request chunk.
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 762cfca..ab0ed23 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -316,7 +316,7 @@ void blk_insert_flush(struct request *rq)
* complete the request.
*/
if (!policy) {
- __blk_end_bidi_request(rq, 0, 0, 0);
+ __blk_end_bidi_request(rq, 0, 0, 0, NULL);
return;
}

diff --git a/block/blk.h b/block/blk.h
index e837b8f..dc8fee6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -31,7 +31,8 @@ void blk_queue_bypass_end(struct request_queue *q);
void blk_dequeue_request(struct request *rq);
void __blk_queue_free_tags(struct request_queue *q);
bool __blk_end_bidi_request(struct request *rq, int error,
- unsigned int nr_bytes, unsigned int bidi_bytes);
+ unsigned int nr_bytes, unsigned int bidi_bytes,
+ struct batch_complete *batch);

void blk_rq_timed_out_timer(unsigned long data);
void blk_delete_timer(struct request *);
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index 20e061c..9282e66 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -775,7 +775,7 @@ static irqreturn_t swim3_interrupt(int irq, void *dev_id)
if (intr & ERROR_INTR) {
n = fs->scount - 1 - resid / 512;
if (n > 0) {
- blk_update_request(req, 0, n << 9);
+ blk_update_request(req, 0, n << 9, NULL);
fs->req_sector += n;
}
if (fs->retries < 5) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9101124..2901060 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -696,7 +696,7 @@ static void end_clone_bio(struct bio *clone, int error,
* Do not use blk_end_request() here, because it may complete
* the original request before the clone, and break the ordering.
*/
- blk_update_request(tio->orig, 0, nr_bytes);
+ blk_update_request(tio->orig, 0, nr_bytes, NULL);
}

/*
diff --git a/fs/aio.c b/fs/aio.c
index a127e5a..aa39194 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -25,6 +25,7 @@
#include <linux/file.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/bio.h>
#include <linux/mmu_context.h>
#include <linux/percpu.h>
#include <linux/slab.h>
@@ -659,55 +660,11 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)
return ret;
}

-/* aio_complete
- * Called when the io request on the given iocb is complete.
- */
-void aio_complete(struct kiocb *iocb, long res, long res2)
+static inline unsigned kioctx_ring_put(struct kioctx *ctx, struct kiocb *req,
+ unsigned tail)
{
- struct kioctx *ctx = iocb->ki_ctx;
- struct aio_ring *ring;
struct io_event *ev_page, *event;
- unsigned long flags;
- unsigned tail, pos;
-
- /*
- * Special case handling for sync iocbs:
- * - events go directly into the iocb for fast handling
- * - the sync task with the iocb in its stack holds the single iocb
- * ref, no other paths have a way to get another ref
- * - the sync task helpfully left a reference to itself in the iocb
- */
- if (is_sync_kiocb(iocb)) {
- iocb->ki_user_data = res;
- smp_wmb();
- iocb->ki_ctx = ERR_PTR(-EXDEV);
- wake_up_process(iocb->ki_obj.tsk);
- return;
- }
-
- /*
- * Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after incrementing reqs_available.
- */
- rcu_read_lock();
-
- if (iocb->ki_list.next) {
- unsigned long flags;
-
- spin_lock_irqsave(&ctx->ctx_lock, flags);
- list_del(&iocb->ki_list);
- spin_unlock_irqrestore(&ctx->ctx_lock, flags);
- }
-
- /*
- * Add a completion event to the ring buffer. Must be done holding
- * ctx->ctx_lock to prevent other code from messing with the tail
- * pointer since we might be called from irq context.
- */
- spin_lock_irqsave(&ctx->completion_lock, flags);
-
- tail = ctx->tail;
- pos = tail + AIO_EVENTS_OFFSET;
+ unsigned pos = tail + AIO_EVENTS_OFFSET;

if (++tail >= ctx->nr_events)
tail = 0;
@@ -715,22 +672,30 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
ev_page = kmap_atomic(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);
event = ev_page + pos % AIO_EVENTS_PER_PAGE;

- event->obj = (u64)(unsigned long)iocb->ki_obj.user;
- event->data = iocb->ki_user_data;
- event->res = res;
- event->res2 = res2;
+ event->obj = (u64)(unsigned long)req->ki_obj.user;
+ event->data = req->ki_user_data;
+ event->res = req->ki_res;
+ event->res2 = req->ki_res2;

kunmap_atomic(ev_page);
flush_dcache_page(ctx->ring_pages[pos / AIO_EVENTS_PER_PAGE]);

pr_debug("%p[%u]: %p: %p %Lx %lx %lx\n",
- ctx, tail, iocb, iocb->ki_obj.user, iocb->ki_user_data,
- res, res2);
+ ctx, tail, req, req->ki_obj.user, req->ki_user_data,
+ req->ki_res, req->ki_res2);

- /* after flagging the request as done, we
- * must never even look at it again
- */
- smp_wmb(); /* make event visible before updating tail */
+ return tail;
+}
+
+static inline void kioctx_ring_unlock(struct kioctx *ctx, unsigned tail)
+{
+ struct aio_ring *ring;
+
+ if (!ctx)
+ return;
+
+ smp_wmb();
+ /* make event visible before updating tail */

ctx->tail = tail;

@@ -739,20 +704,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- spin_unlock_irqrestore(&ctx->completion_lock, flags);
-
- pr_debug("added to ring %p at [%u]\n", iocb, tail);
-
- /*
- * Check if the user asked us to deliver the result through an
- * eventfd. The eventfd_signal() function is safe to be called
- * from IRQ context.
- */
- if (iocb->ki_eventfd != NULL)
- eventfd_signal(iocb->ki_eventfd, 1);
-
- /* everything turned out well, dispose of the aiocb. */
- kiocb_free(iocb);
+ spin_unlock(&ctx->completion_lock);

/*
* We have to order our ring_info tail store above and test
@@ -762,12 +714,108 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
smp_mb();

- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (waitqueue_active(&ctx->wait)) {
+ /* Irqs are already disabled */
+ spin_lock(&ctx->wait.lock);
+ wake_up_locked(&ctx->wait);
+ spin_unlock(&ctx->wait.lock);
+ }
+}
+
+void batch_complete_aio(struct batch_complete *batch)
+{
+ struct kioctx *ctx = NULL;
+ struct kiocb *req, *next;
+ unsigned long flags;
+ unsigned tail = 0;
+
+ /*
+ * Take rcu_read_lock() in case the kioctx is being destroyed, as we
+ * need to issue a wakeup after incrementing reqs_available.
+ */
+ rcu_read_lock();
+ local_irq_save(flags);
+
+ for (req = batch->kiocb; req; req = req->ki_next) {
+ if (req->ki_ctx != ctx) {
+ kioctx_ring_unlock(ctx, tail);

+ ctx = req->ki_ctx;
+ spin_lock(&ctx->completion_lock);
+ tail = ctx->tail;
+ }
+
+ tail = kioctx_ring_put(ctx, req, tail);
+ }
+
+ kioctx_ring_unlock(ctx, tail);
+ local_irq_restore(flags);
rcu_read_unlock();
+
+ for (req = batch->kiocb; req; req = next) {
+ next = req->ki_next;
+
+ if (req->ki_eventfd)
+ eventfd_signal(req->ki_eventfd, 1);
+
+ kiocb_free(req);
+ }
+}
+EXPORT_SYMBOL(batch_complete_aio);
+
+/* aio_complete_batch
+ * Called when the io request on the given iocb is complete; @batch may be
+ * NULL.
+ */
+void aio_complete_batch(struct kiocb *req, long res, long res2,
+ struct batch_complete *batch)
+{
+ req->ki_res = res;
+ req->ki_res2 = res2;
+
+ if (req->ki_list.next) {
+ struct kioctx *ctx = req->ki_ctx;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ctx->ctx_lock, flags);
+ list_del(&req->ki_list);
+ spin_unlock_irqrestore(&ctx->ctx_lock, flags);
+ }
+
+ /*
+ * Special case handling for sync iocbs:
+ * - events go directly into the iocb for fast handling
+ * - the sync task with the iocb in its stack holds the single iocb
+ * ref, no other paths have a way to get another ref
+ * - the sync task helpfully left a reference to itself in the iocb
+ */
+ if (is_sync_kiocb(req)) {
+ req->ki_user_data = req->ki_res;
+ smp_wmb();
+ req->ki_ctx = ERR_PTR(-EXDEV);
+ wake_up_process(req->ki_obj.tsk);
+ } else if (batch) {
+ unsigned i = 0;
+ struct kiocb **p = &batch->kiocb;
+
+ while (*p && (*p)->ki_ctx > req->ki_ctx) {
+ p = &(*p)->ki_next;
+ if (++i == 16) {
+ batch_complete_aio(batch);
+ batch->kiocb = req;
+ return;
+ }
+ }
+
+ req->ki_next = *p;
+ *p = req;
+ } else {
+ struct batch_complete batch_stack = { .kiocb = req };
+
+ batch_complete_aio(&batch_stack);
+ }
}
-EXPORT_SYMBOL(aio_complete);
+EXPORT_SYMBOL(aio_complete_batch);

/* aio_read_events
* Pull an event off of the ioctx's event ring. Returns the number of
diff --git a/fs/bio.c b/fs/bio.c
index e082907..8489d7a 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -28,6 +28,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/cgroup.h>
+#include <linux/aio.h>
#include <scsi/sg.h> /* for struct sg_iovec */

#include <trace/events/block.h>
@@ -1688,31 +1689,41 @@ void bio_flush_dcache_pages(struct bio *bi)
EXPORT_SYMBOL(bio_flush_dcache_pages);
#endif

-/**
- * bio_endio - end I/O on a bio
- * @bio: bio
- * @error: error, if any
- *
- * Description:
- * bio_endio() will end I/O on the whole bio. bio_endio() is the
- * preferred way to end I/O on a bio, it takes care of clearing
- * BIO_UPTODATE on error. @error is 0 on success, and and one of the
- * established -Exxxx (-EIO, for instance) error values in case
- * something went wrong. No one should call bi_end_io() directly on a
- * bio unless they own it and thus know that it has an end_io
- * function.
- **/
-void bio_endio(struct bio *bio, int error)
+static inline void __bio_endio(struct bio *bio, struct batch_complete *batch)
{
- if (error)
+ if (bio->bi_error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
- error = -EIO;
+ bio->bi_error = -EIO;

if (bio->bi_end_io)
- bio->bi_end_io(bio, error, NULL);
+ bio->bi_end_io(bio, bio->bi_error, batch);
+}
+
+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch)
+{
+ if (error)
+ bio->bi_error = error;
+
+ if (batch)
+ bio_list_add(&batch->bio, bio);
+ else
+ __bio_endio(bio, batch);
+
+}
+EXPORT_SYMBOL(bio_endio_batch);
+
+void batch_complete(struct batch_complete *batch)
+{
+ struct bio *bio;
+
+ while ((bio = bio_list_pop(&batch->bio)))
+ __bio_endio(bio, batch);
+
+ if (batch->kiocb)
+ batch_complete_aio(batch);
}
-EXPORT_SYMBOL(bio_endio);
+EXPORT_SYMBOL(batch_complete);

void bio_pair_release(struct bio_pair *bp)
{
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 331fd5c..b4dd97c 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -230,7 +230,8 @@ static inline struct page *dio_get_page(struct dio *dio,
* filesystems can use it to hold additional state between get_block calls and
* dio_complete.
*/
-static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is_async)
+static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret,
+ bool is_async, struct batch_complete *batch)
{
ssize_t transferred = 0;

@@ -264,7 +265,7 @@ static ssize_t dio_complete(struct dio *dio, loff_t offset, ssize_t ret, bool is
} else {
inode_dio_done(dio->inode);
if (is_async)
- aio_complete(dio->iocb, ret, 0);
+ aio_complete_batch(dio->iocb, ret, 0, batch);
}

return ret;
@@ -274,7 +275,8 @@ static int dio_bio_complete(struct dio *dio, struct bio *bio);
/*
* Asynchronous IO callback.
*/
-static void dio_bio_end_aio(struct bio *bio, int error)
+static void dio_bio_end_aio(struct bio *bio, int error,
+ struct batch_complete *batch)
{
struct dio *dio = bio->bi_private;
unsigned long remaining;
@@ -290,7 +292,7 @@ static void dio_bio_end_aio(struct bio *bio, int error)
spin_unlock_irqrestore(&dio->bio_lock, flags);

if (remaining == 0) {
- dio_complete(dio, dio->iocb->ki_pos, 0, true);
+ dio_complete(dio, dio->iocb->ki_pos, 0, true, batch);
kmem_cache_free(dio_cache, dio);
}
}
@@ -1265,7 +1267,7 @@ do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
dio_await_completion(dio);

if (drop_refcount(dio) == 0) {
- retval = dio_complete(dio, offset, retval, false);
+ retval = dio_complete(dio, offset, retval, false, NULL);
kmem_cache_free(dio_cache, dio);
} else
BUG_ON(retval != -EIOCBQUEUED);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index d9c92da..a6fe048 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -6,11 +6,12 @@
#include <linux/aio_abi.h>
#include <linux/uio.h>
#include <linux/rcupdate.h>
-
#include <linux/atomic.h>
+#include <linux/batch_complete.h>

struct kioctx;
struct kiocb;
+struct batch_complete;

#define KIOCB_KEY 0

@@ -30,6 +31,8 @@ struct kiocb;
typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
+ struct kiocb *ki_next; /* batch completion */
+
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
@@ -41,6 +44,9 @@ struct kiocb {
} ki_obj;

__u64 ki_user_data; /* user's data for completion */
+ long ki_res;
+ long ki_res2;
+
loff_t ki_pos;
size_t ki_nbytes; /* copy of iocb->aio_nbytes */

@@ -71,7 +77,9 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern void aio_complete(struct kiocb *iocb, long res, long res2);
+extern void batch_complete_aio(struct batch_complete *batch);
+extern void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
extern long do_io_submit(aio_context_t ctx_id, long nr,
@@ -79,7 +87,12 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
+static inline void batch_complete_aio(struct batch_complete *batch) { }
+static inline void aio_complete_batch(struct kiocb *iocb, long res, long res2,
+ struct batch_complete *batch)
+{
+ return;
+}
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
static inline long do_io_submit(aio_context_t ctx_id, long nr,
@@ -89,6 +102,11 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
kiocb_cancel_fn *cancel) { }
#endif /* CONFIG_AIO */

+static inline void aio_complete(struct kiocb *iocb, long res, long res2)
+{
+ aio_complete_batch(iocb, res, res2, NULL);
+}
+
static inline struct kiocb *list_kiocb(struct list_head *h)
{
return list_entry(h, struct kiocb, ki_list);
diff --git a/include/linux/batch_complete.h b/include/linux/batch_complete.h
new file mode 100644
index 0000000..298baeb
--- /dev/null
+++ b/include/linux/batch_complete.h
@@ -0,0 +1,22 @@
+#ifndef _LINUX_BATCH_COMPLETE_H
+#define _LINUX_BATCH_COMPLETE_H
+
+/*
+ * Common stuff to the aio and block code for batch completion. Everything
+ * important is elsewhere:
+ */
+
+struct bio;
+struct kiocb;
+
+struct bio_list {
+ struct bio *head;
+ struct bio *tail;
+};
+
+struct batch_complete {
+ struct bio_list bio;
+ struct kiocb *kiocb;
+};
+
+#endif
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7f3089f..1c72bfa 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -24,6 +24,7 @@
#include <linux/mempool.h>
#include <linux/ioprio.h>
#include <linux/bug.h>
+#include <linux/batch_complete.h>

#ifdef CONFIG_BLOCK

@@ -69,6 +70,8 @@
#define bio_sectors(bio) ((bio)->bi_size >> 9)
#define bio_end_sector(bio) ((bio)->bi_sector + bio_sectors((bio)))

+void bio_endio_batch(struct bio *bio, int error, struct batch_complete *batch);
+
static inline unsigned int bio_cur_bytes(struct bio *bio)
{
if (bio->bi_vcnt)
@@ -252,7 +255,25 @@ static inline struct bio *bio_clone_kmalloc(struct bio *bio, gfp_t gfp_mask)

}

-extern void bio_endio(struct bio *, int);
+/**
+ * bio_endio - end I/O on a bio
+ * @bio: bio
+ * @error: error, if any
+ *
+ * Description:
+ * bio_endio() will end I/O on the whole bio. bio_endio() is the
+ * preferred way to end I/O on a bio, it takes care of clearing
+ * BIO_UPTODATE on error. @error is 0 on success, and and one of the
+ * established -Exxxx (-EIO, for instance) error values in case
+ * something went wrong. No one should call bi_end_io() directly on a
+ * bio unless they own it and thus know that it has an end_io
+ * function.
+ **/
+static inline void bio_endio(struct bio *bio, int error)
+{
+ bio_endio_batch(bio, error, NULL);
+}
+
struct request_queue;
extern int bio_phys_segments(struct request_queue *, struct bio *);

@@ -404,10 +425,6 @@ static inline bool bio_mergeable(struct bio *bio)
* member of the bio. The bio_list also caches the last list member to allow
* fast access to the tail.
*/
-struct bio_list {
- struct bio *head;
- struct bio *tail;
-};

static inline int bio_list_empty(const struct bio_list *bl)
{
@@ -554,6 +571,15 @@ struct biovec_slab {
*/
#define BIO_SPLIT_ENTRIES 2

+static inline void batch_complete_init(struct batch_complete *batch)
+{
+ bio_list_init(&batch->bio);
+ batch->kiocb = NULL;
+}
+
+void batch_complete(struct batch_complete *batch);
+
+
#if defined(CONFIG_BLK_DEV_INTEGRITY)

#define bip_vec_idx(bip, idx) (&(bip->bip_vec[(idx)]))
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index b3195e3..9d3cafa 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -43,6 +43,7 @@ struct bio {
* top bits priority
*/

+ short bi_error;
unsigned short bi_vcnt; /* how many bio_vec's */
unsigned short bi_idx; /* current index into bvl_vec */

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2fdb4a4..ddc2f80 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -883,7 +883,8 @@ extern struct request *blk_fetch_request(struct request_queue *q);
* This prevents code duplication in drivers.
*/
extern bool blk_update_request(struct request *rq, int error,
- unsigned int nr_bytes);
+ unsigned int nr_bytes,
+ struct batch_complete *batch);
extern bool blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
extern void blk_end_request_all(struct request *rq, int error);
@@ -891,10 +892,17 @@ extern bool blk_end_request_cur(struct request *rq, int error);
extern bool blk_end_request_err(struct request *rq, int error);
extern bool __blk_end_request(struct request *rq, int error,
unsigned int nr_bytes);
-extern void __blk_end_request_all(struct request *rq, int error);
extern bool __blk_end_request_cur(struct request *rq, int error);
extern bool __blk_end_request_err(struct request *rq, int error);

+extern void blk_end_request_all_batch(struct request *rq, int error,
+ struct batch_complete *batch);
+
+static inline void __blk_end_request_all(struct request *rq, int error)
+{
+ blk_end_request_all_batch(rq, error, NULL);
+}
+
extern void blk_complete_request(struct request *);
extern void __blk_complete_request(struct request *);
extern void blk_abort_request(struct request *);
--
1.8.2.1

2013-05-14 01:22:04

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 15/21] virtio-blk: convert to batch completion

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
drivers/block/virtio_blk.c | 31 ++++++++++++++++++++-----------
1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 6472395..49d0ec2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -217,7 +217,8 @@ static void virtblk_bio_send_flush_work(struct work_struct *work)
virtblk_bio_send_flush(vbr);
}

-static inline void virtblk_request_done(struct virtblk_req *vbr)
+static inline void virtblk_request_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;
struct request *req = vbr->req;
@@ -231,11 +232,12 @@ static inline void virtblk_request_done(struct virtblk_req *vbr)
req->errors = (error != 0);
}

- __blk_end_request_all(req, error);
+ blk_end_request_all_batch(req, error, batch);
mempool_free(vbr, vblk->pool);
}

-static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_flush_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -244,12 +246,13 @@ static inline void virtblk_bio_flush_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_data_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_data_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
struct virtio_blk *vblk = vbr->vblk;

@@ -259,17 +262,18 @@ static inline void virtblk_bio_data_done(struct virtblk_req *vbr)
INIT_WORK(&vbr->work, virtblk_bio_send_flush_work);
queue_work(virtblk_wq, &vbr->work);
} else {
- bio_endio(vbr->bio, virtblk_result(vbr));
+ bio_endio_batch(vbr->bio, virtblk_result(vbr), batch);
mempool_free(vbr, vblk->pool);
}
}

-static inline void virtblk_bio_done(struct virtblk_req *vbr)
+static inline void virtblk_bio_done(struct virtblk_req *vbr,
+ struct batch_complete *batch)
{
if (unlikely(vbr->flags & VBLK_IS_FLUSH))
- virtblk_bio_flush_done(vbr);
+ virtblk_bio_flush_done(vbr, batch);
else
- virtblk_bio_data_done(vbr);
+ virtblk_bio_data_done(vbr, batch);
}

static void virtblk_done(struct virtqueue *vq)
@@ -279,16 +283,19 @@ static void virtblk_done(struct virtqueue *vq)
struct virtblk_req *vbr;
unsigned long flags;
unsigned int len;
+ struct batch_complete batch;
+
+ batch_complete_init(&batch);

spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
do {
virtqueue_disable_cb(vq);
while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
if (vbr->bio) {
- virtblk_bio_done(vbr);
+ virtblk_bio_done(vbr, &batch);
bio_done = true;
} else {
- virtblk_request_done(vbr);
+ virtblk_request_done(vbr, &batch);
req_done = true;
}
}
@@ -298,6 +305,8 @@ static void virtblk_done(struct virtqueue *vq)
blk_start_queue(vblk->disk->queue);
spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);

+ batch_complete(&batch);
+
if (bio_done)
wake_up(&vblk->queue_wait);
}
--
1.8.2.1

2013-05-14 01:23:21

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 11/21] aio: Kill ki_dtor

sock_aio_dtor() is dead code - and anything that does need to do cleanup
can simply do it before calling aio_complete().

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 2 --
include/linux/aio.h | 1 -
net/socket.c | 13 ++-----------
3 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 40781ff..7ce3cd8 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -619,8 +619,6 @@ static void kiocb_free(struct kiocb *req)
fput(req->ki_filp);
if (req->ki_eventfd != NULL)
eventfd_ctx_put(req->ki_eventfd);
- if (req->ki_dtor)
- req->ki_dtor(req);
kmem_cache_free(kiocb_cachep, req);
}

diff --git a/include/linux/aio.h b/include/linux/aio.h
index c4f07ff..d9c92da 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -33,7 +33,6 @@ struct kiocb {
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
- void (*ki_dtor)(struct kiocb *);
void *private;

union {
diff --git a/net/socket.c b/net/socket.c
index bfe9fab..fc3bf4c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -848,11 +848,6 @@ int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
}
EXPORT_SYMBOL(kernel_recvmsg);

-static void sock_aio_dtor(struct kiocb *iocb)
-{
- kfree(iocb->private);
-}
-
static ssize_t sock_sendpage(struct file *file, struct page *page,
int offset, size_t size, loff_t *ppos, int more)
{
@@ -883,12 +878,8 @@ static ssize_t sock_splice_read(struct file *file, loff_t *ppos,
static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb,
struct sock_iocb *siocb)
{
- if (!is_sync_kiocb(iocb)) {
- siocb = kmalloc(sizeof(*siocb), GFP_KERNEL);
- if (!siocb)
- return NULL;
- iocb->ki_dtor = sock_aio_dtor;
- }
+ if (!is_sync_kiocb(iocb))
+ BUG();

siocb->kiocb = iocb;
iocb->private = siocb;
--
1.8.2.1

2013-05-14 01:23:40

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 10/21] aio: Kill ki_users

The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.
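
With the refcount gone, the contract is simply that ->ki_cancel() runs
entirely under the ioctx lock. A minimal sketch of what a caller of
kiocb_cancel() looks like after this change (error handling omitted):

    spin_lock_irq(&ctx->ctx_lock);
    ret = kiocb_cancel(ctx, kiocb); /* ki_cancel() runs with ctx_lock
                                     * held, so it must not block */
    spin_unlock_irq(&ctx->ctx_lock);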

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 47 ++++++++++++-----------------------------------
include/linux/aio.h | 5 -----
2 files changed, 12 insertions(+), 40 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 280b014..40781ff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -262,7 +262,6 @@ EXPORT_SYMBOL(kiocb_set_cancel_fn);
static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
{
kiocb_cancel_fn *old, *cancel;
- int ret = -EINVAL;

/*
* Don't want to set kiocb->ki_cancel = KIOCB_CANCELLED unless it
@@ -272,21 +271,13 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
cancel = ACCESS_ONCE(kiocb->ki_cancel);
do {
if (!cancel || cancel == KIOCB_CANCELLED)
- return ret;
+ return -EINVAL;

old = cancel;
cancel = cmpxchg(&kiocb->ki_cancel, old, KIOCB_CANCELLED);
} while (cancel != old);

- atomic_inc(&kiocb->ki_users);
- spin_unlock_irq(&ctx->ctx_lock);
-
- ret = cancel(kiocb);
-
- spin_lock_irq(&ctx->ctx_lock);
- aio_put_req(kiocb);
-
- return ret;
+ return cancel(kiocb);
}

static void free_ioctx_rcu(struct rcu_head *head)
@@ -510,16 +501,16 @@ static void kill_ioctx(struct kioctx *ctx)
/* wait_on_sync_kiocb:
* Waits on the given sync kiocb to complete.
*/
-ssize_t wait_on_sync_kiocb(struct kiocb *iocb)
+ssize_t wait_on_sync_kiocb(struct kiocb *req)
{
- while (atomic_read(&iocb->ki_users)) {
+ while (!req->ki_ctx) {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (!atomic_read(&iocb->ki_users))
+ if (req->ki_ctx)
break;
io_schedule();
}
__set_current_state(TASK_RUNNING);
- return iocb->ki_user_data;
+ return req->ki_user_data;
}
EXPORT_SYMBOL(wait_on_sync_kiocb);

@@ -601,14 +592,8 @@ out:
}

/* aio_get_req
- * Allocate a slot for an aio request. Increments the ki_users count
- * of the kioctx so that the kioctx stays around until all requests are
- * complete. Returns NULL if no requests are free.
- *
- * Returns with kiocb->ki_users set to 2. The io submit code path holds
- * an extra reference while submitting the i/o.
- * This prevents races between the aio code path referencing the
- * req (after submitting it) and aio_complete() freeing the req.
+ * Allocate a slot for an aio request.
+ * Returns NULL if no requests are free.
*/
static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
@@ -621,7 +606,6 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
if (unlikely(!req))
goto out_put;

- atomic_set(&req->ki_users, 1);
req->ki_ctx = ctx;
return req;
out_put:
@@ -640,13 +624,6 @@ static void kiocb_free(struct kiocb *req)
kmem_cache_free(kiocb_cachep, req);
}

-void aio_put_req(struct kiocb *req)
-{
- if (atomic_dec_and_test(&req->ki_users))
- kiocb_free(req);
-}
-EXPORT_SYMBOL(aio_put_req);
-
static struct kioctx *lookup_ioctx(unsigned long ctx_id)
{
struct mm_struct *mm = current->mm;
@@ -685,9 +662,9 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
* - the sync task helpfully left a reference to itself in the iocb
*/
if (is_sync_kiocb(iocb)) {
- BUG_ON(atomic_read(&iocb->ki_users) != 1);
iocb->ki_user_data = res;
- atomic_set(&iocb->ki_users, 0);
+ smp_wmb();
+ iocb->ki_ctx = ERR_PTR(-EXDEV);
wake_up_process(iocb->ki_obj.tsk);
return;
}
@@ -759,7 +736,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
eventfd_signal(iocb->ki_eventfd, 1);

/* everything turned out well, dispose of the aiocb. */
- aio_put_req(iocb);
+ kiocb_free(iocb);

/*
* We have to order our ring_info tail store above and test
@@ -1183,7 +1160,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
return 0;
out_put_req:
put_reqs_available(ctx, 1);
- aio_put_req(req);
+ kiocb_free(req);
return ret;
}

diff --git a/include/linux/aio.h b/include/linux/aio.h
index b570472..c4f07ff 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -30,8 +30,6 @@ struct kiocb;
typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
- atomic_t ki_users;
-
struct file *ki_filp;
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
@@ -65,7 +63,6 @@ static inline bool is_sync_kiocb(struct kiocb *kiocb)
static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
{
*kiocb = (struct kiocb) {
- .ki_users = ATOMIC_INIT(1),
.ki_ctx = NULL,
.ki_filp = filp,
.ki_obj.tsk = current,
@@ -75,7 +72,6 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
/* prototypes */
#ifdef CONFIG_AIO
extern ssize_t wait_on_sync_kiocb(struct kiocb *iocb);
-extern void aio_put_req(struct kiocb *iocb);
extern void aio_complete(struct kiocb *iocb, long res, long res2);
struct mm_struct;
extern void exit_aio(struct mm_struct *mm);
@@ -84,7 +80,6 @@ extern long do_io_submit(aio_context_t ctx_id, long nr,
void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel);
#else
static inline ssize_t wait_on_sync_kiocb(struct kiocb *iocb) { return 0; }
-static inline void aio_put_req(struct kiocb *iocb) { }
static inline void aio_complete(struct kiocb *iocb, long res, long res2) { }
struct mm_struct;
static inline void exit_aio(struct mm_struct *mm) { }
--
1.8.2.1

2013-05-14 01:19:17

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 07/21] aio: Don't use ctx->tail unnecessarily

aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.

So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.
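
For reference, the wrapped-ring availability calculation the read side
now uses (the same expression as in aio_read_events_ring() below), with
hypothetical numbers:

    avail = (head <= tail ? tail : ctx->nr_events) - head;

    /* e.g. nr_events = 128, head = 120, tail = 8:
     *   1st pass: avail = 128 - 120 = 8, head wraps to 0
     *   2nd pass: avail = 8 - 0 = 8, head == tail, done
     */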

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
fs/aio.c | 41 +++++++++++++++++++++++------------------
1 file changed, 23 insertions(+), 18 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 5e1b801..2c9a5ac 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -306,7 +306,8 @@ static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
struct kiocb *req;
- unsigned cpu, head, avail;
+ unsigned cpu, avail;
+ DEFINE_WAIT(wait);

spin_lock_irq(&ctx->ctx_lock);

@@ -327,21 +328,24 @@ static void free_ioctx(struct kioctx *ctx)
kcpu->reqs_available = 0;
}

- ring = kmap_atomic(ctx->ring_pages[0]);
- head = ring->head;
- kunmap_atomic(ring);
+ while (1) {
+ prepare_to_wait(&ctx->wait, &wait, TASK_UNINTERRUPTIBLE);

- while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
- wait_event(ctx->wait,
- (head != ctx->tail) ||
- (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));
-
- avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;
+ ring = kmap_atomic(ctx->ring_pages[0]);
+ avail = (ring->head <= ring->tail)
+ ? ring->tail - ring->head
+ : ctx->nr_events - ring->head + ring->tail;

atomic_add(avail, &ctx->reqs_available);
- head += avail;
- head %= ctx->nr_events;
+ ring->head = ring->tail;
+ kunmap_atomic(ring);
+
+ if (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1)
+ break;
+
+ schedule();
}
+ finish_wait(&ctx->wait, &wait);

WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

@@ -782,7 +786,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event __user *event, long nr)
{
struct aio_ring *ring;
- unsigned head, pos;
+ unsigned head, tail, pos;
long ret = 0;
int copy_ret;

@@ -790,11 +794,12 @@ static long aio_read_events_ring(struct kioctx *ctx,

ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
+ tail = ring->tail;
kunmap_atomic(ring);

- pr_debug("h%u t%u m%u\n", head, ctx->tail, ctx->nr_events);
+ pr_debug("h%u t%u m%u\n", head, tail, ctx->nr_events);

- if (head == ctx->tail)
+ if (head == tail)
goto out;

while (ret < nr) {
@@ -802,8 +807,8 @@ static long aio_read_events_ring(struct kioctx *ctx,
struct io_event *ev;
struct page *page;

- avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;
- if (head == ctx->tail)
+ avail = (head <= tail ? tail : ctx->nr_events) - head;
+ if (head == tail)
break;

avail = min(avail, nr - ret);
@@ -834,7 +839,7 @@ static long aio_read_events_ring(struct kioctx *ctx,
kunmap_atomic(ring);
flush_dcache_page(ctx->ring_pages[0]);

- pr_debug("%li h%u t%u\n", ret, head, ctx->tail);
+ pr_debug("%li h%u t%u\n", ret, head, tail);

put_reqs_available(ctx, ret);
out:
--
1.8.2.1

2013-05-14 01:24:07

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 09/21] aio: Kill unneeded kiocb members

The old aio retry infrastructure needed to save the various arguments to
aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/aio.c | 69 +++++++++++++++++++++++++++++++----------------------
include/linux/aio.h | 11 ++-------
2 files changed, 42 insertions(+), 38 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 73ec062..280b014 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -637,8 +637,6 @@ static void kiocb_free(struct kiocb *req)
eventfd_ctx_put(req->ki_eventfd);
if (req->ki_dtor)
req->ki_dtor(req);
- if (req->ki_iovec != &req->ki_inline_vec)
- kfree(req->ki_iovec);
kmem_cache_free(kiocb_cachep, req);
}

@@ -968,24 +966,26 @@ SYSCALL_DEFINE1(io_destroy, aio_context_t, ctx)
typedef ssize_t (aio_rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);

-static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
+static ssize_t aio_setup_vectored_rw(struct kiocb *kiocb,
+ int rw, char __user *buf,
+ unsigned long *nr_segs,
+ struct iovec **iovec,
+ bool compat)
{
ssize_t ret;

- kiocb->ki_nr_segs = kiocb->ki_nbytes;
+ *nr_segs = kiocb->ki_nbytes;

#ifdef CONFIG_COMPAT
if (compat)
ret = compat_rw_copy_check_uvector(rw,
- (struct compat_iovec __user *)kiocb->ki_buf,
- kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
- &kiocb->ki_iovec);
+ (struct compat_iovec __user *)buf,
+ *nr_segs, 1, *iovec, iovec);
else
#endif
ret = rw_copy_check_uvector(rw,
- (struct iovec __user *)kiocb->ki_buf,
- kiocb->ki_nr_segs, 1, &kiocb->ki_inline_vec,
- &kiocb->ki_iovec);
+ (struct iovec __user *)buf,
+ *nr_segs, 1, *iovec, iovec);
if (ret < 0)
return ret;

@@ -994,15 +994,17 @@ static ssize_t aio_setup_vectored_rw(int rw, struct kiocb *kiocb, bool compat)
return 0;
}

-static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
+static ssize_t aio_setup_single_vector(struct kiocb *kiocb,
+ int rw, char __user *buf,
+ unsigned long *nr_segs,
+ struct iovec *iovec)
{
- if (unlikely(!access_ok(!rw, kiocb->ki_buf, kiocb->ki_nbytes)))
+ if (unlikely(!access_ok(!rw, buf, kiocb->ki_nbytes)))
return -EFAULT;

- kiocb->ki_iovec = &kiocb->ki_inline_vec;
- kiocb->ki_iovec->iov_base = kiocb->ki_buf;
- kiocb->ki_iovec->iov_len = kiocb->ki_nbytes;
- kiocb->ki_nr_segs = 1;
+ iovec->iov_base = buf;
+ iovec->iov_len = kiocb->ki_nbytes;
+ *nr_segs = 1;
return 0;
}

@@ -1011,15 +1013,18 @@ static ssize_t aio_setup_single_vector(int rw, struct kiocb *kiocb)
* Performs the initial checks and aio retry method
* setup for the kiocb at the time of io submission.
*/
-static ssize_t aio_run_iocb(struct kiocb *req, bool compat)
+static ssize_t aio_run_iocb(struct kiocb *req, unsigned opcode,
+ char __user *buf, bool compat)
{
struct file *file = req->ki_filp;
ssize_t ret;
+ unsigned long nr_segs;
int rw;
fmode_t mode;
aio_rw_op *rw_op;
+ struct iovec inline_vec, *iovec = &inline_vec;

- switch (req->ki_opcode) {
+ switch (opcode) {
case IOCB_CMD_PREAD:
case IOCB_CMD_PREADV:
mode = FMODE_READ;
@@ -1040,16 +1045,21 @@ rw_common:
if (!rw_op)
return -EINVAL;

- ret = (req->ki_opcode == IOCB_CMD_PREADV ||
- req->ki_opcode == IOCB_CMD_PWRITEV)
- ? aio_setup_vectored_rw(rw, req, compat)
- : aio_setup_single_vector(rw, req);
+ ret = (opcode == IOCB_CMD_PREADV ||
+ opcode == IOCB_CMD_PWRITEV)
+ ? aio_setup_vectored_rw(req, rw, buf, &nr_segs,
+ &iovec, compat)
+ : aio_setup_single_vector(req, rw, buf, &nr_segs,
+ iovec);
if (ret)
return ret;

ret = rw_verify_area(rw, file, &req->ki_pos, req->ki_nbytes);
- if (ret < 0)
+ if (ret < 0) {
+ if (iovec != &inline_vec)
+ kfree(iovec);
return ret;
+ }

req->ki_nbytes = ret;

@@ -1063,8 +1073,7 @@ rw_common:
if (rw == WRITE)
file_start_write(file);

- ret = rw_op(req, req->ki_iovec,
- req->ki_nr_segs, req->ki_pos);
+ ret = rw_op(req, iovec, nr_segs, req->ki_pos);

if (rw == WRITE)
file_end_write(file);
@@ -1089,6 +1098,9 @@ rw_common:
return -EINVAL;
}

+ if (iovec != &inline_vec)
+ kfree(iovec);
+
if (ret != -EIOCBQUEUED) {
/*
* There's no easy way to restart the syscall since other AIO's
@@ -1160,12 +1172,11 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
req->ki_obj.user = user_iocb;
req->ki_user_data = iocb->aio_data;
req->ki_pos = iocb->aio_offset;
-
- req->ki_buf = (char __user *)(unsigned long)iocb->aio_buf;
req->ki_nbytes = iocb->aio_nbytes;
- req->ki_opcode = iocb->aio_lio_opcode;

- ret = aio_run_iocb(req, compat);
+ ret = aio_run_iocb(req, iocb->aio_lio_opcode,
+ (char __user *)(unsigned long)iocb->aio_buf,
+ compat);
if (ret)
goto out_put_req;

diff --git a/include/linux/aio.h b/include/linux/aio.h
index 7bb766e..b570472 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -36,6 +36,7 @@ struct kiocb {
struct kioctx *ki_ctx; /* NULL for sync ops */
kiocb_cancel_fn *ki_cancel;
void (*ki_dtor)(struct kiocb *);
+ void *private;

union {
void __user *user;
@@ -44,15 +45,7 @@ struct kiocb {

__u64 ki_user_data; /* user's data for completion */
loff_t ki_pos;
-
- void *private;
- /* State that we remember to be able to restart/retry */
- unsigned short ki_opcode;
- size_t ki_nbytes; /* copy of iocb->aio_nbytes */
- char __user *ki_buf; /* remaining iocb->aio_buf */
- struct iovec ki_inline_vec; /* inline vector */
- struct iovec *ki_iovec;
- unsigned long ki_nr_segs;
+ size_t ki_nbytes; /* copy of iocb->aio_nbytes */

struct list_head ki_list; /* the aio core uses this
* for cancellation */
--
1.8.2.1

2013-05-14 01:19:12

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 02/21] aio: reqs_active -> reqs_available

The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.

We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.

So as prep work for that, we convert reqs_active to reqs_available.
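
As a rough sketch of that slot model (simplified; ctx and nr_slots stand in
for the real fields, and the actual conversion is in the diff below):

	/* reqs_available counts free completion-ring slots */
	atomic_set(&ctx->reqs_available, nr_slots);	/* at io_setup() */

	/* submitting a kiocb takes a slot... */
	if (atomic_dec_if_positive(&ctx->reqs_available) < 0)
		return -EAGAIN;		/* no free slots - the ring would overflow */

	/* ...and pulling its io_event off the ring gives the slot back */
	atomic_inc(&ctx->reqs_available);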

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 35 ++++++++++++++++++++---------------
1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index fe794af..bde41c1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -89,7 +89,13 @@ struct kioctx {
struct work_struct rcu_work;

struct {
- atomic_t reqs_active;
+ /*
+ * This counts the number of available slots in the ringbuffer,
+ * so we avoid overflowing it: it's decremented (if positive)
+ * when allocating a kiocb and incremented when the resulting
+ * io_event is pulled off the ringbuffer.
+ */
+ atomic_t reqs_available;
} ____cacheline_aligned_in_smp;

struct {
@@ -306,19 +312,19 @@ static void free_ioctx(struct kioctx *ctx)
head = ring->head;
kunmap_atomic(ring);

- while (atomic_read(&ctx->reqs_active) > 0) {
+ while (atomic_read(&ctx->reqs_available) < ctx->nr_events - 1) {
wait_event(ctx->wait,
(head != ctx->tail) ||
- (atomic_read(&ctx->reqs_active) <= 0));
+ (atomic_read(&ctx->reqs_available) >= ctx->nr_events - 1));

avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

- atomic_sub(avail, &ctx->reqs_active);
+ atomic_add(avail, &ctx->reqs_available);
head += avail;
head %= ctx->nr_events;
}

- WARN_ON(atomic_read(&ctx->reqs_active) < 0);
+ WARN_ON(atomic_read(&ctx->reqs_available) > ctx->nr_events - 1);

aio_free_ring(ctx);

@@ -382,6 +388,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (aio_setup_ring(ctx) < 0)
goto out_freectx;

+ atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+
/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
if (aio_nr + nr_events > aio_max_nr ||
@@ -484,7 +492,7 @@ void exit_aio(struct mm_struct *mm)
"exit_aio:ioctx still alive: %d %d %d\n",
atomic_read(&ctx->users),
atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_active));
+ atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -516,12 +524,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_read(&ctx->reqs_active) >= ctx->nr_events)
+ if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
return NULL;

- if (atomic_inc_return(&ctx->reqs_active) > ctx->nr_events - 1)
- goto out_put;
-
req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
if (unlikely(!req))
goto out_put;
@@ -531,7 +536,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

return req;
out_put:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
return NULL;
}

@@ -602,7 +607,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)

/*
* Take rcu_read_lock() in case the kioctx is being destroyed, as we
- * need to issue a wakeup after decrementing reqs_active.
+ * need to issue a wakeup after incrementing reqs_available.
*/
rcu_read_lock();

@@ -620,7 +625,7 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
}
@@ -758,7 +763,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_sub(ret, &ctx->reqs_active);
+ atomic_add(ret, &ctx->reqs_available);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1142,7 +1147,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
- atomic_dec(&ctx->reqs_active);
+ atomic_inc(&ctx->reqs_available);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.2.1

2013-05-14 01:24:30

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 06/21] aio: io_cancel() no longer returns the io_event

Originally, io_cancel() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.

But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.

This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.

Also tweak the refcounting in kiocb_cancel() to make more sense.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
---
drivers/usb/gadget/inode.c | 3 +--
fs/aio.c | 40 ++++++++++------------------------------
include/linux/aio.h | 2 +-
3 files changed, 12 insertions(+), 33 deletions(-)

diff --git a/drivers/usb/gadget/inode.c b/drivers/usb/gadget/inode.c
index 570c005..e02c1e0 100644
--- a/drivers/usb/gadget/inode.c
+++ b/drivers/usb/gadget/inode.c
@@ -524,7 +524,7 @@ struct kiocb_priv {
unsigned actual;
};

-static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
+static int ep_aio_cancel(struct kiocb *iocb)
{
struct kiocb_priv *priv = iocb->private;
struct ep_data *epdata;
@@ -540,7 +540,6 @@ static int ep_aio_cancel(struct kiocb *iocb, struct io_event *e)
// spin_unlock(&epdata->dev->lock);
local_irq_enable();

- aio_put_req(iocb);
return value;
}

diff --git a/fs/aio.c b/fs/aio.c
index 93383b0..5e1b801 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -259,8 +259,7 @@ void kiocb_set_cancel_fn(struct kiocb *req, kiocb_cancel_fn *cancel)
}
EXPORT_SYMBOL(kiocb_set_cancel_fn);

-static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
- struct io_event *res)
+static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb)
{
kiocb_cancel_fn *old, *cancel;
int ret = -EINVAL;
@@ -282,12 +281,10 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
atomic_inc(&kiocb->ki_users);
spin_unlock_irq(&ctx->ctx_lock);

- memset(res, 0, sizeof(*res));
- res->obj = (u64)(unsigned long)kiocb->ki_obj.user;
- res->data = kiocb->ki_user_data;
- ret = cancel(kiocb, res);
+ ret = cancel(kiocb);

spin_lock_irq(&ctx->ctx_lock);
+ aio_put_req(kiocb);

return ret;
}
@@ -308,7 +305,6 @@ static void free_ioctx_rcu(struct rcu_head *head)
static void free_ioctx(struct kioctx *ctx)
{
struct aio_ring *ring;
- struct io_event res;
struct kiocb *req;
unsigned cpu, head, avail;

@@ -319,7 +315,7 @@ static void free_ioctx(struct kioctx *ctx)
struct kiocb, ki_list);

list_del_init(&req->ki_list);
- kiocb_cancel(ctx, req, &res);
+ kiocb_cancel(ctx, req);
}

spin_unlock_irq(&ctx->ctx_lock);
@@ -709,21 +705,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
}

/*
- * cancelled requests don't get events, userland was given one
- * when the event got cancelled.
- */
- if (unlikely(xchg(&iocb->ki_cancel,
- KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
- /*
- * Can't use the percpu reqs_available here - could race with
- * free_ioctx()
- */
- atomic_inc(&ctx->reqs_available);
- /* Still need the wake_up in case free_ioctx is waiting */
- goto put_rq;
- }
-
- /*
* Add a completion event to the ring buffer. Must be done holding
* ctx->ctx_lock to prevent other code from messing with the tail
* pointer since we might be called from irq context.
@@ -775,7 +756,6 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
if (iocb->ki_eventfd != NULL)
eventfd_signal(iocb->ki_eventfd, 1);

-put_rq:
/* everything turned out well, dispose of the aiocb. */
aio_put_req(iocb);

@@ -1352,7 +1332,6 @@ static struct kiocb *lookup_kiocb(struct kioctx *ctx, struct iocb __user *iocb,
SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,
struct io_event __user *, result)
{
- struct io_event res;
struct kioctx *ctx;
struct kiocb *kiocb;
u32 key;
@@ -1370,18 +1349,19 @@ SYSCALL_DEFINE3(io_cancel, aio_context_t, ctx_id, struct iocb __user *, iocb,

kiocb = lookup_kiocb(ctx, iocb, key);
if (kiocb)
- ret = kiocb_cancel(ctx, kiocb, &res);
+ ret = kiocb_cancel(ctx, kiocb);
else
ret = -EINVAL;

spin_unlock_irq(&ctx->ctx_lock);

if (!ret) {
- /* Cancellation succeeded -- copy the result
- * into the user's buffer.
+ /*
+ * The result argument is no longer used - the io_event is
+ * always delivered via the ring buffer. -EINPROGRESS indicates
+ * cancellation is in progress:
*/
- if (copy_to_user(result, &res, sizeof(res)))
- ret = -EFAULT;
+ ret = -EINPROGRESS;
}

put_ioctx(ctx);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 1bdf965..8c8dd1d 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -27,7 +27,7 @@ struct kiocb;
*/
#define KIOCB_CANCELLED ((void *) (~0ULL))

-typedef int (kiocb_cancel_fn)(struct kiocb *, struct io_event *);
+typedef int (kiocb_cancel_fn)(struct kiocb *);

struct kiocb {
atomic_t ki_users;
--
1.8.2.1

2013-05-14 01:24:53

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 05/21] aio: percpu ioctx refcount

This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index c341cee..93383b0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -36,6 +36,7 @@
#include <linux/eventfd.h>
#include <linux/blkdev.h>
#include <linux/compat.h>
+#include <linux/percpu-refcount.h>

#include <asm/kmap_types.h>
#include <asm/uaccess.h>
@@ -65,8 +66,7 @@ struct kioctx_cpu {
};

struct kioctx {
- atomic_t users;
- atomic_t dead;
+ struct percpu_ref users;

/* This needs improving */
unsigned long user_id;
@@ -370,7 +370,7 @@ static void free_ioctx(struct kioctx *ctx)

static void put_ioctx(struct kioctx *ctx)
{
- if (unlikely(atomic_dec_and_test(&ctx->users)))
+ if (percpu_ref_put(&ctx->users))
free_ioctx(ctx);
}

@@ -411,8 +411,13 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

ctx->max_reqs = nr_events;

- atomic_set(&ctx->users, 2);
- atomic_set(&ctx->dead, 0);
+ if (percpu_ref_init(&ctx->users))
+ goto out_freectx;
+
+ rcu_read_lock();
+ percpu_ref_get(&ctx->users);
+ rcu_read_unlock();
+
spin_lock_init(&ctx->ctx_lock);
spin_lock_init(&ctx->completion_lock);
mutex_init(&ctx->ring_lock);
@@ -422,7 +427,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

ctx->cpu = alloc_percpu(struct kioctx_cpu);
if (!ctx->cpu)
- goto out_freectx;
+ goto out_freeref;

if (aio_setup_ring(ctx) < 0)
goto out_freepcpu;
@@ -455,6 +460,8 @@ out_cleanup:
aio_free_ring(ctx);
out_freepcpu:
free_percpu(ctx->cpu);
+out_freeref:
+ free_percpu(ctx->users.pcpu_count);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -484,7 +491,7 @@ static void kill_ioctx_rcu(struct rcu_head *head)
*/
static void kill_ioctx(struct kioctx *ctx)
{
- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
/* Between hlist_del_rcu() and dropping the initial ref */
synchronize_rcu();
@@ -530,12 +537,6 @@ void exit_aio(struct mm_struct *mm)
struct hlist_node *n;

hlist_for_each_entry_safe(ctx, n, &mm->ioctx_list, list) {
- if (1 != atomic_read(&ctx->users))
- printk(KERN_DEBUG
- "exit_aio:ioctx still alive: %d %d %d\n",
- atomic_read(&ctx->users),
- atomic_read(&ctx->dead),
- atomic_read(&ctx->reqs_available));
/*
* We don't need to bother with munmap() here -
* exit_mmap(mm) is coming and it'll unmap everything.
@@ -546,7 +547,7 @@ void exit_aio(struct mm_struct *mm)
*/
ctx->mmap_size = 0;

- if (!atomic_xchg(&ctx->dead, 1)) {
+ if (percpu_ref_kill(&ctx->users)) {
hlist_del_rcu(&ctx->list);
call_rcu(&ctx->rcu_head, kill_ioctx_rcu);
}
@@ -657,7 +658,7 @@ static struct kioctx *lookup_ioctx(unsigned long ctx_id)

hlist_for_each_entry_rcu(ctx, &mm->ioctx_list, list) {
if (ctx->user_id == ctx_id) {
- atomic_inc(&ctx->users);
+ percpu_ref_get(&ctx->users);
ret = ctx;
break;
}
@@ -870,7 +871,7 @@ static bool aio_read_events(struct kioctx *ctx, long min_nr, long nr,
if (ret > 0)
*i += ret;

- if (unlikely(atomic_read(&ctx->dead)))
+ if (unlikely(percpu_ref_dead(&ctx->users)))
ret = -EINVAL;

if (!*i)
--
1.8.2.1

2013-05-14 01:19:09

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 01/21] aio: fix kioctx not being freed after cancellation at exit time

From: Benjamin LaHaise <[email protected]>

The recent changes overhauling fs/aio.c introduced a bug that results in the
kioctx not being freed when outstanding kiocbs are cancelled at exit_aio()
time. Specifically, a kiocb that is cancelled has its completion events
discarded by batch_complete_aio(), which then fails to wake up the process
stuck in free_ioctx(). Fix this by modifying the wait_event() condition
in free_ioctx() appropriately.

This patch was tested with the cancel operation in the thread based code
posted yesterday.

Signed-off-by: Benjamin LaHaise <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Josh Boyer <[email protected]>
Cc: Zach Brown <[email protected]>
---
fs/aio.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/aio.c b/fs/aio.c
index c5b1a8c..fe794af 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -307,7 +307,9 @@ static void free_ioctx(struct kioctx *ctx)
kunmap_atomic(ring);

while (atomic_read(&ctx->reqs_active) > 0) {
- wait_event(ctx->wait, head != ctx->tail);
+ wait_event(ctx->wait,
+ (head != ctx->tail) ||
+ (atomic_read(&ctx->reqs_active) <= 0));

avail = (head <= ctx->tail ? ctx->tail : ctx->nr_events) - head;

--
1.8.2.1

2013-05-14 01:25:12

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 04/21] Generic percpu refcounting

This implements a refcount with similar semantics to
atomic_inc()/atomic_dec_and_test() - but percpu.

It also implements two stage shutdown, as we need it to tear down the
percpu counts. Before dropping the initial refcount, you must call
percpu_ref_kill(); this puts the refcount in "shutting down mode" and
switches back to a single atomic refcount with the appropriate barriers
(synchronize_rcu()).

It's also legal to call percpu_ref_kill() multiple times - it only returns
true once, so callers don't have to reimplement shutdown synchronization.
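
A minimal usage sketch against the API added below (struct foo and its
functions are made-up names; the kioctx conversion later in this series is
the real user):

	#include <linux/percpu-refcount.h>
	#include <linux/slab.h>

	struct foo {
		struct percpu_ref ref;	/* percpu_ref_init() gives it one (initial) ref */
	};

	static void foo_put(struct foo *foo)
	{
		/* only returns true once the count hits 0, after percpu_ref_kill() */
		if (percpu_ref_put(&foo->ref))
			kfree(foo);
	}

	static void foo_destroy(struct foo *foo)
	{
		/* switch back to atomic mode; returns true exactly once */
		if (percpu_ref_kill(&foo->ref))
			foo_put(foo);	/* now it's safe to drop the initial ref */
	}

percpu_ref_get()/foo_put() are the percpu fast paths in between, and
percpu_ref_put_initial_ref() below rolls the kill + final put pair into one
call.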

[[email protected]: fix build]
[[email protected]: coding-style tweak]
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/percpu-refcount.h | 118 +++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/percpu-refcount.c | 140 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 259 insertions(+), 1 deletion(-)
create mode 100644 include/linux/percpu-refcount.h
create mode 100644 lib/percpu-refcount.c

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
new file mode 100644
index 0000000..5bd35c7
--- /dev/null
+++ b/include/linux/percpu-refcount.h
@@ -0,0 +1,118 @@
+/*
+ * Dynamic percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <[email protected]>
+ *
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but percpu.
+ *
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ *
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ *
+ * USAGE:
+ *
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspace calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ *
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu() and synchronize_rcu() to remove
+ * the kioctx from the process's list of kioctxs - after that, there can't be
+ * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
+ * the initial ref with percpu_ref_put().
+ *
+ * Code that does a two stage shutdown like this often needs some kind of
+ * explicit synchronization to ensure the initial refcount can only be dropped
+ * once - percpu_ref_kill() does this for you, it returns true once and false if
+ * someone else already called it. The aio code uses it this way, but it's not
+ * necessary if the code has some other mechanism to synchronize teardown.
+ */
+
+#ifndef _LINUX_PERCPU_REFCOUNT_H
+#define _LINUX_PERCPU_REFCOUNT_H
+
+#include <linux/atomic.h>
+#include <linux/kernel.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+
+struct percpu_ref {
+ atomic_t count;
+ unsigned __percpu *pcpu_count;
+};
+
+int percpu_ref_init(struct percpu_ref *ref);
+int percpu_ref_tryget(struct percpu_ref *ref);
+int percpu_ref_put_initial_ref(struct percpu_ref *ref);
+
+/**
+ * percpu_ref_get - increment a dynamic percpu refcount
+ *
+ * Analogous to atomic_inc().
+ */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ __this_cpu_inc(*pcpu_count);
+ else
+ atomic_inc(&ref->count);
+
+ preempt_enable();
+}
+
+/**
+ * percpu_ref_put - decrement a dynamic percpu refcount
+ *
+ * Returns true if the result is 0, otherwise false; only checks for the ref
+ * hitting 0 after percpu_ref_kill() has been called. Analogous to
+ * atomic_dec_and_test().
+ */
+static inline int percpu_ref_put(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ int ret = 0;
+
+ preempt_disable();
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ __this_cpu_dec(*pcpu_count);
+ else
+ ret = atomic_dec_and_test(&ref->count);
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref);
+int percpu_ref_kill(struct percpu_ref *ref);
+
+/**
+ * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
+ *
+ * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
+ */
+static inline int percpu_ref_dead(struct percpu_ref *ref)
+{
+ return ref->pcpu_count == NULL;
+}
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index e9c52e1..25a0ce1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o
+ earlycpio.o percpu-refcount.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
new file mode 100644
index 0000000..4a0155b
--- /dev/null
+++ b/lib/percpu-refcount.c
@@ -0,0 +1,140 @@
+#define pr_fmt(fmt) "%s: " fmt "\n", __func__
+
+#include <linux/kernel.h>
+#include <linux/percpu-refcount.h>
+
+/*
+ * The trick to implementing percpu refcounts is shutdown. We can't detect the
+ * ref hitting 0 on every put - this would require global synchronization and
+ * defeat the whole purpose of using percpu refs.
+ *
+ * What we do is require the user to keep track of the initial refcount; we know
+ * the ref can't hit 0 before the user drops the initial ref, so as long as we
+ * convert to non percpu mode before the initial ref is dropped everything
+ * works.
+ *
+ * Converting to non percpu mode is done with some RCUish stuff in
+ * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
+ * can't hit 0 before we've added up all the percpu refs.
+ */
+
+#define PCPU_COUNT_BIAS (1ULL << 31)
+
+int percpu_ref_tryget(struct percpu_ref *ref)
+{
+ int ret = 1;
+
+ preempt_disable();
+
+ if (!percpu_ref_dead(ref))
+ percpu_ref_get(ref);
+ else
+ ret = 0;
+
+ preempt_enable();
+
+ return ret;
+}
+
+unsigned percpu_ref_count(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ unsigned count = 0;
+ int cpu;
+
+ preempt_disable();
+
+ count = atomic_read(&ref->count);
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ if (pcpu_count)
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(pcpu_count, cpu);
+
+ preempt_enable();
+
+ return count;
+}
+
+/**
+ * percpu_ref_init - initialize a dynamic percpu refcount
+ *
+ * Initializes the refcount in single atomic counter mode with a refcount of 1;
+ * analogous to atomic_set(ref, 1).
+ */
+int percpu_ref_init(struct percpu_ref *ref)
+{
+ atomic_set(&ref->count, 1 + PCPU_COUNT_BIAS);
+
+ ref->pcpu_count = alloc_percpu(unsigned);
+ if (!ref->pcpu_count)
+ return -ENOMEM;
+
+ return 0;
+}
+
+/**
+ * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
+ *
+ * Must be called before dropping the initial ref, so that percpu_ref_put()
+ * knows to check for the refcount hitting 0. If the refcount was in percpu
+ * mode, converts it back to single atomic counter mode.
+ *
+ * The caller must issue a synchronize_rcu()/call_rcu() before calling
+ * percpu_ref_put() to drop the initial ref.
+ *
+ * Returns true the first time called on @ref and false if @ref is already
+ * shutting down, so it may be used by the caller for synchronizing other parts
+ * of a two stage shutdown.
+ */
+int percpu_ref_kill(struct percpu_ref *ref)
+{
+ unsigned __percpu *pcpu_count;
+ unsigned __percpu *old;
+ unsigned count = 0;
+ int cpu;
+
+ pcpu_count = ACCESS_ONCE(ref->pcpu_count);
+
+ do {
+ if (!pcpu_count)
+ return 0;
+
+ old = pcpu_count;
+ pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
+ } while (pcpu_count != old);
+
+ synchronize_sched();
+
+ for_each_possible_cpu(cpu)
+ count += *per_cpu_ptr(pcpu_count, cpu);
+
+ free_percpu(pcpu_count);
+
+ pr_debug("global %lli pcpu %i",
+ (int64_t) atomic_read(&ref->count), (int) count);
+
+ atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
+
+ return 1;
+}
+
+/**
+ * percpu_ref_put_initial_ref - safely drop the initial ref
+ *
+ * A percpu refcount needs a shutdown sequence before dropping the initial ref,
+ * to put it back into single atomic_t mode with the appropriate barriers so
+ * that percpu_ref_put() can safely check for it hitting 0 - this does so.
+ *
+ * Returns true if @ref hit 0.
+ */
+int percpu_ref_put_initial_ref(struct percpu_ref *ref)
+{
+ if (percpu_ref_kill(ref)) {
+ return percpu_ref_put(ref);
+ } else {
+ WARN_ON(1);
+ return 0;
+ }
+}
--
1.8.2.1

2013-05-14 01:25:37

by Kent Overstreet

[permalink] [raw]
Subject: [PATCH 03/21] aio: percpu reqs_available

See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.

Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.

We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before; it was already being rounded up to the
next page, so this doesn't change the actual semantics.
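
Illustrative arithmetic (made-up numbers, and ignoring that aio_setup_ring()
rounds the ring up to whole pages): with num_possible_cpus() == 4 and
io_setup(128),

	nr_events = max(128, 4 * 4) = 128
	nr_events *= 2			-> 256 ring slots requested
	req_batch = (256 - 1) / (4 * 4) = 15

so roughly 15 slots move between the global reqs_available counter and a
cpu's local counter at a time.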

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Zach Brown <[email protected]>
Cc: Felipe Balbi <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Asai Thambi S P <[email protected]>
Cc: Selvan Mani <[email protected]>
Cc: Sam Bradshaw <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Benjamin LaHaise <[email protected]>
Reviewed-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/aio.c | 106 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 99 insertions(+), 7 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index bde41c1..c341cee 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -26,6 +26,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/mmu_context.h>
+#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/timer.h>
#include <linux/aio.h>
@@ -59,6 +60,10 @@ struct aio_ring {

#define AIO_RING_PAGES 8

+struct kioctx_cpu {
+ unsigned reqs_available;
+};
+
struct kioctx {
atomic_t users;
atomic_t dead;
@@ -67,6 +72,13 @@ struct kioctx {
unsigned long user_id;
struct hlist_node list;

+ struct __percpu kioctx_cpu *cpu;
+
+ /*
+ * For percpu reqs_available, number of slots we move to/from global
+ * counter at a time:
+ */
+ unsigned req_batch;
/*
* This is what userspace passed to io_setup(), it's not used for
* anything but counting against the global max_reqs quota.
@@ -94,6 +106,8 @@ struct kioctx {
* so we avoid overflowing it: it's decremented (if positive)
* when allocating a kiocb and incremented when the resulting
* io_event is pulled off the ringbuffer.
+ *
+ * We batch accesses to it with a percpu version.
*/
atomic_t reqs_available;
} ____cacheline_aligned_in_smp;
@@ -281,6 +295,8 @@ static int kiocb_cancel(struct kioctx *ctx, struct kiocb *kiocb,
static void free_ioctx_rcu(struct rcu_head *head)
{
struct kioctx *ctx = container_of(head, struct kioctx, rcu_head);
+
+ free_percpu(ctx->cpu);
kmem_cache_free(kioctx_cachep, ctx);
}

@@ -294,7 +310,7 @@ static void free_ioctx(struct kioctx *ctx)
struct aio_ring *ring;
struct io_event res;
struct kiocb *req;
- unsigned head, avail;
+ unsigned cpu, head, avail;

spin_lock_irq(&ctx->ctx_lock);

@@ -308,6 +324,13 @@ static void free_ioctx(struct kioctx *ctx)

spin_unlock_irq(&ctx->ctx_lock);

+ for_each_possible_cpu(cpu) {
+ struct kioctx_cpu *kcpu = per_cpu_ptr(ctx->cpu, cpu);
+
+ atomic_add(kcpu->reqs_available, &ctx->reqs_available);
+ kcpu->reqs_available = 0;
+ }
+
ring = kmap_atomic(ctx->ring_pages[0]);
head = ring->head;
kunmap_atomic(ring);
@@ -360,6 +383,18 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
struct kioctx *ctx;
int err = -ENOMEM;

+ /*
+ * We keep track of the number of available ringbuffer slots, to prevent
+ * overflow (reqs_available), and we also use percpu counters for this.
+ *
+ * So since up to half the slots might be on other cpu's percpu counters
+ * and unavailable, double nr_events so userspace sees what they
+ * expected: additionally, we move req_batch slots to/from percpu
+ * counters at a time, so make sure that isn't 0:
+ */
+ nr_events = max(nr_events, num_possible_cpus() * 4);
+ nr_events *= 2;
+
/* Prevent overflows */
if ((nr_events > (0x10000000U / sizeof(struct io_event))) ||
(nr_events > (0x10000000U / sizeof(struct kiocb)))) {
@@ -385,10 +420,16 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)

INIT_LIST_HEAD(&ctx->active_reqs);

- if (aio_setup_ring(ctx) < 0)
+ ctx->cpu = alloc_percpu(struct kioctx_cpu);
+ if (!ctx->cpu)
goto out_freectx;

+ if (aio_setup_ring(ctx) < 0)
+ goto out_freepcpu;
+
atomic_set(&ctx->reqs_available, ctx->nr_events - 1);
+ ctx->req_batch = (ctx->nr_events - 1) / (num_possible_cpus() * 4);
+ BUG_ON(!ctx->req_batch);

/* limit the number of system wide aios */
spin_lock(&aio_nr_lock);
@@ -412,6 +453,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
out_cleanup:
err = -EAGAIN;
aio_free_ring(ctx);
+out_freepcpu:
+ free_percpu(ctx->cpu);
out_freectx:
kmem_cache_free(kioctx_cachep, ctx);
pr_debug("error allocating ioctx %d\n", err);
@@ -510,6 +553,52 @@ void exit_aio(struct mm_struct *mm)
}
}

+static void put_reqs_available(struct kioctx *ctx, unsigned nr)
+{
+ struct kioctx_cpu *kcpu;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ kcpu->reqs_available += nr;
+ while (kcpu->reqs_available >= ctx->req_batch * 2) {
+ kcpu->reqs_available -= ctx->req_batch;
+ atomic_add(ctx->req_batch, &ctx->reqs_available);
+ }
+
+ preempt_enable();
+}
+
+static bool get_reqs_available(struct kioctx *ctx)
+{
+ struct kioctx_cpu *kcpu;
+ bool ret = false;
+
+ preempt_disable();
+ kcpu = this_cpu_ptr(ctx->cpu);
+
+ if (!kcpu->reqs_available) {
+ int old, avail = atomic_read(&ctx->reqs_available);
+
+ do {
+ if (avail < ctx->req_batch)
+ goto out;
+
+ old = avail;
+ avail = atomic_cmpxchg(&ctx->reqs_available,
+ avail, avail - ctx->req_batch);
+ } while (avail != old);
+
+ kcpu->reqs_available += ctx->req_batch;
+ }
+
+ ret = true;
+ kcpu->reqs_available--;
+out:
+ preempt_enable();
+ return ret;
+}
+
/* aio_get_req
* Allocate a slot for an aio request. Increments the ki_users count
* of the kioctx so that the kioctx stays around until all requests are
@@ -524,7 +613,7 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)
{
struct kiocb *req;

- if (atomic_dec_if_positive(&ctx->reqs_available) <= 0)
+ if (!get_reqs_available(ctx))
return NULL;

req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
@@ -533,10 +622,9 @@ static inline struct kiocb *aio_get_req(struct kioctx *ctx)

atomic_set(&req->ki_users, 2);
req->ki_ctx = ctx;
-
return req;
out_put:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
return NULL;
}

@@ -625,6 +713,10 @@ void aio_complete(struct kiocb *iocb, long res, long res2)
*/
if (unlikely(xchg(&iocb->ki_cancel,
KIOCB_CANCELLED) == KIOCB_CANCELLED)) {
+ /*
+ * Can't use the percpu reqs_available here - could race with
+ * free_ioctx()
+ */
atomic_inc(&ctx->reqs_available);
/* Still need the wake_up in case free_ioctx is waiting */
goto put_rq;
@@ -763,7 +855,7 @@ static long aio_read_events_ring(struct kioctx *ctx,

pr_debug("%li h%u t%u\n", ret, head, ctx->tail);

- atomic_add(ret, &ctx->reqs_available);
+ put_reqs_available(ctx, ret);
out:
mutex_unlock(&ctx->ring_lock);

@@ -1147,7 +1239,7 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
aio_put_req(req); /* drop extra ref to req */
return 0;
out_put_req:
- atomic_inc(&ctx->reqs_available);
+ put_reqs_available(ctx, 1);
aio_put_req(req); /* drop extra ref to req */
aio_put_req(req); /* drop i/o ref to req */
return ret;
--
1.8.2.1

2013-05-14 13:52:24

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/13, Kent Overstreet wrote:
>
> +unsigned tag_alloc(struct tag_pool *pool, bool wait)
> +{
> + struct tag_cpu_freelist *tags;
> + unsigned long flags;
> + unsigned ret;
> +retry:
> + preempt_disable();
> + local_irq_save(flags);
> + tags = this_cpu_ptr(pool->tag_cpu);
> +
> + while (!tags->nr_free) {
> + spin_lock(&pool->lock);
> +
> + if (pool->nr_free)
> + move_tags(tags->free, &tags->nr_free,
> + pool->free, &pool->nr_free,
> + min(pool->nr_free, pool->watermark));
> + else if (wait) {
> + struct tag_waiter wait = { .task = current };
> +
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + list_add(&wait.list, &pool->wait);
> +
> + spin_unlock(&pool->lock);
> + local_irq_restore(flags);
> + preempt_enable();
> +
> + schedule();
> + __set_current_state(TASK_RUNNING);

schedule() always returns in TASK_RUNNING state

> +
> + if (!list_empty_careful(&wait.list)) {
> + spin_lock_irqsave(&pool->lock, flags);
> + list_del_init(&wait.list);
> + spin_unlock_irqrestore(&pool->lock, flags);

This is only theoretical, but racy.

tag_free() does

list_del_init(wait->list);
/* WINDOW */
wake_up_process(wait->task);

in theory the caller of tag_alloc() can notice list_empty_careful(),
return without taking pool->lock, exit, and free this task_struct.

But the main problem is that it is not clear why this code reimplements
add_wait_queue/wake_up_all, for what?

I must admit, I do not understand what this code actually does ;)
I didn't try to read it carefully though, but perhaps at least the
changelog could explain more?

Oleg.

2013-05-14 13:55:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On 05/13, Kent Overstreet wrote:
>
> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned __percpu *old;
> + unsigned count = 0;
> + int cpu;
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + do {
> + if (!pcpu_count)
> + return 0;
> +
> + old = pcpu_count;
> + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> + } while (pcpu_count != old);

This is purely cosmetic, feel free to ignore. But afaics all we
need is

pcpu_count = ACCESS_ONCE(ref->pcpu_count);
if (!cmpxchg(&ref->pcpu_count, pcpu_count, NULL))
return 0;

Oleg.

2013-05-14 14:28:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/14, Oleg Nesterov wrote:
>
> I must admit, I do not understand what this code actually does ;)
> I didn't try to read it carefully though, but perhaps at least the
> changelog could explain more?

OK, this is clear...

But perhaps the changelog could explain who needs the "fast" version
of, say, find_next_zero_bit + test_and_set_bit ;) Just curious.

Oleg.

2013-05-14 14:59:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hello,

On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> +/**
> + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> + *
> + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.

Explanation on synchronization and use cases would be nice. People
tend to develop massive mis-uses for interfaces like this.

> + */
> +static inline int percpu_ref_dead(struct percpu_ref *ref)
> +{
> + return ref->pcpu_count == NULL;
> +}
...
> +/*
> + * The trick to implementing percpu refcounts is shutdown. We can't detect the
> + * ref hitting 0 on every put - this would require global synchronization and
> + * defeat the whole purpose of using percpu refs.
> + *
> + * What we do is require the user to keep track of the initial refcount; we know
> + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> + * convert to non percpu mode before the initial ref is dropped everything
> + * works.

Can you please also explain why per-cpu wrapping is safe somewhere?

> + * Converting to non percpu mode is done with some RCUish stuff in
> + * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
> + * can't hit 0 before we've added up all the percpu refs.
> + */
> +
> +#define PCPU_COUNT_BIAS (1ULL << 31)

Are we sure this is enough? 1<<31 is a fairly large number but it's
just easy enough to breach from time to time and it's gonna be hellish
to reproduce / debug when it actually overflows. Maybe we want
atomic64_t w/ 1LLU << 63 bias? Or is there something else which
guarantees that the bias can't over/underflow?

> +int percpu_ref_tryget(struct percpu_ref *ref)
> +{
> + int ret = 1;
> +
> + preempt_disable();
> +
> + if (!percpu_ref_dead(ref))
> + percpu_ref_get(ref);
> + else
> + ret = 0;
> +
> + preempt_enable();
> +
> + return ret;
> +}

Why isn't the above one inline?

Why no /** comment on public functions? It'd be great if you can
explicitly warn about the racy nature of the function - especially,
the function may return overflowed or zero refcnt. BTW, why is this
function necessary? What's the use case?

> +unsigned percpu_ref_count(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned count = 0;
> + int cpu;
> +
> + preempt_disable();
> +
> + count = atomic_read(&ref->count);
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (pcpu_count)
> + for_each_possible_cpu(cpu)
> + count += *per_cpu_ptr(pcpu_count, cpu);
> +
> + preempt_enable();
> +
> + return count;
> +}
...
> +/**
> + * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
> + *
> + * Must be called before dropping the initial ref, so that percpu_ref_put()
> + * knows to check for the refcount hitting 0. If the refcount was in percpu
> + * mode, converts it back to single atomic counter mode.
> + *
> + * The caller must issue a synchronize_rcu()/call_rcu() before calling
> + * percpu_ref_put() to drop the initial ref.
> + *
> + * Returns true the first time called on @ref and false if @ref is already
> + * shutting down, so it may be used by the caller for synchronizing other parts
> + * of a two stage shutdown.
> + */

I'm not sure I like this interface. Why does it allow being called
multiple times? Why is that necessary? Wouldn't just making it
return void and trigger WARN_ON() if it detects that it's being called
multiple times better? Also, why not bool if the return value is
true/false?

> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> + unsigned __percpu *pcpu_count;
> + unsigned __percpu *old;
> + unsigned count = 0;
> + int cpu;
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + do {
> + if (!pcpu_count)
> + return 0;
> +
> + old = pcpu_count;
> + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> + } while (pcpu_count != old);
> +
> + synchronize_sched();

And this makes the whole function blocking. Why not use call_rcu() so
that the ref can be killed w/o sleepable context too?

> +
> + for_each_possible_cpu(cpu)
> + count += *per_cpu_ptr(pcpu_count, cpu);
> +
> + free_percpu(pcpu_count);
> +
> + pr_debug("global %lli pcpu %i",
> + (int64_t) atomic_read(&ref->count), (int) count);
> +
> + atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
> +
> + return 1;
> +}
> +
> +/**
> + * percpu_ref_put_initial_ref - safely drop the initial ref
> + *
> + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> + * to put it back into single atomic_t mode with the appropriate barriers so
> + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> + *
> + * Returns true if @ref hit 0.
> + */
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> +{
> + if (percpu_ref_kill(ref)) {
> + return percpu_ref_put(ref);
> + } else {
> + WARN_ON(1);
> + return 0;
> + }
> +}

Can we just roll the above into percpu_ref_kill()? It's much harder
to misuse if kill puts the base ref.

Thanks.

--
tejun

2013-05-14 15:03:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Mon, May 13, 2013 at 06:18:54PM -0700, Kent Overstreet wrote:
> +struct tag_pool {
> + unsigned watermark;
> + unsigned nr_tags;
> +
> + struct tag_cpu_freelist *tag_cpu;
> +
> + struct {
> + /* Global freelist */
> + unsigned nr_free;
> + unsigned *free;
> + spinlock_t lock;
> + struct list_head wait;
> + } ____cacheline_aligned;
> +};

Come on, Kent. No comment at all in the whole posting and no
justification for the patch or explanation of use cases?

--
tejun

2013-05-14 15:32:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On 05/14, Tejun Heo wrote:
>
> > +int percpu_ref_tryget(struct percpu_ref *ref)
> > +{
> > + int ret = 1;
> > +
> > + preempt_disable();
> > +
> > + if (!percpu_ref_dead(ref))
> > + percpu_ref_get(ref);
> > + else
> > + ret = 0;
> > +
> > + preempt_enable();
> > +
> > + return ret;
> > +}
...
> BTW, why is this
> function necessary? What's the use case?

Yes, I was wondering too.

And please note that this code _looks_ wrong, percpu_ref_get() still
can increment ref->count.

Hmm. Just noticed this comment above percpu_ref_kill()

* The caller must issue a synchronize_rcu()/call_rcu() before calling
* percpu_ref_put() to drop the initial ref.

Really?

Oleg.

2013-05-14 21:59:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

A couple more things.

On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
...
> +/**
> + * percpu_ref_put - decrement a dynamic percpu refcount
> + *
> + * Returns true if the result is 0, otherwise false; only checks for the ref
> + * hitting 0 after percpu_ref_kill() has been called. Analagous to
> + * atomic_dec_and_test().
> + */
> +static inline int percpu_ref_put(struct percpu_ref *ref)

bool?

> +{
> + unsigned __percpu *pcpu_count;
> + int ret = 0;
> +
> + preempt_disable();
> +
> + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> + if (pcpu_count)

We probably want likely() here.

> + __this_cpu_dec(*pcpu_count);
> + else
> + ret = atomic_dec_and_test(&ref->count);
> +
> + preempt_enable();
> +
> + return ret;

With likely() added, I think the compiler should be able to recognize
that the branch on pcpu_count should exclude the later branch in the
caller testing for the final put in most cases, but I'm a bit worried
whether that would always be the case and wonder whether a ->release
based interface would be better. Another concern is that the above
interface is likely to encourage its users to put the release
implementation in the same function. e.g.

void my_put(my_obj)
{
if (!percpu_ref_put(&my_obj->ref))
return;
destroy my_obj;
free my_obj;
}

Which in turn is likely to nudge the developer or compiler towards not
inlining the fast path.

So, while I do like the simplicity of put() returning %true on the
final put, I suspect it's more likely to slow down fast paths due
to its interface compared to having separate ->release function
combined with void put(). Any ideas?

Thanks.

--
tejun

2013-05-14 22:15:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hello, again, continuing the brain diarrhea,

On Tue, May 14, 2013 at 02:59:45PM -0700, Tejun Heo wrote:
> So, while I do like the simplicity of put() returning %true on the
> final put, I suspect it's more likely to slowing down fast paths due
> to its interface compared to having separate ->release function
> combined with void put(). Any ideas?

Maybe we can structure put in a way that's difficult to get wrong for
the compiler?

bool put()
{
preempt_disable();
if (likely(not killed yet)) {
this_cpu_dec();
preempt_enable();
return false;
}
return put_slowpath();
}

This doesn't solve the caller not inlining hot path but well I suppose
we can consider that the caller's problem. The above at least
wouldn't introduce an unnecessary branch on its own.
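
Concretely, something like this - where percpu_ref_put_slowpath() is just an
assumed out-of-line helper doing the atomic_dec_and_test(), not anything in
the posted series:

	static inline bool percpu_ref_put(struct percpu_ref *ref)
	{
		unsigned __percpu *pcpu_count;

		preempt_disable();
		pcpu_count = ACCESS_ONCE(ref->pcpu_count);
		if (likely(pcpu_count)) {
			/* percpu mode: can't be the final put, no 0 check needed */
			__this_cpu_dec(*pcpu_count);
			preempt_enable();
			return false;
		}
		preempt_enable();

		/* atomic mode (after percpu_ref_kill()): check for the final put */
		return percpu_ref_put_slowpath(ref);
	}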

Thanks.

--
tejun

2013-05-15 08:21:49

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 03:51:01PM +0200, Oleg Nesterov wrote:
> On 05/13, Kent Overstreet wrote:
> >
> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > + unsigned __percpu *pcpu_count;
> > + unsigned __percpu *old;
> > + unsigned count = 0;
> > + int cpu;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + do {
> > + if (!pcpu_count)
> > + return 0;
> > +
> > + old = pcpu_count;
> > + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> > + } while (pcpu_count != old);
>
> This is purely cosmetic, feel free to ignore. But afaics all we
> need is
>
> pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> if (!cmpxchg(&ref->pcpu_count, pcpu_count, NULL))
> return 0;

Whoops, yep. I was ripping out the dynamic stuff from my dynamic percpu
refcount code and missed that bit.

2013-05-15 08:59:36

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 07:59:32AM -0700, Tejun Heo wrote:
> Hello,
>
> On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> > +/**
> > + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> > + *
> > + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
>
> Explanation on synchronization and use cases would be nice. People
> tend to develop massive mis-uses for interfaces like this.

hrm, kind of hard to know exactly what to say without seeing how people
misuse it first. How about this?

* Returns true the first time called on @ref and false if percpu_ref_kill() has
* already been called on @ref.
*
* The return value can optionally be used to synchronize shutdown, when
* multiple threads could try to destroy an object at the same time - if
* percpu_ref_kill() returns true, then this thread should release the initial
* refcount - see percpu_ref_put_initial_ref().



> > + */
> > +static inline int percpu_ref_dead(struct percpu_ref *ref)
> > +{
> > + return ref->pcpu_count == NULL;
> > +}
> ...
> > +/*
> > + * The trick to implementing percpu refcounts is shutdown. We can't detect the
> > + * ref hitting 0 on every put - this would require global synchronization and
> > + * defeat the whole purpose of using percpu refs.
> > + *
> > + * What we do is require the user to keep track of the initial refcount; we know
> > + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> > + * convert to non percpu mode before the initial ref is dropped everything
> > + * works.
>
> Can you please also explain why per-cpu wrapping is safe somewhere?

I feel like we had this exact discussion before and I came up with some
sort of explanation but I can't remember what I came up with. Here's
what I've got now...

* Initially, a percpu refcount is just a set of percpu counters. In that mode we
* don't try to detect the ref hitting 0 - which means that get/put can just
* increment or decrement the local counter. Note that the counter on a
* particular cpu can (and will) wrap - this is fine; when we go to shutdown, the
* percpu counters will all sum to the correct value (because modular arithmetic
* is commutative).
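
(A toy illustration of the wrapping, assuming the true refcount always fits
in 32 bits:

	unsigned a = 0, b = 0;	/* per-cpu deltas */

	a += 3;			/* three gets on one cpu */
	b -= 2;			/* two puts on another cpu: wraps to 0xfffffffe */

	/* summed at kill time: 3 + 0xfffffffe == 1 (mod 2^32) */
	BUG_ON(a + b != 1);

i.e. each cpu's counter is only meaningful mod 2^32, but their sum mod 2^32
is the true count.)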

> > + * Converting to non percpu mode is done with some RCUish stuff in
> > + * percpu_ref_kill. Additionally, we need a bias value so that the atomic_t
> > + * can't hit 0 before we've added up all the percpu refs.
> > + */
> > +
> > +#define PCPU_COUNT_BIAS (1ULL << 31)
>
> Are we sure this is enough? 1<<31 is a fairly large number but it's
> just easy enough to breach from time to time and it's gonna be hellish
> to reproduce / debug when it actually overflows. Maybe we want
> atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> guarantees that the bias can't over/underflow?

Well, it has the effect of halving the usable range of the refcount,
which I think is probably ok - the thing is, the range of an atomic_t
doesn't really correspond to anything useful on 64 bit machines so if
you're concerned about overflow you probably need to be using an
atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
too.

If we need a 64-ish bit refcount in the future (I don't think it matters
for AIO) I'll probably just make a percpu_ref64 - that uses u64s for the
percpu counters too.

Or... maybe just make this version use unsigned longs/atomic_long_ts
instead of 32 bit integers. I dunno, I'll think about it a bit.

I don't think it's urgent, it's easy to change the types if and when a
new user comes along for which it matters. IIRC the module code uses 32
bit ints for its refcounts and that's the next thing I was trying to
convert at one point.

> > +int percpu_ref_tryget(struct percpu_ref *ref)
> > +{
> > + int ret = 1;
> > +
> > + preempt_disable();
> > +
> > + if (!percpu_ref_dead(ref))
> > + percpu_ref_get(ref);
> > + else
> > + ret = 0;
> > +
> > + preempt_enable();
> > +
> > + return ret;
> > +}
>
> Why isn't the above one inline?

hmm, I suppose that's like two or three more instructions than normal
get, I'll make it inline.

> Why no /** comment on public functions? It'd be great if you can
> explicitly warn about the racy nature of the function - especially,
> the function may return overflowed or zero refcnt. BTW, why is this
> function necessary? What's the use case?

Module code - I should probably leave count() and tryget() out until the
module conversion is done. tryget() in particular is just trying to
match the existing module_tryget() and it could certainly be implemented
differently.

> > +/**
> > + * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
> > + *
> > + * Must be called before dropping the initial ref, so that percpu_ref_put()
> > + * knows to check for the refcount hitting 0. If the refcount was in percpu
> > + * mode, converts it back to single atomic counter mode.
> > + *
> > + * The caller must issue a synchronize_rcu()/call_rcu() before calling
> > + * percpu_ref_put() to drop the initial ref.
> > + *
> > + * Returns true the first time called on @ref and false if @ref is already
> > + * shutting down, so it may be used by the caller for synchronizing other parts
> > + * of a two stage shutdown.
> > + */
>
> I'm not sure I like this interface. Why does it allow being called
> multiple times? Why is that necessary? Wouldn't just making it
> return void and trigger WARN_ON() if it detects that it's being called
> multiple times better? Also, why not bool if the return value is
> true/false?

bool just feels a bit strange in the kernel because it's not used that
much, but yeah bool is correct here.

Whether it should return bool, or void like you said and WARN_ON() is
definitely debatable. I don't have a strong opinion on it - I did it
this way because it's commonly needed functionality and a convenient
place implement it, but if we see people misusing it in the future I
would definitely rip it out.

> > +int percpu_ref_kill(struct percpu_ref *ref)
> > +{
> > + unsigned __percpu *pcpu_count;
> > + unsigned __percpu *old;
> > + unsigned count = 0;
> > + int cpu;
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + do {
> > + if (!pcpu_count)
> > + return 0;
> > +
> > + old = pcpu_count;
> > + pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> > + } while (pcpu_count != old);
> > +
> > + synchronize_sched();
>
> And this makes the whole function blocking. Why not use call_rcu() so
> that the ref can be called w/o sleepable context too?

Because you need to know when percpu_ref_kill() finishes so you know
when it's safe to drop the initial ref - if percpu_ref_kill() used
call_rcu() itself, it would then have to be doing the put itself...
which means we'd have to stick a pointer to the release function in
struct percpu_ref.

But this is definitely going to be an issue... I was thinking about
using the low bit of the pointer to indicate that the ref is dead so
that the caller could use call_rcu() and then call another function to
gather up the percpu counters, but that's pretty ugly.

I may just stick the release function in struct percpu_ref and have
percpu_ref_kill() use call_rcu() after all...

> > +/**
> > + * percpu_ref_put_initial_ref - safely drop the initial ref
> > + *
> > + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> > + * to put it back into single atomic_t mode with the appropriate barriers so
> > + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> > + *
> > + * Returns true if @ref hit 0.
> > + */
> > +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> > +{
> > + if (percpu_ref_kill(ref)) {
> > + return percpu_ref_put(ref);
> > + } else {
> > + WARN_ON(1);
> > + return 0;
> > + }
> > +}
>
> Can we just roll the above into percpu_ref_kill()? It's much harder
> to misuse if kill puts the base ref.

Possibly... if we did that we'd also be getting rid of percpu_ref_kill's
synchronization functionality.

I want to wait until after the call_rcu() thing is decided before
futzing with this part; there are some dependencies.

2013-05-15 09:01:32

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 05:28:36PM +0200, Oleg Nesterov wrote:
> On 05/14, Tejun Heo wrote:
> >
> > > +int percpu_ref_tryget(struct percpu_ref *ref)
> > > +{
> > > + int ret = 1;
> > > +
> > > + preempt_disable();
> > > +
> > > + if (!percpu_ref_dead(ref))
> > > + percpu_ref_get(ref);
> > > + else
> > > + ret = 0;
> > > +
> > > + preempt_enable();
> > > +
> > > + return ret;
> > > +}
> ...
> > BTW, why is this
> > function necessary? What's the use case?
>
> Yes, I was wondering too.
>
> And please note that this code _looks_ wrong, percpu_ref_get() still
> can increment ref->count.

Yeah I see what you mean, I changed how ret is set.

But also splitting tryget() and count() out into another patch to go
with the module conversion.

> Hmm. Just noticed this comment above percpu_ref_kill()
>
> * The caller must issue a synchronize_rcu()/call_rcu() before calling
> * percpu_ref_put() to drop the initial ref.
>
> Really?

That's also left over from the dynamic version, whoops.

2013-05-15 09:08:25

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Tue, May 14, 2013 at 02:59:45PM -0700, Tejun Heo wrote:
> A couple more things.
>
> On Mon, May 13, 2013 at 06:18:41PM -0700, Kent Overstreet wrote:
> ...
> > +/**
> > + * percpu_ref_put - decrement a dynamic percpu refcount
> > + *
> > + * Returns true if the result is 0, otherwise false; only checks for the ref
> > + * hitting 0 after percpu_ref_kill() has been called. Analagous to
> > + * atomic_dec_and_test().
> > + */
> > +static inline int percpu_ref_put(struct percpu_ref *ref)
>
> bool?

Was int to match atomic_dec_and_test(), but switching to bool.

>
> > +{
> > + unsigned __percpu *pcpu_count;
> > + int ret = 0;
> > +
> > + preempt_disable();
> > +
> > + pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> > +
> > + if (pcpu_count)
>
> We probably want likely() here.

Yeah, I suppose so.

>
> > + __this_cpu_dec(*pcpu_count);
> > + else
> > + ret = atomic_dec_and_test(&ref->count);
> > +
> > + preempt_enable();
> > +
> > + return ret;
>
> With likely() added, I think the compiler should be able to recognize
> that the branch on pcpu_count should exclude later branch in the
> caller to test for the final put in most cases but I'm a bit worried
> whether that would always be the case and wonder whether ->release
> based interface would be better. Another concern is that the above
> interface is likely to encourage its users to put the release
> implementation in the same function. e.g.

I... don't follow what you mean here at all - what exactly would the
compiler do differently? and how would passing a release function
matter?

> void my_put(my_obj)
> {
> if (!percpu_ref_put(&my_obj->ref))
> return;
> destroy my_obj;
> free my_obj;
> }
>
> Which in turn is likely to nudge the developer or compiler towards not
> inlining the fast path.

I'm kind of skeptical partial inlining would be worth it for just an
atomic_dec_and_test()...

> So, while I do like the simplicity of put() returning %true on the
> final put, I suspect it's more likely to slowing down fast paths due
> to its interface compared to having separate ->release function
> combined with void put(). Any ideas?

Oh, you mean having one branch instead of two when we're in percpu mode.
Yeah, that is a good point.

I bet with the likely() added the compiler is going to generate the same
code either way, but I suppose I can have a look at what gcc actually
does...

2013-05-15 09:26:22

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> On 05/13, Kent Overstreet wrote:
> >
> > +unsigned tag_alloc(struct tag_pool *pool, bool wait)
> > +{
> > +	struct tag_cpu_freelist *tags;
> > +	unsigned long flags;
> > +	unsigned ret;
> > +retry:
> > +	preempt_disable();
> > +	local_irq_save(flags);
> > +	tags = this_cpu_ptr(pool->tag_cpu);
> > +
> > +	while (!tags->nr_free) {
> > +		spin_lock(&pool->lock);
> > +
> > +		if (pool->nr_free)
> > +			move_tags(tags->free, &tags->nr_free,
> > +				  pool->free, &pool->nr_free,
> > +				  min(pool->nr_free, pool->watermark));
> > +		else if (wait) {
> > +			struct tag_waiter wait = { .task = current };
> > +
> > +			__set_current_state(TASK_UNINTERRUPTIBLE);
> > +			list_add(&wait.list, &pool->wait);
> > +
> > +			spin_unlock(&pool->lock);
> > +			local_irq_restore(flags);
> > +			preempt_enable();
> > +
> > +			schedule();
> > +			__set_current_state(TASK_RUNNING);
>
> schedule() always returns in TASK_RUNNING state
>
> > +
> > +			if (!list_empty_careful(&wait.list)) {
> > +				spin_lock_irqsave(&pool->lock, flags);
> > +				list_del_init(&wait.list);
> > +				spin_unlock_irqrestore(&pool->lock, flags);
>
> This is only theoretical, but racy.
>
> tag_free() does
>
> list_del_init(wait->list);
> /* WINDOW */
> wake_up_process(wait->task);
>
> in theory the caller of tag_alloc() can notice list_empty_careful(),
> return without taking pool->lock, exit, and free this task_struct.
>
> But the main problem is that it is not clear why this code reimplements
> add_wait_queue/wake_up_all, for what?

To save on locking... there's really no point in another lock for the
wait queue. Could just use the wait queue lock instead I suppose, like
wait_event_interruptible_locked()

(the extra spin_lock()/unlock() might not really cost anything but
nested irqsave()/restore() is ridiculously expensive, IME).

> I must admit, I do not understand what this code actually does ;)
> I didn't try to read it carefully though, but perhaps at least the
> changelog could explain more?

The changelog is admittedly terse, but that's basically all there is to
it -

Say you've got a device where you can have multiple outstanding
commands - you'll identify commands/responses by some integer (the
"tag"). Typically you won't get a full 64 bits for the tag, it might be
10 or 16 or 32 bits or whatever - and even if you could use raw pointers
you wouldn't really want to, because then if the device gives you a garbage
response you're dereferencing an untrusted pointer - you want to allocate tag
structures out of a fixed array so you can validate responses.

So you preallocate all your tag structures up front - now you can refer
to them by small fixed integers. But if you want to be able to
efficiently allocate from the same pool of tags across multiple CPUs -
well, that's what this code is for.
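
To make that concrete, here's a rough sketch of how a driver might sit on
top of the allocator (tag_alloc() and struct tag_pool are from the patch;
struct my_cmd, nr_tags and the helpers are made up for illustration):

	struct my_cmd {				/* preallocated per-tag state */
		unsigned	tag;
		struct request	*rq;
	};

	struct my_dev {
		struct tag_pool	tags;		/* hands out integers 0..nr_tags-1 */
		unsigned	nr_tags;
		struct my_cmd	*cmds;		/* array indexed by tag */
	};

	static struct my_cmd *my_dev_start_cmd(struct my_dev *dev)
	{
		unsigned tag = tag_alloc(&dev->tags, true);	/* may sleep */

		dev->cmds[tag].tag = tag;
		return &dev->cmds[tag];
	}

	static struct my_cmd *my_dev_handle_response(struct my_dev *dev, unsigned tag)
	{
		/* the tag came from the hardware - validate it, never trust it */
		if (tag >= dev->nr_tags)
			return NULL;

		return &dev->cmds[tag];
	}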

2013-05-15 09:34:45

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Tue, May 14, 2013 at 04:24:42PM +0200, Oleg Nesterov wrote:
> On 05/14, Oleg Nesterov wrote:
> >
> > I must admit, I do not understand what this code actually does ;)
> > I didn't try to read it carefully though, but perhaps at least the
> > changelog could explain more?
>
> OK, this is clear...
>
> But perhaps the changelog could explain who needs the "fast" version
> of, say, find_next_zero_bit + test_and_set_bit ;) Just curious.

Originally I wrote it for a driver (which still isn't open source) - but
find_next_zero_bit()/test_and_set_bit() is exactly what it was using
before and the performance gain was significant :)

The reason I'm posting it now is because AIO currently uses a linked
list for tracking outstanding kiocbs - for cancellation - and that
linked list needs to be replaced; I'm implementing cancellation for
regular direct IO and the linked list is a performance issue.

All we need for cancellation is a way to iterate over all the
(potentially) allocated kiocbs - it's really exactly the same problem as
managing tags in the drivers I was working on before (they also need to
be able to time out tags which is exactly the same as AIO cancellation).

What I found really annoying about the problem is that the existing slab
allocator tracks exactly what we need... but it's not exposed (and
honestly probably shouldn't be).

So, there were two choices:
* hack up slab/slob/slub - fuck no
* reuse my tag allocator, allocate kiocbs out of an array of pages.
Also allocate the pages lazily so we don't regress on memory
overhead.

So, that's what I did.
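
A rough sketch of that last idea - kiocbs carved out of an array of pages,
looked up by tag (all the names here are illustrative, not from the series;
the page array would be filled in lazily):

	#define KIOCBS_PER_PAGE		(PAGE_SIZE / sizeof(struct kiocb))

	static struct kiocb *tag_to_kiocb(struct page **pages, unsigned tag)
	{
		struct page *p = pages[tag / KIOCBS_PER_PAGE];

		return (struct kiocb *) page_address(p) + (tag % KIOCBS_PER_PAGE);
	}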

2013-05-15 15:44:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 05/15, Kent Overstreet wrote:
>
> On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> > tag_free() does
> >
> > list_del_init(wait->list);
> > /* WINDOW */
> > wake_up_process(wait->task);
> >
> > in theory the caller of tag_alloc() can notice list_empty_careful(),
> > return without taking pool->lock, exit, and free this task_struct.
> >
> > But the main problem is that it is not clear why this code reimplements
> > add_wait_queue/wake_up_all, for what?
>
> To save on locking... there's really no point in another lock for the
> wait queue. Could just use the wait queue lock instead I suppose, like
> wait_event_interruptible_locked()

Yes. Or perhaps you can reuse wait_queue_head_t->lock for move_tags().

And,

> (the extra spin_lock()/unlock() might not really cost anything but
> nested irqsave()/restore() is ridiculously expensive, IME).

But this is the slow path anyway. Even if you do not use _locked, how
much can this extra locking (save/restore) really make things worse?

In any case, I believe it would be much better to reuse the code we
already have, to avoid the races and make the code more understandable.
And to not bloat the code.

Do you really think that, say,

unsigned tag_alloc(struct tag_pool *pool, bool wait)
{
	struct tag_cpu_freelist *tags;
	unsigned ret = 0;
retry:
	tags = get_cpu_ptr(pool->tag_cpu);
	local_irq_disable();
	if (!tags->nr_free && pool->nr_free) {
		spin_lock(&pool->wq.lock);
		if (pool->nr_free)
			move_tags(...);
		spin_unlock(&pool->wq.lock);
	}

	if (tags->nr_free)
		ret = tags->free[--tags->nr_free];
	local_irq_enable();
	put_cpu_var(pool->tag_cpu);

	if (ret || !wait)
		return ret;

	__wait_event(&pool->wq, pool->nr_free);
	goto retry;
}

will be much slower?

> > I must admit, I do not understand what this code actually does ;)
> > I didn't try to read it carefully though, but perhaps at least the
> > changelog could explain more?
>
> The changelog is admittedly terse, but that's basically all there is to
> it -
> [...snip...]

Yes, thanks for your explanation, I already realized what it does...

Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
Perhaps it should also take waitqueue_active() into account?
tag_alloc() can sleep more than necessary, it seems.

Oleg.

2013-05-15 16:14:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

Damn, sorry for extra noise, I forgot to ask this twice...

And what about cpu_down()? Perhaps tag_pool should do move_tags()
on CPU_DEAD?

Or at least it should be documented that the dead cpu can lose up
to 2 * watermark entries.

Oleg.

2013-05-15 17:37:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hey, Kent.

On Wed, May 15, 2013 at 01:58:56AM -0700, Kent Overstreet wrote:
> > Explanation on synchronization and use cases would be nice. People
> > tend to develop massive mis-uses for interfaces like this.
>
> hrm, kind of hard to know exactly what to say without seeing how people
> misuse it first. How about this?
>
> * Returns true the first time it's called on @ref, and false if percpu_ref_kill()
> * has already been called on @ref.
> *
> * The return value can optionally be used to synchronize shutdown, when
> * multiple threads could try to destroy an object at the same time - if
> * percpu_ref_kill() returns true, then this thread should release the initial
> * refcount - see percpu_ref_put_initial_ref().

Ooh, I was referring to percpu_ref_dead() not percpu_ref_kill().
percpu_ref_dead() reminds me of some of the work state query functions
in workqueue which ended up being misused in ways that were subtly
racy, so I'm curious why it's necessary and how it's supposed to be
used.

> > > + * What we do is require the user to keep track of the initial refcount; we know
> > > + * the ref can't hit 0 before the user drops the initial ref, so as long as we
> > > + * convert to non percpu mode before the initial ref is dropped everything
> > > + * works.
> >
> > Can you please also explain why per-cpu wrapping is safe somewhere?
>
> I feel like we had this exact discussion before and I came up with some

Yeap, we did.

> sort of explanation but I can't remember what I came up with. Here's
> what I've got now...
>
> * Initially, a percpu refcount is just a set of percpu counters. Initially, we
> * don't try to detect the ref hitting 0 - which means that get/put can just
> * increment or decrement the local counter. Note that the counter on a
> * particular cpu can (and will) wrap - this is fine, when we go to shutdown the
> * percpu counters will all sum to the correct value (because modular arithmetic
> * is commutative).

Can you please expand it on a bit and, more importantly, describe in
what limits, it's safe? This should be safe as long as the actual sum
of refcnts given out doesn't overflow the original type, right? It'd
be great if that is explained clearly in more intuitive way. The only
actual explanation above is "modular arithmetic is commutative" which
is a very compact way to put it and I really think it deserves an
easier explanation.

> > Are we sure this is enough? 1<<31 is a fairly large number but it's
> > just easy enough to breach from time to time and it's gonna be hellish
> > to reproduce / debug when it actually overflows. Maybe we want
> > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> > guarantees that the bias can't over/underflow?
>
> Well, it has the effect of halving the usable range of the refcount,
> which I think is probably ok - the thing is, the range of an atomic_t
> doesn't really correspond to anything useful on 64 bit machines so if
> you're concerned about overflow you probably need to be using an
> atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
> too.

I'm not worrying about the total refcnt overflowing 31 bits, that's
fine. What I'm worried about is the percpu refs having systematic
drift (got on certain cpus and put on others), and the total counter
being overflowed while percpu draining is in progress. To me, the
problem is that the bias which tags that draining in progress can be
overflown by percpu refs. The summing can be the same but the tagging
should be put where summing can't overflow it. It'd be great if you
can explain in the comment in what range it's safe and why, because
that'd make the limits clear to both you and other people reading the
code and would help a lot in deciding whether it's safe enough.

> > Why no /** comment on public functions? It'd be great if you can
> > explicitly warn about the racy nature of the function - especially,
> > the function may return overflowed or zero refcnt. BTW, why is this
> > function necessary? What's the use case?
>
> Module code - I should probably leave count() and tryget() out until the
> module conversion is done. tryget() in particular is just trying to
> match the existing module_tryget() and it could certainly be implemented
> differently.

I probably should have made it clearer. Sorry about that. tryget()
is fine. I was curious about count() as it's always a bit dangerous a
query interface which is racy and can return something unexpected like
false zero or underflowed refcnt.

> > I'm not sure I like this interface. Why does it allow being called
> > multiple times? Why is that necessary? Wouldn't just making it
> > return void and trigger WARN_ON() if it detects that it's being called
> > multiple times better? Also, why not bool if the return value is
> > true/false?
>
> bool just feels a bit strange in the kernel because it's not used that
> much, but yeah bool is correct here.

Well, it's added later on and we're still in the process of converting
to bool. New things are supposed to use it and they do most of the
time, so let's please stick to it.

> Whether it should return bool, or void like you said and WARN_ON() is
> definitely debatable. I don't have a strong opinion on it - I did it
> this way because it's commonly needed functionality and a convenient
> place implement it, but if we see people misusing it in the future I
> would definitely rip it out.

It's superfluous and kinda reminds me of get(ptr) returning ptr, which
people thought would be neat as it allows chaining calls on top of it.
It hides from the compiler the fact that the ptr can't change, and much more
importantly in not so few cases it led people to check the return
value for NULL believing it somehow would take care of the last put
synchronization. I think we still have such bugs lurking around
kobject.

And I think this one also provides ample opportunities for misuses.
It's an interface which kills a refcnt but doesn't require a reference
as it doesn't put one and people would be tempted to write the
following in racy paths.

	if (ref_kill(ref))
		ref_put(ref);

which seems innocent enough, except that it almost invites
use-after-free on the ref_kill() call. Why is it calling a function
which kills the ref if it doesn't hold a ref?

Again, let's *please* stick to the known patterns unless deviation is
explicitly justified. There are very good reasons why people want
justifications when something deviates from the established
conventions / what's necessary for a given interface. It's very easy
to introduce something which is broken in subtle yet fundamental ways
and it's just a fact that the author or reviewers are gonna miss some
eventually.

I think I've repeated this multiple times over the past year but here
it is again - deviation or complexity require justification. We
absolutely shouldn't be doing something unnecessarily unusual and then
try to see what happens. The risk is higher than immediately visible
and totally unnecessary.

> > > +
> > > + synchronize_sched();
> >
> > And this makes the whole function blocking. Why not use call_rcu() so
> > that the ref can be called w/o sleepable context too?
>
> Because you need to know when percpu_ref_kill() finishes so you know
> when it's safe to drop the initial ref - if percpu_ref_kill() used
> call_rcu() itself, it would then have to be doing the put itself...
> which means we'd have to stick a pointer to the release function in
> struct percpu_ref.

Hmmm... okay.

> But this is definitely going to be an issue... I was thinking about
> using the low bit of the pointer to indicate that the ref is dead so
> that the caller could use call_rcu() and then call another function to
> gather up the percpu counters, but that's pretty ugly.
>
> I may just stick the release function in struct percpu_ref and have
> percpu_ref_kill() use call_rcu() after all...

That seems like a better option. It definitely is a hell of a lot more
intuitive.

> > Can we just roll the above into percpu_ref_kill()? It's much harder
> > to misuse if kill puts the base ref.
>
> Possibly... if we did that we'd also be getting rid of percpu_ref_kill's
> synchronization functionality.
>
> I want to wait until after the call_rcu() thing is decided before
> futzing with this part, there are some dependencies.

Let's just have percpu_ref_kill(ref, release) which puts the base ref
and invokes release whenever it's done.

Thanks.

--
tejun

2013-05-15 17:52:53

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Mon, May 13 2013, Kent Overstreet wrote:
> If a bio is associated with a kiocb, allow it to be cancelled.
>
> This is accomplished by adding a pointer to a kiocb in struct bio, and
> when we go to dequeue a request we check if its bio has been cancelled -
> if so, we end the request with -ECANCELED.
>
> We don't currently try to cancel bios if IO has already been started -
> that'd require a per bio callback function, and a way to find all the
> outstanding bios for a given kiocb. Such a mechanism may or may not be
> added in the future but this patch tries to start simple.
>
> Currently this can only be triggered with aio and io_cancel(), but the
> mechanism can be used for sync io too.
>
> It can also be used for bios created by stacking drivers, and bio clones
> in general - when cloning a bio, if the bi_iocb pointer is copied as
> well the clone will then be cancellable. bio_clone() could be modified
> to do this, but hasn't in this patch because all the bio_clone() users
> would need to be audited to make sure that it's safe. We can't blindly
> make e.g. raid5 writes cancellable without the knowledge of the md code.

This is a pretty ugly hack, to be honest. It only works for aio. And it
grows struct bio just for that.

I do like the staged approach, where we just check whether a bio is
canceled when we come across it in the various parts of bio allocate to
completion.

> @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> 			trace_block_rq_issue(q, rq);
> 		}
>
> +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> +			blk_start_request(rq);
> +			__blk_end_request_all(rq, -ECANCELED);
> +			continue;
> +		}

Pretty hacky too, given that it only works for the generic case of a
non-merged bio.

So nack on this one.

--
Jens Axboe

2013-05-15 17:56:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Hey,

On Wed, May 15, 2013 at 02:07:42AM -0700, Kent Overstreet wrote:
> > > +		__this_cpu_dec(*pcpu_count);
> > > +	else
> > > +		ret = atomic_dec_and_test(&ref->count);
> > > +
> > > +	preempt_enable();
> > > +
> > > +	return ret;
> >
> > With likely() added, I think the compiler should be able to recognize
> > that the branch on pcpu_count should exclude later branch in the
> > caller to test for the final put in most cases but I'm a bit worried
> > whether that would always be the case and wonder whether ->release
> > based interface would be better. Another concern is that the above
> > interface is likely to encourage its users to put the release
> > implementation in the same function. e.g.
>
> I... don't follow what you mean here at all - what exactly would the
> compiler do differently? And how would passing a release function
> matter?

So, on the fast path, there should be one branch on the percpu
pointer; however, given the above code, especially without likely(),
the compiler may well choose to emit two branches which are shared by
both hot and cold paths - the first one on the percpu pointer, the
second on whether ref->count reached zero. It just isn't clear to the
compiler whether duplicated preempt_enable() or an extra branch would
be cheaper.

> > void my_put(my_obj)
> > {
> > 	if (!percpu_ref_put(&my_obj->ref))
> > 		return;
> > 	destroy my_obj;
> > 	free my_obj;
> > }
> >
> > Which in turn is likely to nudge the developer or compiler towards not
> > inlining the fast path.
>
> I'm kind of skeptical partial inlining would be worth it for just an
> atomic_dec_and_test()...

Ooh, you can do the slow path inline too but I *suspect* we probably
need a bit more logic in the slowpath anyway if we wanna take care of
the bias overflow and maybe the release callback, and it really
doesn't matter a bit whether you have a call for slowpath, so...

> > So, while I do like the simplicity of put() returning %true on the
> > final put, I suspect it's more likely to slowing down fast paths due
> > to its interface compared to having separate ->release function
> > combined with void put(). Any ideas?
>
> Oh, you mean having one branch instead of two when we're in percpu mode.
> Yeah, that is a good point.

Yeap, heh, I should have read to the end before replying. :)

> I bet with the likely() added the compiler is going to generate the same
> code either way, but I suppose I can have a look at what gcc actually
> does...

Yeah, with likely(), I *think* gcc should get it right most of the
time. There might be some edge cases tho.
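
For comparison, a minimal sketch of the ->release flavour being discussed -
ref->release is an assumption here, the posted patch instead has put()
return whether the ref hit 0:

	static inline void percpu_ref_put(struct percpu_ref *ref)
	{
		unsigned __percpu *pcpu_count;

		preempt_disable();

		pcpu_count = ACCESS_ONCE(ref->pcpu_count);

		if (likely(pcpu_count))
			__this_cpu_dec(*pcpu_count);	/* fast path: one branch */
		else if (atomic_dec_and_test(&ref->count))
			ref->release(ref);	/* slow path owns the final put;
						 * note it runs with preemption off */

		preempt_enable();
	}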

Thanks.

--
tejun

2013-05-15 19:29:44

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15, 2013 at 07:52:43PM +0200, Jens Axboe wrote:
> On Mon, May 13 2013, Kent Overstreet wrote:
> > If a bio is associated with a kiocb, allow it to be cancelled.
> >
> > This is accomplished by adding a pointer to a kiocb in struct bio, and
> > when we go to dequeue a request we check if its bio has been cancelled -
> > if so, we end the request with -ECANCELED.
> >
> > We don't currently try to cancel bios if IO has already been started -
> > that'd require a per bio callback function, and a way to find all the
> > outstanding bios for a given kiocb. Such a mechanism may or may not be
> > added in the future but this patch tries to start simple.
> >
> > Currently this can only be triggered with aio and io_cancel(), but the
> > mechanism can be used for sync io too.
> >
> > It can also be used for bios created by stacking drivers, and bio clones
> > in general - when cloning a bio, if the bi_iocb pointer is copied as
> > well the clone will then be cancellable. bio_clone() could be modified
> > to do this, but hasn't in this patch because all the bio_clone() users
> > would need to be audited to make sure that it's safe. We can't blindly
> > make e.g. raid5 writes cancellable without the knowledge of the md code.
>
> This is a pretty ugly hack, to be honest. It only works for aio. And it
> grows struct bio just for that.

It's only implemented for aio in this patch but it's actually completely
trivial to extend to sync kiocbs too - we can make killing a process
cancel outstanding sync DIOs, I just haven't gotten around to writing
the code. With sync kiocbs anything can use it.

I do hate to grow struct bio, but the aio attribute stuff I'm also
working on is going to need the same damn thing.

> I do like the staged approach, where we just check whether a bio is
> canceled when we come across it in the various parts of bio allocate to
> completion.

Yeah, that's the only sane way to do it imo. If we had to do it with the
ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
the outstanding bios on a list protected by a lock so we could chase
down all the bios we need to cancel, and I don't even want to think
about stacking devices...

This is also trivial to plumb through stacking devices, for ones that
want to support it - md for example probably wouldn't want to support
cancellation for writes (raid consistency) but for reads all it has to
do is copy the kiocb pointer to the new bios it creates.

(I keep having people tell me we're (Google) going to need cancel for
outstanding NCQ/TCQ requests, to which my response has been "LALALALA GO
AWAY I CAN'T HEAR YOU").

> > @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> > 			trace_block_rq_issue(q, rq);
> > 		}
> >
> > +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> > +			blk_start_request(rq);
> > +			__blk_end_request_all(rq, -ECANCELED);
> > +			continue;
> > +		}
>
> Pretty hacky too, given that it only works for the generic case of a
> non-merged bio.

More incomplete than hacky, imo - since with spinning disks you wouldn't
save much by cancelling one bio out of a merged request. It would make
sense to cancel the request if all the bios have been cancelled, but
wanted to start out simple and get something useful with a minimal
amount of code.

Anyways, this patch is still more at the RFC stage but there is serious
demand for cancellation (I've seen what people are using it for, it's
not all crazy and the lack of it is something people are working around
today, painfully).
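
As a hedged sketch of the stacking-driver case described above - a read-side
clone path copying the kiocb pointer so the clone stays cancellable (bi_iocb
is the field this series adds; the function name and the reads-only policy
are made up):

	static struct bio *clone_cancellable_read(struct bio *orig, struct bio_set *bs)
	{
		struct bio *clone = bio_clone_bioset(orig, GFP_NOIO, bs);

		/* reads only: cancelling a cloned write could break e.g. raid
		 * consistency, so a driver would opt in per bio type */
		if (clone && bio_data_dir(orig) == READ)
			clone->bi_iocb = orig->bi_iocb;

		return clone;
	}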

2013-05-15 20:01:35

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15 2013, Kent Overstreet wrote:
> On Wed, May 15, 2013 at 07:52:43PM +0200, Jens Axboe wrote:
> > On Mon, May 13 2013, Kent Overstreet wrote:
> > > If a bio is associated with a kiocb, allow it to be cancelled.
> > >
> > > This is accomplished by adding a pointer to a kiocb in struct bio, and
> > > when we go to dequeue a request we check if its bio has been cancelled -
> > > if so, we end the request with -ECANCELED.
> > >
> > > We don't currently try to cancel bios if IO has already been started -
> > > that'd require a per bio callback function, and a way to find all the
> > > outstanding bios for a given kiocb. Such a mechanism may or may not be
> > > added in the future but this patch tries to start simple.
> > >
> > > Currently this can only be triggered with aio and io_cancel(), but the
> > > mechanism can be used for sync io too.
> > >
> > > It can also be used for bios created by stacking drivers, and bio clones
> > > in general - when cloning a bio, if the bi_iocb pointer is copied as
> > > well the clone will then be cancellable. bio_clone() could be modified
> > > to do this, but hasn't in this patch because all the bio_clone() users
> > > would need to be audited to make sure that it's safe. We can't blindly
> > > make e.g. raid5 writes cancellable without the knowledge of the md code.
> >
> > This is a pretty ugly hack, to be honest. It only works for aio. And it
> > grows struct bio just for that.
>
> It's only implemented for aio in this patch but it's actually completely
> trivial to extend to sync kiocbs too - we can make killing a process
> cancel outstanding sync DIOs, I just haven't gotten around to writing
> the code. With sync kiocbs anything can use it.

Oh, that wasn't even my point. It only works for iocb "backed" bios was
my point. You would ideally like cancel for other areas as well. One
that comes to mind is truncating files, for instance.

> I do hate to grow struct bio, but the aio attribute stuff I'm also
> working on is going to need the same damn thing.

If you (you being aio here) want to support cancel, then why not just
stuff it into bi_private?

> > I do like the staged approach, where we just check whether a bio is
> > canceled when we come across it in the various parts of bio allocate to
> > completion.
>
> Yeah, that's the only sane way to do it imo. If we had to do it with the
> ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
> the outstanding bios on a list protected by a lock so we could chase
> down all the bios we need to cancel, and I don't even want to think
> about stacking devices...

Perfection is the enemy of good. Doing tracking across the full stack is
just going to be insane, just don't do it...

> This is also trivial to plumb through stacking devices, for ones that
> want to support it - md for example probably wouldn't want to support
> cancellation for writes (raid consistency) but for reads all it has to
> do is copy the kiocb pointer to the new bios it creates.
>
> (I keep having people tell me we're (Google) going to need cancel for
> outstanding NCQ/TCQ requests, to which my response has been "LALALALA GO
> AWAY I CAN'T HEAR YOU").

I'm still in that camp, to be honest, even for the generic cases. And
for IO that has gone to the hardware, well, that's really into la-la
land. That is just never going to be something that is supportable,
except perhaps for very confined and controlled setups (like Googles, I
would imagine :-).

> > > @@ -2124,6 +2130,12 @@ struct request *blk_peek_request(struct request_queue *q)
> > > 			trace_block_rq_issue(q, rq);
> > > 		}
> > >
> > > +		if (rq->bio && !rq->bio->bi_next && bio_cancelled(rq->bio)) {
> > > +			blk_start_request(rq);
> > > +			__blk_end_request_all(rq, -ECANCELED);
> > > +			continue;
> > > +		}
> >
> > Pretty hacky too, given that it only works for the generic case of a
> > non-merged bio.
>
> More incomplete than hacky, imo - since with spinning disks you wouldn't
> save much by cancelling one bio out of a merged request. It would make
> sense to cancel the request if all the bios have been cancelled, but
> wanted to start out simple and get something useful with a minimal
> amount of code.
>
> Anyways, this patch is still more at the RFC stage but there is serious
> demand for cancellation (I've seen what people are using it for, it's
> not all crazy and the lack of it is something people are working around
> today, painfully).

I'd be willing to entertain the idea, if the implementation is low
enough overhead and makes sense. So not completely nacking the idea, I'd
just prefer to see something a bit more baked.

--
Jens Axboe

2013-05-15 20:19:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

Kent Overstreet <[email protected]> writes:

> Allocates integers out of a predefined range - for use by e.g. a driver
> to allocate tags for communicating with the device.

Can this really not be merged with idr.c ?

Would an idr per cpu do?

-Andi

--
[email protected] -- Speaking for myself only

2013-05-16 01:07:25

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Kent Overstreet <[email protected]> writes:
> This implements a refcount with similar semantics to
> atomic_get()/atomic_dec_and_test() - but percpu.

Ah! This is why I was CC'd... Now I understand. Thanks :)

Delighted to see someone chasing this. I had an implementation of such
a thing last decade, but the slowmode pattern didn't make for trivial
kref conversions, so I dropped it.

Note: I haven't read the other feedback yet, so ignore if dups.

> +int percpu_ref_init(struct percpu_ref *ref);

Why not just run in slow mode when allocation fails? Things which can't
fail make for simpler use.

> +int percpu_ref_tryget(struct percpu_ref *ref);
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref);

This is part of a slightly different pattern: the owned refcount.

In fact, I think that's the most sane pattern to use (but I could be
wrong; does the AIO stuff fit?). If so, promote this to the first class
citizen, and if necessary expose kill as __percpu_ref_kill()?

(I might suggest percpu_ref_owner_put() as a name, in fact).

> +/**
> + * percpu_ref_get - increment a dynamic percpu refcount
> + *
> + * Analogous to atomic_inc().
> + */
> +static inline void percpu_ref_get(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +
> +	preempt_disable();
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	if (pcpu_count)
> +		__this_cpu_inc(*pcpu_count);
> +	else
> +		atomic_inc(&ref->count);
> +
> +	preempt_enable();
> +}

s/preempt_disable()/rcu_read_lock()/ ?

> +/**
> + * percpu_ref_put - decrement a dynamic percpu refcount
> + *
> + * Returns true if the result is 0, otherwise false; only checks for the ref
> + * hitting 0 after percpu_ref_kill() has been called. Analogous to
> + * atomic_dec_and_test().
> + */
> +static inline int percpu_ref_put(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +	int ret = 0;
> +
> +	preempt_disable();
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	if (pcpu_count)
> +		__this_cpu_dec(*pcpu_count);
> +	else
> +		ret = atomic_dec_and_test(&ref->count);
> +
> +	preempt_enable();
> +
> +	return ret;
> +}

Here too. And if you don't put unlikely() in this code, you lose kernel
hacker points :)

And int/true/false is for old-timers.

> +
> +unsigned percpu_ref_count(struct percpu_ref *ref);
> +int percpu_ref_kill(struct percpu_ref *ref);
> +
> +/**
> + * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
> + *
> + * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
> + */
> +static inline int percpu_ref_dead(struct percpu_ref *ref)
> +{
> +	return ref->pcpu_count == NULL;
> +}

Can you unexpose these? I think percpu_ref_init(), ...get(), ...put()
and ...put_initial() are a nicer API.

> +int percpu_ref_kill(struct percpu_ref *ref)
> +{
> +	unsigned __percpu *pcpu_count;
> +	unsigned __percpu *old;
> +	unsigned count = 0;
> +	int cpu;
> +
> +	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
> +
> +	do {
> +		if (!pcpu_count)
> +			return 0;
> +
> +		old = pcpu_count;
> +		pcpu_count = cmpxchg(&ref->pcpu_count, old, NULL);
> +	} while (pcpu_count != old);

This is more complex than it needs to be, no?


	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	if (!pcpu_count)
		return 0;
	if (cmpxchg(&ref->pcpu_count, pcpu_count, NULL) == NULL)
		return 0;

Of course, if all callers use the owner pattern, this is simply:

	pcpu_count = ACCESS_ONCE(ref->pcpu_count);
	BUG_ON(!pcpu_count);

> +	synchronize_sched();

synchronize_rcu() ?

> +	for_each_possible_cpu(cpu)
> +		count += *per_cpu_ptr(pcpu_count, cpu);
> +
> +	free_percpu(pcpu_count);
> +
> +	pr_debug("global %lli pcpu %i",
> +		 (int64_t) atomic_read(&ref->count), (int) count);
> +
> +	atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);
> +
> +	return 1;
> +}
> +
> +/**
> + * percpu_ref_put_initial_ref - safely drop the initial ref
> + *
> + * A percpu refcount needs a shutdown sequence before dropping the initial ref,
> + * to put it back into single atomic_t mode with the appropriate barriers so
> + * that percpu_ref_put() can safely check for it hitting 0 - this does so.
> + *
> + * Returns true if @ref hit 0.
> + */
> +int percpu_ref_put_initial_ref(struct percpu_ref *ref)
> +{
> +	if (percpu_ref_kill(ref)) {
> +		return percpu_ref_put(ref);
> +	} else {
> +		WARN_ON(1);
> +		return 0;
> +	}
> +}

Note that percpu_ref_restore_initial_ref() is also possible, and may be
useful for the module code... (or percpu_ref_owner_get).

Great stuff!
Rusty.

2013-05-28 23:47:34

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
> Ooh, I was referring to percpu_ref_dead() not percpu_ref_kill().
> percpu_ref_dead() reminds me of some of the work state query functions
> in workqueue which ended up being misused in ways that were subtly
> racy, so I'm curious why it's necessary and how it's supposed to be
> used.

With the other changes we talked about I ended up killing
percpu_ref_dead()

> > * Initially, a percpu refcount is just a set of percpu counters. Initially, we
> > * don't try to detect the ref hitting 0 - which means that get/put can just
> > * increment or decrement the local counter. Note that the counter on a
> > * particular cpu can (and will) wrap - this is fine, when we go to shutdown the
> > * percpu counters will all sum to the correct value (because modular arithmetic
> > * is commutative).
>
> Can you please expand it on a bit and, more importantly, describe in
> what limits, it's safe? This should be safe as long as the actual sum
> of refcnts given out doesn't overflow the original type, right?

Precisely.

> It'd be great if that is explained clearly in more intuitive way. The
> only actual explanation above is "modular arithmetic is commutative"
> which is a very compact way to put it and I really think it deserves
> an easier explanation.

I'm not sure I know of any good way of explaining it intuitively, but
here's this at least...

* (More precisely: because modular arithmetic is commutative the sum of all the
* pcpu_count vars will be equal to what it would have been if all the gets and
* puts were done to a single integer, even if some of the percpu integers
* overflow or underflow).

> > > Are we sure this is enough? 1<<31 is a fairly large number but it's
> > > just easy enough to breach from time to time and it's gonna be hellish
> > > to reproduce / debug when it actually overflows. Maybe we want
> > > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
> > > guarantees that the bias can't over/underflow?
> >
> > Well, it has the effect of halving the usable range of the refcount,
> > which I think is probably ok - the thing is, the range of an atomic_t
> > doesn't really correspond to anything useful on 64 bit machines so if
> > you're concerned about overflow you probably need to be using an
> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
> > too.
>
> I'm not worrying about the total refcnt overflowing 31 bits, that's
> fine. What I'm worried about is the percpu refs having systematic
> drift (got on certain cpus and put on others), and the total counter
> being overflowed while percpu draining is in progress. To me, the
> problem is that the bias which tags that draining in progress can be
> overflown by percpu refs. The summing can be the same but the tagging
> should be put where summing can't overflow it. It'd be great if you
> can explain in the comment in what range it's safe and why, because
> that'd make the limits clear to both you and other people reading the
> code and would help a lot in deciding whether it's safe enough.

(This is why I initially didn't (don't) like the bias method, it makes
things harder to reason about).

The fact that the counter is percpu is irrelevant w.r.t. the bias; we
sum all the percpu counters up before adding them to the atomic counter
and subtracting the bias, so when we go to add the percpu counters it's
no different from if the percpu counter was a single integer all along.

So there's only two counters we're adding together; there's the percpu
counter (just think of it as a single integer) that we started out
using, but then at some point in time we start applying the gets and
puts to the atomic counter.

Note that there's no systemic drift here; at time t all the gets and
puts were happening to one counter, and then at time t+1 they switch to
a different counter.

We know the sum of the counters will be positive (again, because modular
arithmetic is still commutative; when we sum the counters it's as if
there was a single counter all along) but that doesn't mean either of
the individual counters can't be negative.

(Actually, unless I'm mistaken, in this version the percpu counter can
never go negative - it definitely could with dynamic percpu allocation,
as you need an atomic_t -> percpu transition when the atomic_t was > 0
for the percpu counter to go negative; but in this version we start out
using the percpu counters with the atomic_t at 0, ignoring for the moment
the bias and the initial ref.)

So, the sum must be positive but the atomic_t could be negative. How
negative?

We can't do a get() to the percpu counters after we've seen that the ref
is no longer in percpu mode - so after we've done one put to the
atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
we can't do a get to the percpu counter.

And we can't do more puts than there have been gets - because the sum
can't be negative. So the most puts() we can do at any given time is the
real count, or sum of the percpu ref and atomic_t.

Therefore, the amount the atomic_t can go negative is bounded by the
maximum value of the refcount.

So if we say arbitrarily that the maximum legal value of the refcount is
- say - 1U << 31, then the atomic_t will always be greater than
-((int) (1U << 31)).

So as long as the total number of outstanding refs never exceeds the
bias we're fine.

QED.
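
As a worked example of the argument (plain userspace C just to make the
wraparound and the negative atomic_t visible; the numbers are arbitrary and
the bias is left out):

	#include <stdio.h>

	int main(void)
	{
		unsigned cpu0 = 0, cpu1 = 0;	/* two percpu counters */
		int	 atomic = 0;		/* stands in for ref->count */

		cpu0 += 3;	/* three gets happen on cpu0 */
		cpu1 -= 2;	/* two of those refs are put on cpu1; the counter
				 * wraps to UINT_MAX - 1, which is fine */

		/* percpu_ref_kill(): the percpu side is frozen from here on */
		atomic -= 1;	/* the last outstanding ref is dropped - atomic goes
				 * negative, but only by the number of live refs */

		/* kill-time folding: the unsigned sum cpu0 + cpu1 is exactly 1 */
		atomic += (int) (cpu0 + cpu1);

		printf("final count = %d\n", atomic);	/* prints 0 */
		return 0;
	}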

> I probably should have made it clearer. Sorry about that. tryget()
> is fine. I was curious about count() as it's always a bit dangerous a
> query interface which is racy and can return something unexpected like
> false zero or underflowed refcnt.

Yeah, it is, it was intended just for the module code where it's only
used for the value lsmod shows.

> Let's just have percpu_ref_kill(ref, release) which puts the base ref
> and invokes release whenever it's done.

Release has to be stored in struct percpu_ref() so it can be invoked
after a call_rcu() (percpu_ref_kill -> call_rcu() ->
percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
percpu_ref_init(), but yeah.
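
For reference, a rough sketch of that call chain, using the low-bit-of-the-
pointer "dead" flag mentioned earlier in the thread. The field names and
layout are assumptions about where this is heading, not the posted code, and
it assumes get()/put() check the dead bit under rcu_read_lock():

	#define PCPU_REF_DEAD		1UL

	struct percpu_ref {
		atomic_t	count;
		unsigned long	pcpu_count;	/* percpu pointer; low bit = dead */
		void		(*release)(struct percpu_ref *);
		struct rcu_head	rcu;
	};

	static void percpu_ref_kill_rcu(struct rcu_head *rcu)
	{
		struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
		unsigned __percpu *pcpu_count =
			(unsigned __percpu *) (ref->pcpu_count & ~PCPU_REF_DEAD);
		unsigned count = 0;
		int cpu;

		/* a grace period has passed: nobody still sees the ref as percpu */
		for_each_possible_cpu(cpu)
			count += *per_cpu_ptr(pcpu_count, cpu);
		free_percpu(pcpu_count);

		/* fold the percpu sum in and remove the bias */
		atomic_add((int) count - PCPU_COUNT_BIAS, &ref->count);

		if (atomic_dec_and_test(&ref->count))	/* drop the initial ref */
			ref->release(ref);
	}

	void percpu_ref_kill(struct percpu_ref *ref)
	{
		/* the posted code does this with cmpxchg(); plain or-assign
		 * keeps the sketch short */
		ref->pcpu_count |= PCPU_REF_DEAD;

		call_rcu(&ref->rcu, percpu_ref_kill_rcu);
	}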

2013-05-29 01:11:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Yo,

On Tue, May 28, 2013 at 04:47:28PM -0700, Kent Overstreet wrote:
> > It'd be great if that is explained clearly in more intuitive way. The
> > only actual explanation above is "modular arithmetic is commutative"
> > which is a very compact way to put it and I really think it deserves
> > an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
> * (More precisely: because modular arithmetic is commutative the sum of all the
> * pcpu_count vars will be equal to what it would have been if all the gets and
> * puts were done to a single integer, even if some of the percpu integers
> * overflow or underflow).

Yeah, that's much better.

> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.

Ah, okay, I thought you were collecting the percpu counters directly
into the global counter. You're staging it into a temp counter and
then adding it into the global counter after the summing is complete.
Yeap, that should be fine then. It'd be worthwhile to document the
importance of not adding it directly to the global counter.

> > I probably should have made it clearer. Sorry about that. tryget()
> > is fine. I was curious about count() as it's always a bit dangerous a
> > query interface which is racy and can return something unexpected like
> > false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Let's document so then and limit the range returned. We require the
refcnt to be alive and it'd be a good way to both protect from and
deter creative usages.

> > Let's just have percpu_ref_kill(ref, release) which puts the base ref
> > and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Yeah, I'm a bit torn about where to put the release function. For me,
as we have an API which is dedicated to killing a refcnt, it does make
sense to put it there but it's really in the realm of bikeshedding so
choose whatever you wanna choose.

Thanks!

--
tejun

2013-05-29 05:07:49

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

Kent Overstreet <[email protected]> writes:
> On Wed, May 15, 2013 at 10:37:20AM -0700, Tejun Heo wrote:
>> Can you please expand it on a bit and, more importantly, describe in
>> what limits, it's safe? This should be safe as long as the actual sum
>> of refcnts given out doesn't overflow the original type, right?
>
> Precisely.
>
>> It'd be great if that is explained clearly in more intuitive way. The
>> only actual explanation above is "modular arithmetic is commutative"
>> which is a very compact way to put it and I really think it deserves
>> an easier explanation.
>
> I'm not sure I know of any good way of explaining it intuitively, but
> here's this at least...
>
> * (More precisely: because modular arithmetic is commutative the sum of all the
> * pcpu_count vars will be equal to what it would have been if all the gets and
> * puts were done to a single integer, even if some of the percpu integers
> * overflow or underflow).

This seems intuitively obvious, so I wouldn't sweat it too much. What
goes up, has to come down somewhere.

>> > > Are we sure this is enough? 1<<31 is a fairly large number but it's
>> > > just easy enough to breach from time to time and it's gonna be hellish
>> > > to reproduce / debug when it actually overflows. Maybe we want
>> > > atomic64_t w/ 1LLU << 63 bias? Or is there something else which
>> > > guarantees that the bias can't over/underflow?
>> >
>> > Well, it has the effect of halving the usable range of the refcount,
>> > which I think is probably ok - the thing is, the range of an atomic_t
>> > doesn't really correspond to anything useful on 64 bit machines so if
>> > you're concerned about overflow you probably need to be using an
>> > atomic_long_t. That is, if 32 bits is big enough 31 bits probably is
>> > too.
>>
>> I'm not worrying about the total refcnt overflowing 31 bits, that's
>> fine. What I'm worried about is the percpu refs having systematic
>> drift (got on certain cpus and put on others), and the total counter
>> being overflowed while percpu draining is in progress. To me, the
>> problem is that the bias which tags that draining in progress can be
>> overflown by percpu refs. The summing can be the same but the tagging
>> should be put where summing can't overflow it. It'd be great if you
>> can explain in the comment in what range it's safe and why, because
>> that'd make the limits clear to both you and other people reading the
>> code and would help a lot in deciding whether it's safe enough.
>
> (This is why I initially didn't (don't) like the bias method, it makes
> things harder to reason about).
>
> The fact that the counter is percpu is irrelevant w.r.t. the bias; we
> sum all the percpu counters up before adding them to the atomic counter
> and subtracting the bias, so when we go to add the percpu counters it's
> no different from if the percpu counter was a single integer all along.
>
> So there's only two counters we're adding together; there's the percpu
> counter (just think of it as a single integer) that we started out
> using, but then at some point in time we start applying the gets and
> puts to the atomic counter.
>
> Note that there's no systemic drift here; at time t all the gets and
> puts were happening to one counter, and then at time t+1 they switch to
> a different counter.
>
> We know the sum of the counters will be positive (again, because modular
> arithmetic is still commutative; when we sum the counters it's as if
> there was a single counter all along) but that doesn't mean either of
> the individual counters can't be negative.
>
> (Actually, unless I'm mistaken, in this version the percpu counter can
> never go negative - it definitely could with dynamic percpu allocation,
> as you need an atomic_t -> percpu transition when the atomic_t was > 0
> for the percpu counter to go negative; but in this version we start out
> using the percpu counters with the atomic_t at 0, ignoring for the moment
> the bias and the initial ref.)
>
> So, the sum must be positive but the atomic_t could be negative. How
> negative?
>
> We can't do a get() to the percpu counters after we've seen that the ref
> is no longer in percpu mode - so after we've done one put to the
> atomic_t we can do more puts to atomic_t (or gets to the atomic_t) but
> we can't do a get to the percpu counter.
>
> And we can't do more puts than there have been gets - because the sum
> can't be negative. So the most puts() we can do at any given time is the
> real count, or sum of the percpu ref and atomic_t.
>
> Therefore, the amount the atomic_t can go negative is bounded by the
> maximum value of the refcount.
>
> So if we say arbitrarily that the maximum legal value of the refcount is
> - say - 1U << 31, then the atomic_t will always be greater than
> -((int) (1U << 31)).
>
> So as long as the total number of outstanding refs never exceeds the
> bias we're fine.

Yes. We should note the 31 bit limit somewhere. We could WARN_ON() if
count is >= BIAS in percpu_ref_kill(), perhaps.

>> I probably should have made it clearer. Sorry about that. tryget()
>> is fine. I was curious about count() as it's always a bit dangerous a
>> query interface which is racy and can return something unexpected like
>> false zero or underflowed refcnt.
>
> Yeah, it is, it was intended just for the module code where it's only
> used for the value lsmod shows.

Open code it there?

>> Let's just have percpu_ref_kill(ref, release) which puts the base ref
>> and invokes release whenever it's done.
>
> Release has to be stored in struct percpu_ref() so it can be invoked
> after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> percpu_ref_init(), but yeah.

Or hand it to percpu_ref_put(), too, as per kref_put(). I hate indirect
magic.

Cheers,
Rusty.

2013-05-31 20:13:13

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/21] Generic percpu refcounting

On Wed, May 29, 2013 at 02:29:56PM +0930, Rusty Russell wrote:
> Kent Overstreet <[email protected]> writes:
> > I'm not sure I know of any good way of explaining it intuitively, but
> > here's this at least...
> >
> > * (More precisely: because modular arithmetic is commutative the sum of all the
> > * pcpu_count vars will be equal to what it would have been if all the gets and
> > * puts were done to a single integer, even if some of the percpu integers
> > * overflow or underflow).
>
> This seems intuitively obvious, so I wouldn't sweat it too much. What
> goes up, has to come down somewhere.

I agree, but it seems there's a fair amount of disagreement over what's
intuitive :)

> Yes. We should note the 31 bit limit somewhere. We could WARN_ON() if
> count is >= BIAS in percpu_ref_kill(), perhaps.

I'd be hesitant about that - that WARN_ON() would work for this version
(I think) but it'd be incorrect for dynamic percpu refcounting, for
reasons that are almost accidental. And that WARN_ON() isn't going to
fire in anything but the most retarded torture testing.

Besides that, it's hard to imagine a situation where a range of 1 << 32
would be ok but a range of 1 << 31 wouldn't... if we need a WARN_ON()
here we need one for regular atomic_t too, but I don't see either buying
us much.

Also, if/when this is used for something where the range does matter
I'll just switch it to unsigned long (I've been debating doing that now, but
the aio code was using an atomic_t so I don't really care yet).

It should be documented though - I'll do that.

> >> I probably should have made it clearer. Sorry about that. tryget()
> >> is fine. I was curious about count() as it's always a bit dangerous a
> >> query interface which is racy and can return something unexpected like
> >> false zero or underflowed refcnt.
> >
> > Yeah, it is, it was intended just for the module code where it's only
> > used for the value lsmod shows.
>
> Open code it there?

Maybe justified for this, but I'm not a fan of open coding anything that
could be considered library/utility code... better to just document it
with ALL CAPS WARNINGS about being dangerous if used incorrectly.

But we can revisit that if/when the module refcount conversion is done.

> >> Let's just have percpu_ref_kill(ref, release) which puts the base ref
> >> and invokes release whenever it's done.
> >
> > Release has to be stored in struct percpu_ref() so it can be invoked
> > after a call_rcu() (percpu_ref_kill -> call_rcu() ->
> > percpu_ref_kill_rcu() -> percpu_ref_put()) so I'm passing it to
> > percpu_ref_init(), but yeah.
>
> Or hand it to percpu_ref_put(), too, as per kref_put(). I hate indirect
> magic.

The indirect magic is unfortunately necessary because percpu_ref_kill()
has to do a put after a call_rcu().

If the indirect magic wasn't needed I'd prefer to not pass a release
function to anything and just have percpu_ref_put() return bool, but
Tejun disagrees and it's a moot point anyways.

2013-05-31 22:52:23

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 21/21] block: Bio cancellation

On Wed, May 15, 2013 at 10:01:22PM +0200, Jens Axboe wrote:
> On Wed, May 15 2013, Kent Overstreet wrote:
> > It's only implemented for aio in this patch but it's actually completely
> > trivial to extend to sync kiocbs too - we can make killing a process
> > cancel outstanding sync DIOs, I just haven't gotten around to writing
> > the code. With sync kiocbs anything can use it.
>
> Oh, that wasn't even my point. It only works for iocb "backed" bios was
> my point. You would ideally like cancel for other areas as well. One
> that comes to mind is truncating files, for instance.

Sorry, I was unclear - the point was, there's nothing special about
kiocbs - if some random code (truncate related, say) wants to be able to
cancel some bios, it would just stick a kiocb somewhere (on the stack,
or wherever) and point the bios at that - the kiocb would be used for
cancellation and nothing else. All the code has to do is make sure the
kiocb can't be freed until the bios return, naturally.

If we decide struct kiocb is too big/ugly to use it this way we could
easily abstract out a "struct cancel" or something that's smaller,
though since kiocbs are already somewhat generic (see the way sync
kiocbs are used) I don't think it matters that much.

As part of the aio stuff I've been pruning struct kiocb as much as I
can, so this type of usage will make more sense and struct kiocb will be
~70 bytes instead of > 200.

> > I do hate to grow struct bio, but the aio attribute stuff I'm also
> > working on is going to need the same damn thing.
>
> If you (you being aio here) wants to support cancel, then why not just
> stuff it into bi_private?

Core block layer code (i.e. where we check if a bio/request has been
cancelled) can't depend on bi_private pointing to anything in
particular, that'd be a massive change.

_Arguably_ the right thing to do would be to, instead of having a void
bi_private pointer, have a pointer to a "struct bio_state" or somesuch -
and the owner of the bio would then embed struct bio_state into whatever
bi_private currently points to.

But that'd be a pretty massive change and I'm not sure it's the correct
approach.

> > Yeah, that's the only sane way to do it imo. If we had to do it with the
> > ki_cancel callback, since bios -> kiocbs isn't 1:1 we'd have to keep all
> > the outstanding bios on a list protected by a lock so we could chase
> > down all the bios we need to cancel, and I don't even want to think
> > about stacking devices...
>
> Perfection is the enemy of good. Doing tracking across the full stack is
> just going to be insane, just don't do it...

Completely agree, was just explaining how insane it'd be :P

> > > Pretty hacky too, given that it only works for the generic case of a
> > > non-merged bio.
> >
> > More incomplete than hacky, imo - since with spinning disks you wouldn't
> > save much by cancelling one bio out of a merged request. It would make
> > sense to cancel the request if all the bios have been cancelled, but
> > wanted to start out simple and get something useful with a minimal
> > amount of code.
> >
> > Anyways, this patch is still more at the RFC stage but there is serious
> > demand for cancellation (I've seen what people are using it for, it's
> > not all crazy and the lack of it is something people are working around
> > today, painfully).
>
> I'd be willing to entertain the idea, if the implementation is low
> enough overhead and makes sense. So not completely nacking the idea, I'd
> just prefer to see something a bit more baked.

Besides adding a real request_cancelled() function, I'm not sure what
else there is to flesh out at this time - anything else I can think of
adding should IMO wait until there's real use for it.

One thing I wasn't sure about was whether blk_peek_request() was the
right place to check if the request has been cancelled - I don't know
the request queue side of things all that well. Any opinion there?

2013-06-10 23:20:36

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On Wed, May 15, 2013 at 05:41:21PM +0200, Oleg Nesterov wrote:
> On 05/15, Kent Overstreet wrote:
> >
> > On Tue, May 14, 2013 at 03:48:59PM +0200, Oleg Nesterov wrote:
> > > tag_free() does
> > >
> > > list_del_init(wait->list);
> > > /* WINDOW */
> > > wake_up_process(wait->task);
> > >
> > > in theory the caller of tag_alloc() can notice list_empty_careful(),
> > > return without taking pool->lock, exit, and free this task_struct.
> > >
> > > But the main problem is that it is not clear why this code reimplements
> > > add_wait_queue/wake_up_all, for what?
> >
> > To save on locking... there's really no point in another lock for the
> > wait queue. Could just use the wait queue lock instead I suppose, like
> > wait_event_interruptible_locked()
>
> Yes. Or perhaps you can reuse wait_queue_head_t->lock for move_tags().
>
> And,
>
> > (the extra spin_lock()/unlock() might not really cost anything but
> > nested irqsave()/restore() is ridiculously expensive, IME).
>
> But this is the slow path anyway. Even if you do not use _locked, how
> much can this extra locking (save/restore) make things worse?
>
> In any case, I believe it would be much better to reuse the code we
> already have, to avoid the races and make the code more understandable.
> And to not bloat the code.
>
> Do you really think that, say,
>
> unsigned tag_alloc(struct tag_pool *pool, bool wait)
> {
> struct tag_cpu_freelist *tags;
> unsigned ret = 0;
> retry:
> tags = get_cpu_ptr(pool->tag_cpu);
> local_irq_disable();
> if (!tags->nr_free && pool->nr_free) {
> spin_lock(&pool->wq.lock);
> if (pool->nr_free)
> move_tags(...);
> spin_unlock(&pool->wq.lock);
> }
>
> if (tags->nr_free)
> ret = tags->free[--tags->nr_free];
> local_irq_enable();
> put_cpu_var(pool->tag_cpu);
>
> if (ret || !wait)
> return ret;
>
> __wait_event(&pool->wq, pool->nr_free);
> goto retry;
> }
>
> will be much slower?

The overhead from doing nested irqsave/restore() sucks. I've had it bite
me hard with the recent aio work. But screw it, it's not going to matter
that much here.
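
(To be concrete about what I mean by "nested" - the pattern that bit me
looks roughly like this, purely as an illustration, not code from this
series:)

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);

	static void nested_irqsave_example(void)
	{
		unsigned long outer, inner;

		local_irq_save(outer);

		/*
		 * Interrupts are already off, but spin_lock_irqsave()
		 * still pays for a second flags save/restore - that's
		 * what gets expensive in a hot path.
		 */
		spin_lock_irqsave(&example_lock, inner);
		/* ... */
		spin_unlock_irqrestore(&example_lock, inner);

		local_irq_restore(outer);
	}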

>
> > > I must admit, I do not understand what this code actually does ;)
> > > I didn't try to read it carefully though, but perhaps at least the
> > > changelog could explain more?
> >
> > The changelog is admittedly terse, but that's basically all there is to
> > it -
> > [...snip...]
>
> Yes, thanks for your explanation, I already realized what it does...
>
> Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
> Perhaps it should also take waitqueue_active() into account?
> tag_alloc() can sleep more than necessary, it seems.

No.

By "sleeping more than necessary" you mean sleeping when there's tags
available on other percpu freelists.

That's just unavoidable if the thing's to be percpu - efficient use of
available tags requires global knowledge. Sleeping less would require
more global cacheline contention, and would defeat the purpose of this
code.

So what we do is _bound_ that inefficiency - we cap the size of the
percpu freelists so that no more than half of the available tags can be
stuck on all the percpu freelists.

This means that, from the POV of work executing on one cpu, it will
always be able to use up to half the total tags (assuming they aren't
actually allocated).

So when you're deciding how many tag structs to allocate, you just
double the number you'd allocate otherwise when you're using this code.
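
The freeing side is what enforces that bound - roughly like this, where
the field names and move_tags() signature only loosely follow the
patch, so treat it as a sketch of the logic rather than the actual
code:

	/*
	 * Once a percpu freelist reaches watermark * 2, half of it is
	 * moved back to the global freelist and waiters are woken.
	 */
	static void tag_free(struct tag_pool *pool, unsigned tag)
	{
		struct tag_cpu_freelist *tags;
		unsigned long flags;

		local_irq_save(flags);
		tags = this_cpu_ptr(pool->tag_cpu);

		tags->free[tags->nr_free++] = tag;

		if (tags->nr_free == pool->watermark * 2) {
			spin_lock(&pool->wq.lock);

			/* return watermark tags to the global freelist */
			move_tags(pool->free, &pool->nr_free,
				  tags->free, &tags->nr_free,
				  pool->watermark);

			wake_up_locked(&pool->wq);
			spin_unlock(&pool->wq.lock);
		}

		local_irq_restore(flags);
	}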

2013-06-11 17:46:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 17/21] Percpu tag allocator

On 06/10, Kent Overstreet wrote:
>
> On Wed, May 15, 2013 at 05:41:21PM +0200, Oleg Nesterov wrote:
> >
> > Do you really think that, say,
> >
> > unsigned tag_alloc(struct tag_pool *pool, bool wait)
> > {
> > struct tag_cpu_freelist *tags;
> > unsigned ret = 0;
> > retry:
> > tags = get_cpu_ptr(pool->tag_cpu);
> > local_irq_disable();
> > if (!tags->nr_free && pool->nr_free) {
> > spin_lock(&pool->wq.lock);
> > if (pool->nr_free)
> > move_tags(...);
> > spin_unlock(&pool->wq.lock);
> > }
> >
> > if (tags->nr_free)
> > ret = tags->free[--tags->nr_free];
> > local_irq_enable();
> > put_cpu_var(pool->tag_cpu);
> >
> > if (ret || !wait)
> > return ret;
> >
> > __wait_event(&pool->wq, pool->nr_free);
> > goto retry;
> > }
> >
> > will be much slower?
>
> The overhead from doing nested irqsave/restore() sucks. I've had it bite
> me hard with the recent aio work.

Not sure I understand... Only __wait_event() does irqsave/restore and
we are going to sleep anyway.

> But screw it, it's not going to matter
> that much here.

Yes.

And, imho, even if we need some optimizations here, it would be better
to make a separate patch backed by numbers, or at least a detailed
explanation.

> > Question. tag_free() does move_tags+wakeup if nr_free = pool->watermark * 2.
> > Perhaps it should also take waitqueue_active() into account?
> > tag_alloc() can sleep more than necessary, it seems.
>
> No.
>
> By "sleeping more than necessary" you mean sleeping when there's tags
> available on other percpu freelists.

Yes,

> That's just unavoidable if the thing's to be percpu - efficient use of
> available tags requires global knowledge. Sleeping less would require
> more global cacheline contention, and would defeat the purpose of this
> code.

Yes, yes, I understand, there is a tradeoff. It's just still not clear
to me what would be better "in practice"... So,

> So when you're deciding how many tag structs to allocate, you just
> double the number you'd allocate otherwise when you're using this code.

I am not sure this is really needed.

But OK, I see your point, thanks.

Oleg.