Date: Mon, 18 Mar 2013 17:32:37 +0000
From: Eric Wong <normalperson@yhbt.net>
To: Mathieu Desnoyers
Cc: linux-kernel@vger.kernel.org, Davide Libenzi, Al Viro, Andrew Morton,
    linux-fsdevel@vger.kernel.org
Subject: Re: [RFC v2] epoll: avoid spinlock contention with wfcqueue
Message-ID: <20130318173237.GB8798@dcvr.yhbt.net>
In-Reply-To: <20130318130649.GA15947@Krystal>
References: <20130314044215.GA27312@dcvr.yhbt.net>
 <20130318110722.GA8798@dcvr.yhbt.net>
 <20130318130649.GA15947@Krystal>

Mathieu Desnoyers wrote:
> * Eric Wong (normalperson@yhbt.net) wrote:
> > Eric Wong wrote:
> > > I'm posting this lightly tested version since I may not be able to do
> > > more testing/benchmarking until the weekend.
> >
> > Still lightly tested (on an initramfs KVM, no real applications, yet).
> >
> > > Davide's totalmess is still running, so that's probably a good sign :)
> > > http://www.xmailserver.org/totalmess.c
> >
> > Ditto :)  Also testing with eponeshotmt, which is close to my target
> > use case: http://yhbt.net/eponeshotmt.c
> >
> > > I will look for more ways to break this (and benchmark when I stop
> > > finding ways to break it).  No real applications tested, yet, and
> > > I think I can improve upon this, too.
> >
> > No real apps, yet, and I need to make sure this doesn't cause
> > regressions for the traditional single-threaded event loop case.
> >
> > This is the use case I mainly care about (multiple tasks calling
> > epoll_wait(maxevents=1) to divide work).
> >
> > Time to wait on 4 million events (4 threads generating events,
> > 4 threads calling epoll_wait(maxevents=1) 1 million times each,
> > 10 eventfd file descriptors; fewer descriptors means a higher
> > chance of contention for epi->state inside ep_poll_callback):
> >
> > Before:
> > $ eponeshotmt -t 4 -w 4 -f 10 -c 1000000
> > real    0m 9.58s
> > user    0m 1.22s
> > sys     0m 37.08s
> >
> > After:
> > $ eponeshotmt -t 4 -w 4 -f 10 -c 1000000
> > real    0m 6.49s
> > user    0m 1.28s
> > sys     0m 24.66s
>
> Nice! This looks like a 31% speedup with 4 cores.  It would be nice to
> see how this evolves when the number of cores and threads increase.  I
> also notice that you turned the spinlock_irqsave into a mutex.  Maybe a
> comparison with a simple spinlock (rather than the mutex) will lead to
> interesting findings.  (Note that this spinlock will likely not need to
> have IRQs off, as enqueue won't need to hold the spinlock.)

Unfortunately, 4 cores is all I have right now.  I'm hoping others can
help test with more cores.

I added the mutex lock to ep_poll since it's now required for
ep_events_available.  Another upside is that ep_events_available is
always coherent with the ep_send_events loop, so there's no chance of a
task entering ep_send_events on an empty ready list.

I was planning on making the mutex cover a wider scope for ep_poll
before I discovered wfcqueue.  I noticed ep->lock was very contended
(and always dominating lock_stat).
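For concreteness, the workload shape being benchmarked above (several
waiter threads sharing one epoll descriptor, each calling epoll_wait()
with maxevents=1 on a pool of eventfds) can be sketched as a small,
self-contained userspace program.  This is only an illustration, not
eponeshotmt itself; the thread, descriptor and event counts are
arbitrary values picked for the example:

	/*
	 * Illustration only (not eponeshotmt): several waiter threads share
	 * one epoll fd and each call epoll_wait() with maxevents=1 so that
	 * ready events are divided between them one at a time.
	 */
	#include <sys/epoll.h>
	#include <sys/eventfd.h>
	#include <pthread.h>
	#include <unistd.h>
	#include <stdint.h>
	#include <stdio.h>

	#define NWAITERS 4
	#define NFDS     10
	#define TOTAL    400000		/* total events, divisible by NWAITERS */

	static int epfd;

	static void *waiter(void *arg)
	{
		struct epoll_event ev;
		uint64_t val;
		long consumed = 0;

		(void)arg;
		while (consumed < TOTAL / NWAITERS) {
			/* grab at most one ready descriptor per syscall */
			if (epoll_wait(epfd, &ev, 1, -1) != 1)
				continue;
			/* EFD_SEMAPHORE: each successful read consumes one event */
			if (read(ev.data.fd, &val, sizeof(val)) == sizeof(val))
				consumed++;
			/* EAGAIN means another waiter won the race; retry */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t thr[NWAITERS];
		struct epoll_event ev = { .events = EPOLLIN };
		int efds[NFDS];
		uint64_t one = 1;
		int i;

		epfd = epoll_create1(0);
		for (i = 0; i < NFDS; i++) {
			efds[i] = eventfd(0, EFD_NONBLOCK | EFD_SEMAPHORE);
			ev.data.fd = efds[i];
			epoll_ctl(epfd, EPOLL_CTL_ADD, efds[i], &ev);
		}
		for (i = 0; i < NWAITERS; i++)
			pthread_create(&thr[i], NULL, waiter, NULL);

		/* generate events; the real benchmark uses separate writer threads */
		for (i = 0; i < TOTAL; i++)
			write(efds[i % NFDS], &one, sizeof(one));

		for (i = 0; i < NWAITERS; i++)
			pthread_join(thr[i], NULL);
		puts("done");
		return 0;
	}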
Previously, with ep_poll + ep_scan_ready_list + ep_send_events_proc, it
was something like this, where the spinlock was taken three times in
quick succession:

	ep_poll:
		spin_lock; check ep_events_available; spin_unlock
		ep_send_events:
			ep_scan_ready_list:
				mutex_lock
					spin_lock ... spin_unlock
					ep_send_events_proc
					spin_lock ... spin_unlock
				mutex_unlock

ep->lock was getting bounced all over since ep_poll_callback (which also
takes ep->lock) was constantly firing.  This is made worse when several
threads are calling ep_poll.  The exclusive waitqueue-ing of ep_poll
doesn't help much because of the sheer number of ep_poll_callback
wakeups.

> Some comments below,
>
> [...]
>
> > +static int ep_send_events(struct eventpoll *ep,
> > +			   struct epoll_event __user *uevent, int maxevents)
> > +{
> > +	int eventcnt = 0;
> >  	unsigned int revents;
> >  	struct epitem *epi;
> > -	struct epoll_event __user *uevent;
> >  	struct wakeup_source *ws;
> > +	struct wfcq_node *node, *n;
> > +	enum epoll_item_state state;
> >  	poll_table pt;
> >
> >  	init_poll_funcptr(&pt, NULL);
> >
> >  	/*
> > -	 * We can loop without lock because we are passed a task private list.
> > -	 * Items cannot vanish during the loop because ep_scan_ready_list() is
> > -	 * holding "mtx" during this call.
> > +	 * Items cannot vanish during the loop because we are holding
> > +	 * "mtx" during this call.
> >  	 */
> > -	for (eventcnt = 0, uevent = esed->events;
> > -	     !list_empty(head) && eventcnt < esed->maxevents;) {
> > -		epi = list_first_entry(head, struct epitem, rdllink);
> > +	ep_ready_prepare(ep);
> > +	__wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
>
> Even though it should technically work, I don't understand why you do
> this:
>
>	eventcnt = 0;
>
>	__wfcq_for_each_safe(&ep->txlhead, &ep->txltail, node, n) {
>		...
>		if (++eventcnt == maxevents)
>			n = NULL;	/* stop iteration */
>		__wfcq_dequeue(&ep->txlhead, &ep->txltail);
>	}
>
> Rather than this:
>
>	eventcnt = 0;
>
>	for (;;) {
>		if (eventcnt++ >= maxevents)
>			break;
>		node = __wfcq_dequeue(&ep->txlhead, &ep->txltail);
>		if (!node)
>			break;
>		...
>	}
>
> The second way to express your queue would save a couple of useless ops,
> and would ensure your item is kicked out of the queue as soon as it is
> processed.

I delay the dequeue, and I don't dequeue at all if put_user fails.  I
should probably add a comment where I removed the list_add() to that
effect.  (A simplified standalone sketch of this delayed-dequeue pattern
is appended at the end of this message.)

On the other hand, preserving event ordering when users trigger -EFAULT
might not be worth it...

> I'm also not entirely sure why you need to add enum epoll_item_state
> along with expensive atomic ops to compute the state.  Wouldn't it be
> enough to know in which queue the nodes are located?  If need be, you
> could add new queues, e.g. one per state.  So instead of marking states,
> you would simply re-enqueue the nodes into per-state queues.  This would
> simplify zombie management and save a couple of brains in the process. ;-)

Is there a quick way to know which queue a node is located in?
ep_enqueue may fire from several places at once (ep_poll_callback,
ep_insert/ep_modify), so I think guarding it with something (currently
ep_mark_ready) is required.  We used to use ep->lock to protect all the
"if (!ep_is_linked) list_add_tail" calls, too.

As for zombies, they can appear in the middle of a queue, so I couldn't
pluck them out of either txl/rdl in O(1).

Thanks for all your help and comments.
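The delayed-dequeue idea mentioned above can be shown with a minimal
standalone sketch: an ordinary singly-linked queue and a fallible
deliver() step stand in for the wfcqueue and put_user (none of this is
the actual patch; the names are made up for illustration).  The node is
only removed from the queue after the copy-out succeeds, so a failure
leaves the event at the head and ordering is preserved:

	#include <stddef.h>
	#include <stdio.h>

	struct node {
		int event;
		struct node *next;
	};

	struct queue {
		struct node *head;
		struct node **tail;
	};

	/* stand-in for put_user(): delivery of one event can fail */
	static int deliver(int event)
	{
		return event == 3 ? -1 : 0;	/* simulate -EFAULT on one event */
	}

	/*
	 * Drain up to maxevents entries.  The current node is dequeued only
	 * after deliver() has succeeded, so a failed delivery leaves it at
	 * the head of the queue and the original ordering is preserved.
	 */
	static int send_events(struct queue *q, int maxevents)
	{
		int eventcnt = 0;

		while (eventcnt < maxevents && q->head) {
			struct node *node = q->head;	/* peek, don't dequeue yet */

			if (deliver(node->event))
				break;			/* leave node queued */

			/* delivery succeeded: now it is safe to dequeue */
			q->head = node->next;
			if (!q->head)
				q->tail = &q->head;
			eventcnt++;
		}
		return eventcnt;
	}

	int main(void)
	{
		struct node nodes[5];
		struct queue q = { NULL, &q.head };
		int i;

		for (i = 0; i < 5; i++) {
			nodes[i].event = i;
			nodes[i].next = NULL;
			*q.tail = &nodes[i];
			q.tail = &nodes[i].next;
		}

		printf("sent %d events\n", send_events(&q, 5));	/* stops at event 3 */
		printf("next event still queued: %d\n", q.head->event);
		return 0;
	}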