Date: 29 Aug 2015 17:08:16 -0400
Message-ID: <20150829210816.16551.qmail@ns.horizon.com>
From: "George Spelvin" <linux@horizon.com>
To: oleg@redhat.com
Subject: Re: [PATCH] task_work: remove fifo ordering guarantee
Cc: eric.dumazet@gmail.com, linux@horizon.com, linux-kernel@vger.kernel.org,
        mingo@kernel.org
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3212
Lines: 71

Oleg Nesterov wrote:
> On 08/29, Ingo Molnar wrote:
>> So I'm wondering, is there any strong reason why we couldn't use a double
>> linked list and still do FIFO and remove that silly linear list walking hack?
> 
> This will obviously enlarge callback_head, and it is often embedded.
> But this is minor.
> 
> If we use a double linked list we can't do task_work_add() lockless.
> So we will need another spinlock_t in task_struct. We can't use pi_lock.

You only need a singly linked list for FIFO, but indeed lockless
access is a pain.

For a LIFO stack, you just do a single compare-and-swap on the head.
Once an entry is in the list, it's immutable.

For FIFO, you only need one pointer in the nodes, but two in the list
head: a head pointer and a tail pointer.

The problem for lockless access is that you have to update both the next
pointer and the tail pointer, and without very specialized instructions
like 680x0's CAS2, there's no way to do them both atomically.

So the procedure to append (write) to the list is:
- Set your newly added node's next pointer to NULL.
- CAS the tail pointer.  This establishes your place in line,
  but does not make it visible to the reader.
- At this point, the next pointer is not yours any more and may be
  written by someone else, but the rest of the node may still be
  finished, if you like.
- But also, there's a sort of priority inversion problem.  If a writer
  stalls here, no following writer is visible to the reader.
- Then you update the previous tail (obtained from the CAS) to link via its
  next pointer to your newly added node.  This makes it visible to the reader.

- The reader, meanwhile, totally ignores the tail pointer, and
  only uses the head and next pointers.

Now, one trick with the above is that the tail pointer must *only*
be accessed by writers.  The single-threaded technique of representing
an empty queue by using a "struct **" tail pointer and having tail = &head
to represent an empty list is not okay.

Rather, the reader must never deallocate a node with a NULL next pointer.

If the nodes are large, you can minimize the overhead of this by having a
dedicated sentinel/dummy node which lives in the queue and gets moved to
the tail whenever it reaches the head *and* has a non-NULL next pointer.
But the reader may not depend on this node always being visible!

The reader might find the sentinel pointing to node A, and then move it to
the tail by doing a CAS on the tail pointer, but the tail pointer
by then points to B.  It's possible that A's next pointer is still NULL
and the writer adding B has not gotten to the second step yet.


Anyway, it can be done, but honestly the current "empty the stack in
batches and then reverse it" is simpler.

The cond_resched is ugly, but the overhead is minimal, and the FIFO
guarantee is convenient.


By the way, it's quite possible (LIFO or FIFO) for the writer to batch up
queue adds and only do one CAS to add an arbitrary number.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/