Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
include/linux/freelist.h | 129 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 129 insertions(+)
--- /dev/null
+++ b/include/linux/freelist.h
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+#ifndef FREELIST_H
+#define FREELIST_H
+
+#include <linux/atomic.h>
+
+/*
+ * Copyright: [email protected]
+ *
+ * A simple CAS-based lock-free free list. Not the fastest thing in the world
+ * under heavy contention, but simple and correct (assuming nodes are never
+ * freed until after the free list is destroyed), and fairly speedy under low
+ * contention.
+ *
+ * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
+ */
+
+struct freelist_node {
+ atomic_t refs;
+ struct freelist_node *next;
+};
+
+struct freelist_head {
+ struct freelist_node *head;
+};
+
+#define REFS_ON_FREELIST 0x80000000
+#define REFS_MASK 0x7FFFFFFF
+
+static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
+{
+ /*
+ * Since the refcount is zero, and nobody can increase it once it's
+ * zero (except us, and we run only one copy of this method per node at
+ * a time, i.e. the single thread case), then we know we can safely
+ * change the next pointer of the node; however, once the refcount is
+ * back above zero, then other threads could increase it (happens under
+ * heavy contention, when the refcount goes to zero in between a load
+ * and a refcount increment of a node in try_get, then back up to
+ * something non-zero, then the refcount increment is done by the other
+ * thread) -- so if the CAS to add the node to the actual list fails,
+ * decrease the refcount and leave the add operation to the next thread
+ * who puts the refcount back to zero (which could be us, hence the
+ * loop).
+ */
+ struct freelist_node *head = READ_ONCE(list->head);
+
+ for (;;) {
+ WRITE_ONCE(node->next, head);
+ atomic_set_release(&node->refs, 1);
+
+ if (!try_cmpxchg_release(&list->head, &head, node)) {
+ /*
+ * Hmm, the add failed, but we can only try again when
+ * the refcount goes back to zero.
+ */
+ if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
+ continue;
+ }
+ return;
+ }
+}
+
+static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
+{
+ /*
+ * We know that the should-be-on-freelist bit is 0 at this point, so
+ * it's safe to set it using a fetch_add.
+ */
+ if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
+ /*
+ * Oh look! We were the last ones referencing this node, and we
+ * know we want to add it to the free list, so let's do it!
+ */
+ __freelist_add(node, list);
+ }
+}
+
+static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
+{
+ struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
+ unsigned int refs;
+
+ while (head) {
+ prev = head;
+ refs = atomic_read(&head->refs);
+ if ((refs & REFS_MASK) == 0 ||
+ !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
+ head = smp_load_acquire(&list->head);
+ continue;
+ }
+
+ /*
+ * Good, reference count has been incremented (it wasn't at
+ * zero), which means we can read the next and not worry about
+ * it changing between now and the time we do the CAS.
+ */
+ next = READ_ONCE(head->next);
+ if (try_cmpxchg_acquire(&list->head, &head, next)) {
+ /*
+ * Yay, got the node. This means it was on the list,
+ * which means should-be-on-freelist must be false no
+ * matter the refcount (because nobody else knows it's
+ * been taken off yet, it can't have been put back on).
+ */
+ WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
+
+ /*
+ * Decrease refcount twice, once for our ref, and once
+ * for the list's ref.
+ */
+ atomic_fetch_add(-2, &head->refs);
+
+ return head;
+ }
+
+ /*
+ * OK, the head must have changed on us, but we still need to decrement
+ * the refcount we increased.
+ */
+ refs = atomic_fetch_add(-1, &prev->refs);
+ if (refs == REFS_ON_FREELIST + 1)
+ __freelist_add(prev, list);
+ }
+
+ return NULL;
+}
+
+#endif /* FREELIST_H */
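For illustration, a minimal usage sketch of the API above (not part of the
patch; struct my_object, my_pool and the helpers are hypothetical, and nodes
are assumed to start out zero-initialized, i.e. refs == 0 and next == NULL):

#include <linux/freelist.h>
#include <linux/kernel.h>	/* container_of() */

struct my_object {
	struct freelist_node	freelist;
	/* payload ... */
};

static struct freelist_head my_pool;	/* NULL head: empty list */

/* Return an object to the pool. */
static void my_put(struct my_object *obj)
{
	freelist_add(&obj->freelist, &my_pool);
}

/* Take an object from the pool, or NULL if it is empty. */
static struct my_object *my_get(void)
{
	struct freelist_node *node = freelist_try_get(&my_pool);

	return node ? container_of(node, struct my_object, freelist) : NULL;
}

Per the comment at the top of the file, the objects themselves must only be
freed once the pool as a whole is no longer in use.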
On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote:
> +struct freelist_node {
> + atomic_t refs;
> + struct freelist_node *next;
> +};
Bah, the next patch relies on this structure to be ordered the other
way around.
Clearly writing code and listening to talks isn't ideal :-)
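That is, the next patch presumably needs the two fields swapped, along the
lines of:

struct freelist_node {
	struct freelist_node	*next;
	atomic_t		refs;
};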
On Thu, Aug 27, 2020 at 12:49:20PM -0400, Cameron wrote:
> For what it's worth, the freelist.h code seems to be a faithful adaptation
> of my original blog post code. Didn't think it would end up in the Linux
> kernel one day :-)
Hehe, I ran into the traditional ABA problem for the lockless stack and
asked Google; your blog came up.
I'll try and actually think about it a little when the current (virtual)
conference is over.
Are you OK with the license I put on it, GPLv2 or BSD-2?
> I'm just wondering if the assumption that "nodes are never freed until
> after the free list is destroyed" will hold true in the intended use case?
It does, the nodes are only deleted once the whole freelist object is
discarded.
On Thu, Aug 27, 2020 at 12:56 PM <[email protected]> wrote:
> Are you OK with the license I put on it, GPLv2 or BSD-2?
Yes, both GPLv2 and BSD-2 (my personal favourite) are fine.
On Thu, Aug 27, 2020 at 06:12:43PM +0200, Peter Zijlstra wrote:
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> include/linux/freelist.h | 129 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 129 insertions(+)
>
> --- /dev/null
> +++ b/include/linux/freelist.h
> @@ -0,0 +1,129 @@
> +// SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
> +#ifndef FREELIST_H
> +#define FREELIST_H
> +
> +#include <linux/atomic.h>
> +
> +/*
> + * Copyright: [email protected]
> + *
> + * A simple CAS-based lock-free free list. Not the fastest thing in the world
> + * under heavy contention, but simple and correct (assuming nodes are never
> + * freed until after the free list is destroyed), and fairly speedy under low
> + * contention.
> + *
> + * Adapted from: https://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists
> + */
> +
> +struct freelist_node {
> + atomic_t refs;
> + struct freelist_node *next;
> +};
> +
> +struct freelist_head {
> + struct freelist_node *head;
> +};
> +
> +#define REFS_ON_FREELIST 0x80000000
> +#define REFS_MASK 0x7FFFFFFF
> +
> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> + /*
> + * Since the refcount is zero, and nobody can increase it once it's
> + * zero (except us, and we run only one copy of this method per node at
> + * a time, i.e. the single thread case), then we know we can safely
> + * change the next pointer of the node; however, once the refcount is
> + * back above zero, then other threads could increase it (happens under
> + * heavy contention, when the refcount goes to zero in between a load
> + * and a refcount increment of a node in try_get, then back up to
> + * something non-zero, then the refcount increment is done by the other
> + * thread) -- so if the CAS to add the node to the actual list fails,
> + * decrease the refcount and leave the add operation to the next thread
> + * who puts the refcount back to zero (which could be us, hence the
> + * loop).
> + */
> + struct freelist_node *head = READ_ONCE(list->head);
> +
> + for (;;) {
> + WRITE_ONCE(node->next, head);
> + atomic_set_release(&node->refs, 1);
> +
> + if (!try_cmpxchg_release(&list->head, &head, node)) {
> + /*
> + * Hmm, the add failed, but we can only try again when
> + * the refcount goes back to zero.
> + */
> + if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> + continue;
> + }
> + return;
> + }
> +}
> +
> +static inline void freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> + /*
> + * We know that the should-be-on-freelist bit is 0 at this point, so
> + * it's safe to set it using a fetch_add.
> + */
> + if (!atomic_fetch_add_release(REFS_ON_FREELIST, &node->refs)) {
> + /*
> + * Oh look! We were the last ones referencing this node, and we
> + * know we want to add it to the free list, so let's do it!
> + */
> + __freelist_add(node, list);
> + }
> +}
> +
> +static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
> +{
> + struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
> + unsigned int refs;
> +
> + while (head) {
> + prev = head;
> + refs = atomic_read(&head->refs);
> + if ((refs & REFS_MASK) == 0 ||
> + !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
> + head = smp_load_acquire(&list->head);
> + continue;
> + }
> +
> + /*
> + * Good, reference count has been incremented (it wasn't at
> + * zero), which means we can read the next and not worry about
> + * it changing between now and the time we do the CAS.
> + */
> + next = READ_ONCE(head->next);
> + if (try_cmpxchg_acquire(&list->head, &head, next)) {
So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
read of the new list->head, right? Then probably a
smp_mb__after_atomic() is needed in that case?
Regards,
Boqun
> + /*
> + * Yay, got the node. This means it was on the list,
> + * which means should-be-on-freelist must be false no
> + * matter the refcount (because nobody else knows it's
> + * been taken off yet, it can't have been put back on).
> + */
> + WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> +
> + /*
> + * Decrease refcount twice, once for our ref, and once
> + * for the list's ref.
> + */
> + atomic_fetch_add(-2, &head->refs);
> +
> + return head;
> + }
> +
> + /*
> + * OK, the head must have changed on us, but we still need to decrement
> + * the refcount we increased.
> + */
> + refs = atomic_fetch_add(-1, &prev->refs);
> + if (refs == REFS_ON_FREELIST + 1)
> + __freelist_add(prev, list);
> + }
> +
> + return NULL;
> +}
> +
> +#endif /* FREELIST_H */
On Thu, Aug 27, 2020 at 3:08 PM Boqun Feng <[email protected]> wrote:
> So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
> read of the new list->head, right? Then probably a
> smp_mb__after_atomic() is needed in that case?
Yes, there needs to be an acquire on the head after a failed cmpxchg;
does the atomic_fetch_add following that not have acquire semantics?
Cameron
On Thu, Aug 27, 2020 at 03:57:22PM -0400, Cameron wrote:
> On Thu, Aug 27, 2020 at 3:08 PM Boqun Feng <[email protected]> wrote:
> > So if try_cmpxchg_acquire() fails, we don't have ACQUIRE semantics on
> > read of the new list->head, right? Then probably a
> > smp_mb__after_atomic() is needed in that case?
>
> Yes, there needs to be an acquire on the head after a failed cmpxchg;
> does the atomic_fetch_add following that not have acquire semantics?
>
Yes, you're right, the atomic_fetch_add() following it is a fully-ordered
atomic, so it can provide the necessary ACQUIRE semantics. I was missing
that. Maybe a few words explaining this would help.
Regards,
Boqun
> Cameron
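For example, the existing comment at the decrement could be extended along
these lines (wording is only a suggestion):

	/*
	 * OK, the head must have changed on us, but we still need to decrement
	 * the refcount we increased.
	 *
	 * Note: a failed try_cmpxchg_acquire() does not order the reload of
	 * list->head into @head, but the fully ordered atomic_fetch_add()
	 * below does, before @head is dereferenced on the next loop
	 * iteration.
	 */
	refs = atomic_fetch_add(-1, &prev->refs);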
On Fri, Aug 28, 2020 at 12:23 AM Peter Zijlstra <[email protected]> wrote:
> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> + /*
> + * Since the refcount is zero, and nobody can increase it once it's
> + * zero (except us, and we run only one copy of this method per node at
> + * a time, i.e. the single thread case), then we know we can safely
> +
> + /*
> + * OK, the head must have changed on us, but we still need to decrement
> + * the refcount we increased.
> + */
> + refs = atomic_fetch_add(-1, &prev->refs);
> + if (refs == REFS_ON_FREELIST + 1)
> + __freelist_add(prev, list);
I'm curious whether it would be correct to just set prev->refs to zero and
return @prev? That would remove an unneeded "add()&get()" pair (albeit in
an unlikely branch), and __freelist_add() could be folded into
freelist_add() for tidier code.
Thanks
Lai.
On 08/27, Peter Zijlstra wrote:
>
> 1 file changed, 129 insertions(+)
129 lines! And I spent more than 2 hours trying to understand these
129 lines ;) looks correct...
However, I still can't understand the usage of _acquire/release ops
in this code.
> +static inline void __freelist_add(struct freelist_node *node, struct freelist_head *list)
> +{
> + /*
> + * Since the refcount is zero, and nobody can increase it once it's
> + * zero (except us, and we run only one copy of this method per node at
> + * a time, i.e. the single thread case), then we know we can safely
> + * change the next pointer of the node; however, once the refcount is
> + * back above zero, then other threads could increase it (happens under
> + * heavy contention, when the refcount goes to zero in between a load
> + * and a refcount increment of a node in try_get, then back up to
> + * something non-zero, then the refcount increment is done by the other
> + * thread) -- so if the CAS to add the node to the actual list fails,
> + * decrease the refcount and leave the add operation to the next thread
> + * who puts the refcount back to zero (which could be us, hence the
> + * loop).
> + */
> + struct freelist_node *head = READ_ONCE(list->head);
> +
> + for (;;) {
> + WRITE_ONCE(node->next, head);
> + atomic_set_release(&node->refs, 1);
> +
> + if (!try_cmpxchg_release(&list->head, &head, node)) {
OK, these 2 _release above look understandable, they pair with
atomic_try_cmpxchg_acquire/try_cmpxchg_acquire in freelist_try_get().
> + /*
> + * Hmm, the add failed, but we can only try again when
> + * the refcount goes back to zero.
> + */
> + if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> + continue;
Do we really need _release? Why can't atomic_fetch_add_relaxed() work?
> +static inline struct freelist_node *freelist_try_get(struct freelist_head *list)
> +{
> + struct freelist_node *prev, *next, *head = smp_load_acquire(&list->head);
> + unsigned int refs;
> +
> + while (head) {
> + prev = head;
> + refs = atomic_read(&head->refs);
> + if ((refs & REFS_MASK) == 0 ||
> + !atomic_try_cmpxchg_acquire(&head->refs, &refs, refs+1)) {
> + head = smp_load_acquire(&list->head);
> + continue;
> + }
> +
> + /*
> + * Good, reference count has been incremented (it wasn't at
> + * zero), which means we can read the next and not worry about
> + * it changing between now and the time we do the CAS.
> + */
> + next = READ_ONCE(head->next);
> + if (try_cmpxchg_acquire(&list->head, &head, next)) {
> + /*
> + * Yay, got the node. This means it was on the list,
> + * which means should-be-on-freelist must be false no
> + * matter the refcount (because nobody else knows it's
> + * been taken off yet, it can't have been put back on).
> + */
> + WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> +
> + /*
> + * Decrease refcount twice, once for our ref, and once
> + * for the list's ref.
> + */
> + atomic_fetch_add(-2, &head->refs);
Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work?
> + /*
> + * OK, the head must have changed on us, but we still need to decrement
> + * the refcount we increased.
> + */
> + refs = atomic_fetch_add(-1, &prev->refs);
Cosmetic, but why not atomic_fetch_dec()?
Oleg.
On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote:
> On 08/27, Peter Zijlstra wrote:
> >
> > 1 file changed, 129 insertions(+)
>
> 129 lines! And I spent more than 2 hours trying to understand these
> 129 lines ;) looks correct...
Yes, even though it already has a bunch of comments, I do feel we can
maybe improve on that a little.
For now I went for a 1:1 transliteration of the blog post though.
> > + /*
> > + * Yay, got the node. This means it was on the list,
> > + * which means should-be-on-freelist must be false no
> > + * matter the refcount (because nobody else knows it's
> > + * been taken off yet, it can't have been put back on).
> > + */
> > + WARN_ON_ONCE(atomic_read(&head->refs) & REFS_ON_FREELIST);
> > +
> > + /*
> > + * Decrease refcount twice, once for our ref, and once
> > + * for the list's ref.
> > + */
> > + atomic_fetch_add(-2, &head->refs);
>
> Do we need the barriers implied by _fetch_? Why can't atomic_sub(2, refs) work?
I think we can, the original has std::memory_order_relaxed here. So I
should've used atomic_fetch_add_relaxed() but since we don't use the
return value, atomic_sub() would work just fine too.
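That is, in sketch form:

	atomic_sub(2, &head->refs);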
> > + /*
> > + * OK, the head must have changed on us, but we still need to decrement
> > + * the refcount we increased.
> > + */
> > + refs = atomic_fetch_add(-1, &prev->refs);
>
> Cosmetic, but why not atomic_fetch_dec()?
The original had that, I didn't want to risk more bugs by 'improving'
things. But yes, that can definitely become dec().
> I'm curious whether it would be correct to just set prev->refs to zero and
> return @prev? That would remove an unneeded "add()&get()" pair (albeit in
> an unlikely branch), and __freelist_add() could be folded into
> freelist_add() for tidier code.
That is a very good question. I believe it would be correct, yes, and
would certainly simplify the code. The reference count is zero, so
nobody can increment it again, and REFS_ON_FREELIST is set (the
should-be-on-freelist flag), so instead of adding it to the freelist
to be removed later, it can be returned directly.
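A sketch of that simplification (hypothetical, untested):

	refs = atomic_fetch_add(-1, &prev->refs);
	if (refs == REFS_ON_FREELIST + 1) {
		/*
		 * We held the last reference and should-be-on-freelist is
		 * set: nobody else can take a new reference (the count is
		 * zero), so rather than pushing the node back onto the list
		 * only to pop it again, claim it directly.
		 */
		atomic_set(&prev->refs, 0);
		return prev;
	}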
> On Fri, Aug 28, 2020 at 04:46:52PM +0200, Oleg Nesterov wrote:
> > 129 lines! And I spent more than 2 hours trying to understand these
> > 129 lines ;) looks correct...
Hahaha. Sorry about that. Some of the most mind-bending code I've ever
written. :-)
> > + /*
> > + * Hmm, the add failed, but we can only try again when
> > + * the refcount goes back to zero.
> > + */
> > + if (atomic_fetch_add_release(REFS_ON_FREELIST - 1, &node->refs) == 1)
> > + continue;
> Do we really need _release ? Why can't atomic_fetch_add_relaxed() work?
The release is to synchronize with the acquire in the compare-exchange
of try_get. However, I think you're right, between the previous
release-write to the refs and that point, there are no memory effects
that need to be synchronized, and so it could be safely replaced with
a relaxed operation.
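I.e. (sketch):

	if (atomic_fetch_add_relaxed(REFS_ON_FREELIST - 1, &node->refs) == 1)
		continue;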
> Cosmetic, but why not atomic_fetch_dec()?
This is just one of my idiosyncrasies. I prefer exclusively using
atomic add; I find it easier to read. Feel free to change them all to
subs!
Cameron