2023-09-21 21:37:37

by Mirsad Todorovac

[permalink] [raw]
Subject: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

Hi,

On the 6.6-rc2 vanilla torvalds tree kernel, KCSAN had reported a number of data-races:

[ 6413.367326] ==================================================================
[ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

[ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
[ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
[ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
[ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
[ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
[ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
[ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
[ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
[ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
[ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
[ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
[ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)

[ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
[ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
[ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
[ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
[ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
[ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
[ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)

[ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116

[ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
[ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
[ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[ 6413.367660] ==================================================================

For your convenience, took the trouble of finding those suspicious lines of code:

The write side:

lib/maple_tree.c:491
--------------------
456 /*
457 * mas_set_parent() - Set the parent node and encode the slot
458 * @enode: The encoded maple node.
459 * @parent: The encoded maple node that is the parent of @enode.
460 * @slot: The slot that @enode resides in @parent.
461 *
462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
463 * parent type.
464 */
465 static inline
466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
467 const struct maple_enode *parent, unsigned char slot)
468 {
469 unsigned long val = (unsigned long)parent;
470 unsigned long shift;
471 unsigned long type;
472 enum maple_type p_type = mte_node_type(parent);
473
474 MAS_BUG_ON(mas, p_type == maple_dense);
475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
476
477 switch (p_type) {
478 case maple_range_64:
479 case maple_arange_64:
480 shift = MAPLE_PARENT_SLOT_SHIFT;
481 type = MAPLE_PARENT_RANGE64;
482 break;
483 default:
484 case maple_dense:
485 case maple_leaf_64:
486 shift = type = 0;
487 break;
488 }
489
490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
→ 491 val |= (slot << shift) | type;
492 mte_to_node(enode)->parent = ma_parent_ptr(val);
493 }

lib/maple_tree.c:1701
---------------------
1682 /*
1683 * mas_adopt_children() - Set the parent pointer of all nodes in @parent to
1684 * @parent with the slot encoded.
1685 * @mas - the maple state (for the tree)
1686 * @parent - the maple encoded node containing the children.
1687 */
1688 static inline void mas_adopt_children(struct ma_state *mas,
1689 struct maple_enode *parent)
1690 {
1691 enum maple_type type = mte_node_type(parent);
1692 struct maple_node *node = mte_to_node(parent);
1693 void __rcu **slots = ma_slots(node, type);
1694 unsigned long *pivots = ma_pivots(node, type);
1695 struct maple_enode *child;
1696 unsigned char offset;
1697
1698 offset = ma_data_end(node, type, pivots, mas->max);
1699 do {
1700 child = mas_slot_locked(mas, slots, offset);
→ 1701 mas_set_parent(mas, child, parent, offset);
1702 } while (offset--);
1703 }

2562 static inline void mas_topiary_replace(struct ma_state *mas,
2563 struct maple_enode *old_enode)
2564 {
2565 struct ma_state tmp[3], tmp_next[3];
2566 MA_TOPIARY(subtrees, mas->tree);
2567 bool in_rcu;
2568 int i, n;
2569
2570 /* Place data in tree & then mark node as old */
2571 mas_put_in_tree(mas, old_enode);
2572
2573 /* Update the parent pointers in the tree */
2574 tmp[0] = *mas;
2575 tmp[0].offset = 0;
2576 tmp[1].node = MAS_NONE;
2577 tmp[2].node = MAS_NONE;
2578 while (!mte_is_leaf(tmp[0].node)) {
2579 n = 0;
2580 for (i = 0; i < 3; i++) {
2581 if (mas_is_none(&tmp[i]))
2582 continue;
2583
2584 while (n < 3) {
2585 if (!mas_find_child(&tmp[i], &tmp_next[n]))
2586 break;
2587 n++;
2588 }
2589
→ 2590 mas_adopt_children(&tmp[i], tmp[i].node);
2591 }
2592
2593 if (MAS_WARN_ON(mas, n == 0))
2594 break;
2595
2596 while (n < 3)
2597 tmp_next[n++].node = MAS_NONE;
2598
2599 for (i = 0; i < 3; i++)
2600 tmp[i] = tmp_next[i];
2601 }
2602
2603 /* Collect the old nodes that need to be discarded */
2604 if (mte_is_leaf(old_enode))
2605 return mas_free(mas, old_enode);
2606
2607 tmp[0] = *mas;
2608 tmp[0].offset = 0;
2609 tmp[0].node = old_enode;
2610 tmp[1].node = MAS_NONE;
2611 tmp[2].node = MAS_NONE;
2612 in_rcu = mt_in_rcu(mas->tree);
2613 do {
2614 n = 0;
2615 for (i = 0; i < 3; i++) {
2616 if (mas_is_none(&tmp[i]))
2617 continue;

The read side:

527 /*
528 * ma_dead_node() - check if the @enode is dead.
529 * @enode: The encoded maple node
530 *
531 * Return: true if dead, false otherwise.
532 */
533 static inline bool ma_dead_node(const struct maple_node *node)
534 {
535 struct maple_node *parent;
536
537 /* Do not reorder reads from the node prior to the parent check */
538 smp_rmb();
→ 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
540 return (parent == node);
541 }

2767 static inline void *mtree_range_walk(struct ma_state *mas)
2768 {
2769 unsigned long *pivots;
2770 unsigned char offset;
2771 struct maple_node *node;
2772 struct maple_enode *next, *last;
2773 enum maple_type type;
2774 void __rcu **slots;
2775 unsigned char end;
2776 unsigned long max, min;
2777 unsigned long prev_max, prev_min;
2778
2779 next = mas->node;
2780 min = mas->min;
2781 max = mas->max;
2782 do {
2783 offset = 0;
2784 last = next;
2785 node = mte_to_node(next);
2786 type = mte_node_type(next);
2787 pivots = ma_pivots(node, type);
2788 end = ma_data_end(node, type, pivots, max);
→ 2789 if (unlikely(ma_dead_node(node)))
2790 goto dead_node;
2791
2792 if (pivots[offset] >= mas->index) {
2793 prev_max = max;
2794 prev_min = min;
2795 max = pivots[offset];
2796 goto next;
2797 }
2798
2799 do {
2800 offset++;
2801 } while ((offset < end) && (pivots[offset] < mas->index));

As it is evident, ma_dead_node() expands to:

527 /*
528 * ma_dead_node() - check if the @enode is dead.
529 * @enode: The encoded maple node
530 *
531 * Return: true if dead, false otherwise.
532 */
533 static inline bool ma_dead_node(const struct maple_node *node)
534 {
535 struct maple_node *parent;
536
537 /* Do not reorder reads from the node prior to the parent check */
538 smp_rmb();
→ 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
540 return (parent == node);
541 }

as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
validity.

But I do not really understand maple trees to their depth, I am only reporting the symptomatic
outlook of the reported data-race.

This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
open ...

Best regards,
Mirsad Todorovac


2023-09-23 04:27:59

by Liam R. Howlett

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

* Mirsad Todorovac <[email protected]> [230921 17:01]:
> Hi,
>

Hello!

> On the 6.6-rc2 vanilla torvalds tree kernel, KCSAN had reported a number of data-races:
>
> [ 6413.367326] ==================================================================
> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
>
> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
>
> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
>
> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
>
> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> [ 6413.367660] ==================================================================
>
> For your convenience, took the trouble of finding those suspicious lines of code:
>
> The write side:
>
> lib/maple_tree.c:491
> --------------------
> 456 /*
> 457 * mas_set_parent() - Set the parent node and encode the slot
> 458 * @enode: The encoded maple node.
> 459 * @parent: The encoded maple node that is the parent of @enode.
> 460 * @slot: The slot that @enode resides in @parent.
> 461 *
> 462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
> 463 * parent type.
> 464 */
> 465 static inline
> 466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
> 467 const struct maple_enode *parent, unsigned char slot)
> 468 {
> 469 unsigned long val = (unsigned long)parent;
> 470 unsigned long shift;
> 471 unsigned long type;
> 472 enum maple_type p_type = mte_node_type(parent);
> 473
> 474 MAS_BUG_ON(mas, p_type == maple_dense);
> 475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
> 476
> 477 switch (p_type) {
> 478 case maple_range_64:
> 479 case maple_arange_64:
> 480 shift = MAPLE_PARENT_SLOT_SHIFT;
> 481 type = MAPLE_PARENT_RANGE64;
> 482 break;
> 483 default:
> 484 case maple_dense:
> 485 case maple_leaf_64:
> 486 shift = type = 0;
> 487 break;
> 488 }
> 489
> 490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
> → 491 val |= (slot << shift) | type;
> 492 mte_to_node(enode)->parent = ma_parent_ptr(val);
> 493 }
>
> lib/maple_tree.c:1701
> ---------------------
> 1682 /*
> 1683 * mas_adopt_children() - Set the parent pointer of all nodes in @parent to
> 1684 * @parent with the slot encoded.
> 1685 * @mas - the maple state (for the tree)
> 1686 * @parent - the maple encoded node containing the children.
> 1687 */
> 1688 static inline void mas_adopt_children(struct ma_state *mas,
> 1689 struct maple_enode *parent)
> 1690 {
> 1691 enum maple_type type = mte_node_type(parent);
> 1692 struct maple_node *node = mte_to_node(parent);
> 1693 void __rcu **slots = ma_slots(node, type);
> 1694 unsigned long *pivots = ma_pivots(node, type);
> 1695 struct maple_enode *child;
> 1696 unsigned char offset;
> 1697
> 1698 offset = ma_data_end(node, type, pivots, mas->max);
> 1699 do {
> 1700 child = mas_slot_locked(mas, slots, offset);
> → 1701 mas_set_parent(mas, child, parent, offset);
> 1702 } while (offset--);
> 1703 }
>
> 2562 static inline void mas_topiary_replace(struct ma_state *mas,
> 2563 struct maple_enode *old_enode)
> 2564 {
> 2565 struct ma_state tmp[3], tmp_next[3];
> 2566 MA_TOPIARY(subtrees, mas->tree);
> 2567 bool in_rcu;
> 2568 int i, n;
> 2569
> 2570 /* Place data in tree & then mark node as old */
> 2571 mas_put_in_tree(mas, old_enode);
> 2572
> 2573 /* Update the parent pointers in the tree */
> 2574 tmp[0] = *mas;
> 2575 tmp[0].offset = 0;
> 2576 tmp[1].node = MAS_NONE;
> 2577 tmp[2].node = MAS_NONE;
> 2578 while (!mte_is_leaf(tmp[0].node)) {
> 2579 n = 0;
> 2580 for (i = 0; i < 3; i++) {
> 2581 if (mas_is_none(&tmp[i]))
> 2582 continue;
> 2583
> 2584 while (n < 3) {
> 2585 if (!mas_find_child(&tmp[i], &tmp_next[n]))
> 2586 break;
> 2587 n++;
> 2588 }
> 2589
> → 2590 mas_adopt_children(&tmp[i], tmp[i].node);
> 2591 }
> 2592
> 2593 if (MAS_WARN_ON(mas, n == 0))
> 2594 break;
> 2595
> 2596 while (n < 3)
> 2597 tmp_next[n++].node = MAS_NONE;
> 2598
> 2599 for (i = 0; i < 3; i++)
> 2600 tmp[i] = tmp_next[i];
> 2601 }
> 2602
> 2603 /* Collect the old nodes that need to be discarded */
> 2604 if (mte_is_leaf(old_enode))
> 2605 return mas_free(mas, old_enode);
> 2606
> 2607 tmp[0] = *mas;
> 2608 tmp[0].offset = 0;
> 2609 tmp[0].node = old_enode;
> 2610 tmp[1].node = MAS_NONE;
> 2611 tmp[2].node = MAS_NONE;
> 2612 in_rcu = mt_in_rcu(mas->tree);
> 2613 do {
> 2614 n = 0;
> 2615 for (i = 0; i < 3; i++) {
> 2616 if (mas_is_none(&tmp[i]))
> 2617 continue;
>
> The read side:
>
> 527 /*
> 528 * ma_dead_node() - check if the @enode is dead.
> 529 * @enode: The encoded maple node
> 530 *
> 531 * Return: true if dead, false otherwise.
> 532 */
> 533 static inline bool ma_dead_node(const struct maple_node *node)
> 534 {
> 535 struct maple_node *parent;
> 536
> 537 /* Do not reorder reads from the node prior to the parent check */
> 538 smp_rmb();
> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
> 540 return (parent == node);
> 541 }
>
> 2767 static inline void *mtree_range_walk(struct ma_state *mas)
> 2768 {
> 2769 unsigned long *pivots;
> 2770 unsigned char offset;
> 2771 struct maple_node *node;
> 2772 struct maple_enode *next, *last;
> 2773 enum maple_type type;
> 2774 void __rcu **slots;
> 2775 unsigned char end;
> 2776 unsigned long max, min;
> 2777 unsigned long prev_max, prev_min;
> 2778
> 2779 next = mas->node;
> 2780 min = mas->min;
> 2781 max = mas->max;
> 2782 do {
> 2783 offset = 0;
> 2784 last = next;
> 2785 node = mte_to_node(next);
> 2786 type = mte_node_type(next);
> 2787 pivots = ma_pivots(node, type);
> 2788 end = ma_data_end(node, type, pivots, max);
> → 2789 if (unlikely(ma_dead_node(node)))
> 2790 goto dead_node;
> 2791
> 2792 if (pivots[offset] >= mas->index) {
> 2793 prev_max = max;
> 2794 prev_min = min;
> 2795 max = pivots[offset];
> 2796 goto next;
> 2797 }
> 2798
> 2799 do {
> 2800 offset++;
> 2801 } while ((offset < end) && (pivots[offset] < mas->index));
>
> As it is evident, ma_dead_node() expands to:
>
> 527 /*
> 528 * ma_dead_node() - check if the @enode is dead.
> 529 * @enode: The encoded maple node
> 530 *
> 531 * Return: true if dead, false otherwise.
> 532 */
> 533 static inline bool ma_dead_node(const struct maple_node *node)
> 534 {
> 535 struct maple_node *parent;
> 536
> 537 /* Do not reorder reads from the node prior to the parent check */
> 538 smp_rmb();
> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
> 540 return (parent == node);
> 541 }
>
> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
> validity.
>
> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
> outlook of the reported data-race.

This is the most complex part of the maple tree, and unique to the
ability to store a range over multiple existing entries. I recently
rewrote this particular section to avoid a potential live-lock issue.

I am confident that the parent pointer will not be set to the node
pointer as checked by ma_dead_node(). But in an abundance of caution
and to resolve this potential false-positive, I am looking at this in
more detail.

This race is to see if the walk down the tree into unchanged data will
be seen before it is placed in the new subtree with its data also
unchanged. Since we know the parent can never be the node, I feel this
is safe - but I'm not willing to live with the complaints from kasan.

>
> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
> open ...

I appreciate that and your detailed analysis with code pointers of where
this happens. Is this easy to recreate? If so, how? Can you attach
your kernel config?

Thanks,
Liam

2023-09-23 07:29:50

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

On 9/22/23 15:51, Liam R. Howlett wrote:
> * Mirsad Todorovac <[email protected]> [230921 17:01]:
>> Hi,
>>
>
> Hello!

Hi again,

>> On the 6.6-rc2 vanilla torvalds tree kernel, KCSAN had reported a number of data-races:
>>
>> [ 6413.367326] ==================================================================
>> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
>>
>> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
>> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
>> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
>> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
>> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
>> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
>> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
>> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
>> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
>> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
>> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
>> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
>>
>> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
>> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
>> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
>> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
>> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
>> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
>> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
>>
>> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
>>
>> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
>> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> [ 6413.367660] ==================================================================
>>
>> For your convenience, took the trouble of finding those suspicious lines of code:
>>
>> The write side:
>>
>> lib/maple_tree.c:491
>> --------------------
>> 456 /*
>> 457 * mas_set_parent() - Set the parent node and encode the slot
>> 458 * @enode: The encoded maple node.
>> 459 * @parent: The encoded maple node that is the parent of @enode.
>> 460 * @slot: The slot that @enode resides in @parent.
>> 461 *
>> 462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
>> 463 * parent type.
>> 464 */
>> 465 static inline
>> 466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>> 467 const struct maple_enode *parent, unsigned char slot)
>> 468 {
>> 469 unsigned long val = (unsigned long)parent;
>> 470 unsigned long shift;
>> 471 unsigned long type;
>> 472 enum maple_type p_type = mte_node_type(parent);
>> 473
>> 474 MAS_BUG_ON(mas, p_type == maple_dense);
>> 475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
>> 476
>> 477 switch (p_type) {
>> 478 case maple_range_64:
>> 479 case maple_arange_64:
>> 480 shift = MAPLE_PARENT_SLOT_SHIFT;
>> 481 type = MAPLE_PARENT_RANGE64;
>> 482 break;
>> 483 default:
>> 484 case maple_dense:
>> 485 case maple_leaf_64:
>> 486 shift = type = 0;
>> 487 break;
>> 488 }
>> 489
>> 490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
>> → 491 val |= (slot << shift) | type;
>> 492 mte_to_node(enode)->parent = ma_parent_ptr(val);
>> 493 }
>>
>> lib/maple_tree.c:1701
>> ---------------------
>> 1682 /*
>> 1683 * mas_adopt_children() - Set the parent pointer of all nodes in @parent to
>> 1684 * @parent with the slot encoded.
>> 1685 * @mas - the maple state (for the tree)
>> 1686 * @parent - the maple encoded node containing the children.
>> 1687 */
>> 1688 static inline void mas_adopt_children(struct ma_state *mas,
>> 1689 struct maple_enode *parent)
>> 1690 {
>> 1691 enum maple_type type = mte_node_type(parent);
>> 1692 struct maple_node *node = mte_to_node(parent);
>> 1693 void __rcu **slots = ma_slots(node, type);
>> 1694 unsigned long *pivots = ma_pivots(node, type);
>> 1695 struct maple_enode *child;
>> 1696 unsigned char offset;
>> 1697
>> 1698 offset = ma_data_end(node, type, pivots, mas->max);
>> 1699 do {
>> 1700 child = mas_slot_locked(mas, slots, offset);
>> → 1701 mas_set_parent(mas, child, parent, offset);
>> 1702 } while (offset--);
>> 1703 }
>>
>> 2562 static inline void mas_topiary_replace(struct ma_state *mas,
>> 2563 struct maple_enode *old_enode)
>> 2564 {
>> 2565 struct ma_state tmp[3], tmp_next[3];
>> 2566 MA_TOPIARY(subtrees, mas->tree);
>> 2567 bool in_rcu;
>> 2568 int i, n;
>> 2569
>> 2570 /* Place data in tree & then mark node as old */
>> 2571 mas_put_in_tree(mas, old_enode);
>> 2572
>> 2573 /* Update the parent pointers in the tree */
>> 2574 tmp[0] = *mas;
>> 2575 tmp[0].offset = 0;
>> 2576 tmp[1].node = MAS_NONE;
>> 2577 tmp[2].node = MAS_NONE;
>> 2578 while (!mte_is_leaf(tmp[0].node)) {
>> 2579 n = 0;
>> 2580 for (i = 0; i < 3; i++) {
>> 2581 if (mas_is_none(&tmp[i]))
>> 2582 continue;
>> 2583
>> 2584 while (n < 3) {
>> 2585 if (!mas_find_child(&tmp[i], &tmp_next[n]))
>> 2586 break;
>> 2587 n++;
>> 2588 }
>> 2589
>> → 2590 mas_adopt_children(&tmp[i], tmp[i].node);
>> 2591 }
>> 2592
>> 2593 if (MAS_WARN_ON(mas, n == 0))
>> 2594 break;
>> 2595
>> 2596 while (n < 3)
>> 2597 tmp_next[n++].node = MAS_NONE;
>> 2598
>> 2599 for (i = 0; i < 3; i++)
>> 2600 tmp[i] = tmp_next[i];
>> 2601 }
>> 2602
>> 2603 /* Collect the old nodes that need to be discarded */
>> 2604 if (mte_is_leaf(old_enode))
>> 2605 return mas_free(mas, old_enode);
>> 2606
>> 2607 tmp[0] = *mas;
>> 2608 tmp[0].offset = 0;
>> 2609 tmp[0].node = old_enode;
>> 2610 tmp[1].node = MAS_NONE;
>> 2611 tmp[2].node = MAS_NONE;
>> 2612 in_rcu = mt_in_rcu(mas->tree);
>> 2613 do {
>> 2614 n = 0;
>> 2615 for (i = 0; i < 3; i++) {
>> 2616 if (mas_is_none(&tmp[i]))
>> 2617 continue;
>>
>> The read side:
>>
>> 527 /*
>> 528 * ma_dead_node() - check if the @enode is dead.
>> 529 * @enode: The encoded maple node
>> 530 *
>> 531 * Return: true if dead, false otherwise.
>> 532 */
>> 533 static inline bool ma_dead_node(const struct maple_node *node)
>> 534 {
>> 535 struct maple_node *parent;
>> 536
>> 537 /* Do not reorder reads from the node prior to the parent check */
>> 538 smp_rmb();
>> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>> 540 return (parent == node);
>> 541 }
>>
>> 2767 static inline void *mtree_range_walk(struct ma_state *mas)
>> 2768 {
>> 2769 unsigned long *pivots;
>> 2770 unsigned char offset;
>> 2771 struct maple_node *node;
>> 2772 struct maple_enode *next, *last;
>> 2773 enum maple_type type;
>> 2774 void __rcu **slots;
>> 2775 unsigned char end;
>> 2776 unsigned long max, min;
>> 2777 unsigned long prev_max, prev_min;
>> 2778
>> 2779 next = mas->node;
>> 2780 min = mas->min;
>> 2781 max = mas->max;
>> 2782 do {
>> 2783 offset = 0;
>> 2784 last = next;
>> 2785 node = mte_to_node(next);
>> 2786 type = mte_node_type(next);
>> 2787 pivots = ma_pivots(node, type);
>> 2788 end = ma_data_end(node, type, pivots, max);
>> → 2789 if (unlikely(ma_dead_node(node)))
>> 2790 goto dead_node;
>> 2791
>> 2792 if (pivots[offset] >= mas->index) {
>> 2793 prev_max = max;
>> 2794 prev_min = min;
>> 2795 max = pivots[offset];
>> 2796 goto next;
>> 2797 }
>> 2798
>> 2799 do {
>> 2800 offset++;
>> 2801 } while ((offset < end) && (pivots[offset] < mas->index));
>>
>> As it is evident, ma_dead_node() expands to:
>>
>> 527 /*
>> 528 * ma_dead_node() - check if the @enode is dead.
>> 529 * @enode: The encoded maple node
>> 530 *
>> 531 * Return: true if dead, false otherwise.
>> 532 */
>> 533 static inline bool ma_dead_node(const struct maple_node *node)
>> 534 {
>> 535 struct maple_node *parent;
>> 536
>> 537 /* Do not reorder reads from the node prior to the parent check */
>> 538 smp_rmb();
>> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>> 540 return (parent == node);
>> 541 }
>>
>> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
>> validity.
>>
>> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
>> outlook of the reported data-race.
>
> This is the most complex part of the maple tree, and unique to the
> ability to store a range over multiple existing entries. I recently
> rewrote this particular section to avoid a potential live-lock issue.

I see.

> I am confident that the parent pointer will not be set to the node
> pointer as checked by ma_dead_node(). But in an abundance of caution
> and to resolve this potential false-positive, I am looking at this in
> more detail.
>
> This race is to see if the walk down the tree into unchanged data will
> be seen before it is placed in the new subtree with its data also
> unchanged. Since we know the parent can never be the node, I feel this
> is safe - but I'm not willing to live with the complaints from kasan.

I cannot analyse a couple of thousand lines at such a short notice.

It looks suspicious because val in line 491 in a local variable and thus
isn't available elsewhere.

Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
to the code?

>> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
>> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
>> open ...
>
> I appreciate that and your detailed analysis with code pointers of where
> this happens. Is this easy to recreate? If so, how? Can you attach
> your kernel config?

Got that attached first, so I do not forget. :-/

Recreate? Actually it happened quite a number of times on my configuration (480+?).

(Please NOTE: the log is stack decoded, but not deduplicated.)

Apparently, everything and anything uses page allocation, so maybe the config
will help you recreate.

> Thanks,
> Liam

Best regards,
Mirsad


Attachments:
config-6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0.xz (57.10 kB)
mtree-kcsan-01-decoded.log.xz (82.47 kB)
Download all attachments

2023-09-29 04:20:34

by Liam R. Howlett

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk

* Mirsad Todorovac <[email protected]> [230923 03:27]:
> On 9/22/23 15:51, Liam R. Howlett wrote:

...

> > > [ 6413.367326] ==================================================================
> > > [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
> > >
> > > [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
> > > [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
> > > [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
> > > [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
> > > [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
> > > [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
> > > [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
> > > [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
> > > [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
> > > [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
> > > [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
> > > [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
> > >
> > > [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
> > > [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
> > > [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
> > > [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
> > > [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
> > > [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
> > > [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
> > >
> > > [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
> > >
> > > [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
> > > [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> > > [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> > > [ 6413.367660] ==================================================================
> > >
> > > For your convenience, took the trouble of finding those suspicious lines of code:
> > >
> > > The write side:
> > >
> > > lib/maple_tree.c:491
> > > --------------------
> > > 456 /*
> > > 457 * mas_set_parent() - Set the parent node and encode the slot
> > > 458 * @enode: The encoded maple node.
> > > 459 * @parent: The encoded maple node that is the parent of @enode.
> > > 460 * @slot: The slot that @enode resides in @parent.
> > > 461 *
> > > 462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
> > > 463 * parent type.
> > > 464 */
> > > 465 static inline
> > > 466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
> > > 467 const struct maple_enode *parent, unsigned char slot)
> > > 468 {
> > > 469 unsigned long val = (unsigned long)parent;
> > > 470 unsigned long shift;
> > > 471 unsigned long type;
> > > 472 enum maple_type p_type = mte_node_type(parent);
> > > 473
> > > 474 MAS_BUG_ON(mas, p_type == maple_dense);
> > > 475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
> > > 476
> > > 477 switch (p_type) {
> > > 478 case maple_range_64:
> > > 479 case maple_arange_64:
> > > 480 shift = MAPLE_PARENT_SLOT_SHIFT;
> > > 481 type = MAPLE_PARENT_RANGE64;
> > > 482 break;
> > > 483 default:
> > > 484 case maple_dense:
> > > 485 case maple_leaf_64:
> > > 486 shift = type = 0;
> > > 487 break;
> > > 488 }
> > > 489
> > > 490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
> > > → 491 val |= (slot << shift) | type;
> > > 492 mte_to_node(enode)->parent = ma_parent_ptr(val);
> > > 493 }

This should probably be 492, not 491. I know what is racing here and it
shouldn't cause issue.

...
> > > The read side:
> > >
> > > 527 /*
> > > 528 * ma_dead_node() - check if the @enode is dead.
> > > 529 * @enode: The encoded maple node
> > > 530 *
> > > 531 * Return: true if dead, false otherwise.
> > > 532 */
> > > 533 static inline bool ma_dead_node(const struct maple_node *node)
> > > 534 {
> > > 535 struct maple_node *parent;
> > > 536
> > > 537 /* Do not reorder reads from the node prior to the parent check */
> > > 538 smp_rmb();
> > > → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
> > > 540 return (parent == node);
> > > 541 }

This is the correct line.

...
> > >
> > > as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
> > > validity.
> > >
> > > But I do not really understand maple trees to their depth, I am only reporting the symptomatic
> > > outlook of the reported data-race.
> >
> > This is the most complex part of the maple tree, and unique to the
> > ability to store a range over multiple existing entries. I recently
> > rewrote this particular section to avoid a potential live-lock issue.
>
> I see.
>
> > I am confident that the parent pointer will not be set to the node
> > pointer as checked by ma_dead_node(). But in an abundance of caution
> > and to resolve this potential false-positive, I am looking at this in
> > more detail.
> >
> > This race is to see if the walk down the tree into unchanged data will
> > be seen before it is placed in the new subtree with its data also
> > unchanged. Since we know the parent can never be the node, I feel this
> > is safe - but I'm not willing to live with the complaints from kasan.
>
> I cannot analyse a couple of thousand lines at such a short notice.

Don't worry, I will :)

>
> It looks suspicious because val in line 491 in a local variable and thus
> isn't available elsewhere.

It is used in the node->parent, as described above. It is a race, but
it doesn't matter who wins.

>
> Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
> to the code?

Probably off-by-one line.

>
> > > This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
> > > interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
> > > open ...
> >
> > I appreciate that and your detailed analysis with code pointers of where
> > this happens. Is this easy to recreate? If so, how? Can you attach
> > your kernel config?
>
> Got that attached first, so I do not forget. :-/
>
> Recreate? Actually it happened quite a number of times on my configuration (480+?).

I'm having issues recreating it because I hit a lot of races before this
one in my test setup. I will keep working on reproducing this race, but
in the mean time can you test the attached patch with your setup?

...

Thanks,
Liam


Attachments:
(No filename) (8.23 kB)
race_fix.patch (393.00 B)
Download all attachments

2023-09-30 07:05:14

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk



On 9/28/23 21:59, Liam R. Howlett wrote:
> * Mirsad Todorovac <[email protected]> [230923 03:27]:
>> On 9/22/23 15:51, Liam R. Howlett wrote:
>
> ...
>
>>>> [ 6413.367326] ==================================================================
>>>> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
>>>>
>>>> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
>>>> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
>>>> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
>>>> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
>>>> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
>>>> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
>>>> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
>>>> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
>>>> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
>>>> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
>>>> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
>>>> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
>>>>
>>>> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
>>>> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
>>>> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
>>>> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
>>>> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
>>>> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
>>>> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
>>>>
>>>> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
>>>>
>>>> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>>>> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>>> [ 6413.367660] ==================================================================
>>>>
>>>> For your convenience, took the trouble of finding those suspicious lines of code:
>>>>
>>>> The write side:
>>>>
>>>> lib/maple_tree.c:491
>>>> --------------------
>>>> 456 /*
>>>> 457 * mas_set_parent() - Set the parent node and encode the slot
>>>> 458 * @enode: The encoded maple node.
>>>> 459 * @parent: The encoded maple node that is the parent of @enode.
>>>> 460 * @slot: The slot that @enode resides in @parent.
>>>> 461 *
>>>> 462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
>>>> 463 * parent type.
>>>> 464 */
>>>> 465 static inline
>>>> 466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>>>> 467 const struct maple_enode *parent, unsigned char slot)
>>>> 468 {
>>>> 469 unsigned long val = (unsigned long)parent;
>>>> 470 unsigned long shift;
>>>> 471 unsigned long type;
>>>> 472 enum maple_type p_type = mte_node_type(parent);
>>>> 473
>>>> 474 MAS_BUG_ON(mas, p_type == maple_dense);
>>>> 475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
>>>> 476
>>>> 477 switch (p_type) {
>>>> 478 case maple_range_64:
>>>> 479 case maple_arange_64:
>>>> 480 shift = MAPLE_PARENT_SLOT_SHIFT;
>>>> 481 type = MAPLE_PARENT_RANGE64;
>>>> 482 break;
>>>> 483 default:
>>>> 484 case maple_dense:
>>>> 485 case maple_leaf_64:
>>>> 486 shift = type = 0;
>>>> 487 break;
>>>> 488 }
>>>> 489
>>>> 490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
>>>> → 491 val |= (slot << shift) | type;
>>>> 492 mte_to_node(enode)->parent = ma_parent_ptr(val);
>>>> 493 }
>
> This should probably be 492, not 491. I know what is racing here and it
> shouldn't cause issue.

That seems logical to me, too.

> ...
>>>> The read side:
>>>>
>>>> 527 /*
>>>> 528 * ma_dead_node() - check if the @enode is dead.
>>>> 529 * @enode: The encoded maple node
>>>> 530 *
>>>> 531 * Return: true if dead, false otherwise.
>>>> 532 */
>>>> 533 static inline bool ma_dead_node(const struct maple_node *node)
>>>> 534 {
>>>> 535 struct maple_node *parent;
>>>> 536
>>>> 537 /* Do not reorder reads from the node prior to the parent check */
>>>> 538 smp_rmb();
>>>> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>>>> 540 return (parent == node);
>>>> 541 }
>
> This is the correct line.

Good. Now it makes sense.

> ...
>>>>
>>>> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
>>>> validity.
>>>>
>>>> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
>>>> outlook of the reported data-race.
>>>
>>> This is the most complex part of the maple tree, and unique to the
>>> ability to store a range over multiple existing entries. I recently
>>> rewrote this particular section to avoid a potential live-lock issue.
>>
>> I see.
>>
>>> I am confident that the parent pointer will not be set to the node
>>> pointer as checked by ma_dead_node(). But in an abundance of caution
>>> and to resolve this potential false-positive, I am looking at this in
>>> more detail.
>>>
>>> This race is to see if the walk down the tree into unchanged data will
>>> be seen before it is placed in the new subtree with its data also
>>> unchanged. Since we know the parent can never be the node, I feel this
>>> is safe - but I'm not willing to live with the complaints from kasan.
>>
>> I cannot analyse a couple of thousand lines at such a short notice.
>
> Don't worry, I will :)

:-)

>> It looks suspicious because val in line 491 in a local variable and thus
>> isn't available elsewhere.
>
> It is used in the node->parent, as described above. It is a race, but
> it doesn't matter who wins.
>
>>
>> Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
>> to the code?
>
> Probably off-by-one line.
>
>>
>>>> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
>>>> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
>>>> open ...
>>>
>>> I appreciate that and your detailed analysis with code pointers of where
>>> this happens. Is this easy to recreate? If so, how? Can you attach
>>> your kernel config?
>>
>> Got that attached first, so I do not forget. :-/
>>
>> Recreate? Actually it happened quite a number of times on my configuration (480+?).
>
> I'm having issues recreating it because I hit a lot of races before this
> one in my test setup. I will keep working on reproducing this race, but
> in the mean time can you test the attached patch with your setup?

I have tried that on the recent vanilla torvalds tree kernel, and it still shows a number of
data-races:

[ 157.300020] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 181.319610] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 196.868714] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 201.291452] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 226.185733] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 227.996664] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 228.447678] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 229.270275] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[ 229.889751] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 231.469987] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 234.872378] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 235.159590] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 235.186811] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 239.658108] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 241.893093] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[ 242.510624] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 244.692616] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 245.122822] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 246.000424] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 246.678092] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 251.742829] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[ 251.762047] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 254.148155] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 263.442284] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[ 267.004013] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 270.086804] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 270.116319] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 353.218023] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 962.775499] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 962.783627] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 964.604473] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 967.502757] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 967.539662] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 970.156674] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 970.629596] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[ 970.718835] BUG: KCSAN: data-race in mas_topiary_replace / mas_walk
[ 971.023748] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 971.058710] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 971.584682] BUG: KCSAN: data-race in mas_wr_modify / mtree_range_walk
[ 973.808208] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 974.993192] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 975.076067] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 977.460561] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 977.715483] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 978.078659] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 978.186831] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 995.034856] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1006.951970] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 1072.950367] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1146.949829] BUG: KCSAN: data-race in mas_wr_node_store / mtree_range_walk
[ 1174.940825] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
[ 1245.325405] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1287.939229] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk
[ 1523.333994] BUG: KCSAN: data-race in __call_rcu_common.constprop.0 / mtree_range_walk

However, for example, this one doesn't make sense:

==================================================================
BUG: KCSAN: data-race in mas_topiary_replace / mas_walk

write to 0xffff88811a9e0a00 of 8 bytes by task 9477 on cpu 15:
mas_topiary_replace (lib/maple_tree.c:304 lib/maple_tree.c:1731 lib/maple_tree.c:2572)
mas_spanning_rebalance.isra.0 (lib/maple_tree.c:2665 lib/maple_tree.c:2962)
mas_wr_spanning_store.isra.0 (lib/maple_tree.c:3895)
mas_wr_store_entry.isra.0 (lib/maple_tree.c:4243)
mas_store_gfp (lib/maple_tree.c:6090 lib/maple_tree.c:5386)
do_vmi_align_munmap (mm/internal.h:1090 mm/mmap.c:2535)
do_vmi_munmap (mm/mmap.c:2623)
mmap_region (mm/mmap.c:2673)
do_mmap (mm/mmap.c:1354)
vm_mmap_pgoff (mm/util.c:546)
ksys_mmap_pgoff (mm/mmap.c:1400)
__x64_sys_mmap (arch/x86/kernel/sys_x86_64.c:86)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)

read to 0xffff88811a9e0a00 of 8 bytes by task 9475 on cpu 28:
mas_walk (lib/maple_tree.c:524 lib/maple_tree.c:556 lib/maple_tree.c:1375 lib/maple_tree.c:1359 lib/maple_tree.c:3690 lib/maple_tree.c:4844)
lock_vma_under_rcu (mm/memory.c:5436)
do_user_addr_fault (arch/x86/mm/fault.c:1357)
exc_page_fault (./arch/x86/include/asm/paravirt.h:695 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1561)
asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:570)

value changed: 0xffff88811a7e6341 -> 0xffff88811a9e0a00

Reported by Kernel Concurrency Sanitizer on:
CPU: 28 PID: 9475 Comm: chrome Tainted: G L 6.6.0-rc3-kcsan-00146-g9f3ebbef746f-dirty #2
Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
==================================================================

On the write side, you have:

302 static inline void mte_set_node_dead(struct maple_enode *mn)
303 {
→ 304 mte_to_node(mn)->parent = ma_parent_ptr(mte_to_node(mn));
305 smp_wmb(); /* Needed for RCU */
306 }

On the read side:

521 static inline struct maple_node *mte_parent(const struct maple_enode *enode)
522 {
523 return (void *)((unsigned long)
→ 524 (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
525 }

and:

549 static inline bool mte_dead_node(const struct maple_enode *enode)
550 {
551 struct maple_node *parent, *node;
552
553 node = mte_to_node(enode);
554 /* Do not reorder reads from the node prior to the parent check */
555 smp_rmb();
→ 556 parent = mte_parent(enode);
557 return (parent == node);
558 }

I see both smp_wmb() and smp_rmb().

I don't know if this is a KCSAN false positive because I haven't seen what my GCC assembly looks like.

Maybe my AMD Ryzen 9 7950X with 16 cores / 32 threads aggravated things more than it is expected,
and it is not doing what it is supposed to with the memory barriers?

Looks like GCC did some "smart optimisation" of the write in line 304 or read in line 556?

I am only trying to imagine what the maple tree is going through attacked from 16 cores, 32 threads
and couple of hundred threads ...

Please find attached the decoded but not deduplicated dmesg logs. The config is attached, but it is
unmodified from the previous build.

>
> ...
>
> Thanks,
> Liam

Not at all. I am really puzzled by this and as I said it is a great exercise for my little grey cells.

I've caught a glimpse of Mr. McKenney's article: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf
and since 2010 things probably got much worse.

I would try this:

Write side:

302 static inline void mte_set_node_dead(struct maple_enode *mn)
303 {
→ 304 WRITE_ONCE(mte_to_node(mn)->parent, ma_parent_ptr(mte_to_node(mn)));
305 smp_wmb(); /* Needed for RCU */
306 }

On the read side:

521 static inline struct maple_node *mte_parent(const struct maple_enode *enode)
522 {
523 return (void *)((unsigned long)
→ 524 READ_ONCE(mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
525 }

and:

549 static inline bool mte_dead_node(const struct maple_enode *enode)
550 {
551 struct maple_node *parent, *node;
552
553 node = mte_to_node(enode);
554 /* Do not reorder reads from the node prior to the parent check */
555 smp_rmb();
→ 556 parent = mte_parent(enode);
557 return (parent == node);
558 }

... but I could as well lobotomise my Linux with experiments like that.

Best regards,
Mirsad Todorovac


Attachments:
mtree-kcsan-20230930-decoded.log.xz (2.35 kB)
mas-kcsan-20230930-02-decoded.log.xz (0.98 kB)
config-6.6.0-rc3-kcsan-00146-g9f3ebbef746f-dirty.xz (57.10 kB)
Download all attachments

2023-10-01 21:09:19

by Mirsad Todorovac

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk [EXPERIMENTAL PATCH]

On 9/28/23 21:59, Liam R. Howlett wrote:
> * Mirsad Todorovac <[email protected]> [230923 03:27]:
>> On 9/22/23 15:51, Liam R. Howlett wrote:
>
> ...
>
>>>> [ 6413.367326] ==================================================================
>>>> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
>>>>
>>>> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
>>>> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
>>>> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
>>>> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
>>>> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
>>>> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
>>>> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
>>>> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
>>>> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
>>>> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
>>>> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
>>>> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
>>>>
>>>> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
>>>> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
>>>> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
>>>> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
>>>> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
>>>> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
>>>> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
>>>>
>>>> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
>>>>
>>>> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
>>>> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G L 6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>>>> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>>>> [ 6413.367660] ==================================================================
>>>>
>>>> For your convenience, took the trouble of finding those suspicious lines of code:
>>>>
>>>> The write side:
>>>>
>>>> lib/maple_tree.c:491
>>>> --------------------
>>>> 456 /*
>>>> 457 * mas_set_parent() - Set the parent node and encode the slot
>>>> 458 * @enode: The encoded maple node.
>>>> 459 * @parent: The encoded maple node that is the parent of @enode.
>>>> 460 * @slot: The slot that @enode resides in @parent.
>>>> 461 *
>>>> 462 * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
>>>> 463 * parent type.
>>>> 464 */
>>>> 465 static inline
>>>> 466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>>>> 467 const struct maple_enode *parent, unsigned char slot)
>>>> 468 {
>>>> 469 unsigned long val = (unsigned long)parent;
>>>> 470 unsigned long shift;
>>>> 471 unsigned long type;
>>>> 472 enum maple_type p_type = mte_node_type(parent);
>>>> 473
>>>> 474 MAS_BUG_ON(mas, p_type == maple_dense);
>>>> 475 MAS_BUG_ON(mas, p_type == maple_leaf_64);
>>>> 476
>>>> 477 switch (p_type) {
>>>> 478 case maple_range_64:
>>>> 479 case maple_arange_64:
>>>> 480 shift = MAPLE_PARENT_SLOT_SHIFT;
>>>> 481 type = MAPLE_PARENT_RANGE64;
>>>> 482 break;
>>>> 483 default:
>>>> 484 case maple_dense:
>>>> 485 case maple_leaf_64:
>>>> 486 shift = type = 0;
>>>> 487 break;
>>>> 488 }
>>>> 489
>>>> 490 val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
>>>> → 491 val |= (slot << shift) | type;
>>>> 492 mte_to_node(enode)->parent = ma_parent_ptr(val);
>>>> 493 }
>
> This should probably be 492, not 491. I know what is racing here and it
> shouldn't cause issue.
>
> ...
>>>> The read side:
>>>>
>>>> 527 /*
>>>> 528 * ma_dead_node() - check if the @enode is dead.
>>>> 529 * @enode: The encoded maple node
>>>> 530 *
>>>> 531 * Return: true if dead, false otherwise.
>>>> 532 */
>>>> 533 static inline bool ma_dead_node(const struct maple_node *node)
>>>> 534 {
>>>> 535 struct maple_node *parent;
>>>> 536
>>>> 537 /* Do not reorder reads from the node prior to the parent check */
>>>> 538 smp_rmb();
>>>> → 539 parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>>>> 540 return (parent == node);
>>>> 541 }
>
> This is the correct line.
>
> ...
>>>>
>>>> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
>>>> validity.
>>>>
>>>> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
>>>> outlook of the reported data-race.
>>>
>>> This is the most complex part of the maple tree, and unique to the
>>> ability to store a range over multiple existing entries. I recently
>>> rewrote this particular section to avoid a potential live-lock issue.
>>
>> I see.
>>
>>> I am confident that the parent pointer will not be set to the node
>>> pointer as checked by ma_dead_node(). But in an abundance of caution
>>> and to resolve this potential false-positive, I am looking at this in
>>> more detail.
>>>
>>> This race is to see if the walk down the tree into unchanged data will
>>> be seen before it is placed in the new subtree with its data also
>>> unchanged. Since we know the parent can never be the node, I feel this
>>> is safe - but I'm not willing to live with the complaints from kasan.
>>
>> I cannot analyse a couple of thousand lines at such a short notice.
>
> Don't worry, I will :)
>
>>
>> It looks suspicious because val in line 491 in a local variable and thus
>> isn't available elsewhere.
>
> It is used in the node->parent, as described above. It is a race, but
> it doesn't matter who wins.
>
>>
>> Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
>> to the code?
>
> Probably off-by-one line.
>
>>
>>>> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
>>>> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
>>>> open ...
>>>
>>> I appreciate that and your detailed analysis with code pointers of where
>>> this happens. Is this easy to recreate? If so, how? Can you attach
>>> your kernel config?
>>
>> Got that attached first, so I do not forget. :-/
>>
>> Recreate? Actually it happened quite a number of times on my configuration (480+?).
>
> I'm having issues recreating it because I hit a lot of races before this
> one in my test setup. I will keep working on reproducing this race, but
> in the mean time can you test the attached patch with your setup?

They say that one patch speaks more than a thousand words.

I am just running this little patch that actually silences all of the KCSAN warnings.

I cannot tell if these reported data races are the actual bugs, but it is possible that
the Ubuntu 22.04 gcc is doing some funny stuff when optimising. In Prof. McKenney's
book I've read about the load-tearing and store-tearing. AFAICS, memory barriers should
prevent load/store reordering, but not the compiler optimisations.

Please find two versions of the patch attached.

While mas->index and pivots[offset] in maple_range_walk can change concurrently,
I am not smart enough to see whether you expect that in your algorithm or is it a potential
bug triggered by GCC optimisations and aggressive Ryzen 9 7950X parallelism.

This patch is being tested and I am right now writing from a kernel build using this diff v2.
Your patch included.

Best regards,
Mirsad Todorovac

marvin@defiant:~/linux/kernel/torvalds3$ uname -rms
Linux 6.6.0-rc3-kcsan-00230-g8f633369413f-dirty x86_64
marvin@defiant:~/linux/kernel/torvalds3$ git diff | cat -

--------------------------------------------------------------
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ee1ff0c59fd7..472ebc43d9fe 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -58,6 +58,7 @@
#include <linux/slab.h>
#include <linux/limits.h>
#include <asm/barrier.h>
+#include <asm/rwonce.h>

#define CREATE_TRACE_POINTS
#include <trace/events/maple_tree.h>
@@ -301,7 +302,7 @@ static inline struct maple_node *mas_mn(const struct ma_state *mas)
*/
static inline void mte_set_node_dead(struct maple_enode *mn)
{
- mte_to_node(mn)->parent = ma_parent_ptr(mte_to_node(mn));
+ WRITE_ONCE(mte_to_node(mn)->parent, ma_parent_ptr(mte_to_node(mn)));
smp_wmb(); /* Needed for RCU */
}

@@ -521,7 +522,7 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
static inline struct maple_node *mte_parent(const struct maple_enode *enode)
{
return (void *)((unsigned long)
- (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
+ (READ_ONCE(mte_to_node(enode)->parent)) & ~MAPLE_NODE_MASK);
}

/*
@@ -536,7 +537,7 @@ static inline bool ma_dead_node(const struct maple_node *node)

/* Do not reorder reads from the node prior to the parent check */
smp_rmb();
- parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
+ parent = (void *)((unsigned long) READ_ONCE(node->parent) & ~MAPLE_NODE_MASK);
return (parent == node);
}

@@ -1699,6 +1700,7 @@ static inline void mas_adopt_children(struct ma_state *mas,
do {
child = mas_slot_locked(mas, slots, offset);
mas_set_parent(mas, child, parent, offset);
+ smp_wmb(); /* Needed for RCU */
} while (offset--);
}

@@ -2775,6 +2777,7 @@ static inline void *mtree_range_walk(struct ma_state *mas)
unsigned char end;
unsigned long max, min;
unsigned long prev_max, prev_min;
+ unsigned long my_pivot, mas_index;

next = mas->node;
min = mas->min;
@@ -2789,22 +2792,27 @@ static inline void *mtree_range_walk(struct ma_state *mas)
if (unlikely(ma_dead_node(node)))
goto dead_node;

- if (pivots[offset] >= mas->index) {
+ my_pivot = READ_ONCE(pivots[offset]);
+ mas_index = READ_ONCE(mas->index);
+
+ if (my_pivot >= mas_index) {
prev_max = max;
prev_min = min;
- max = pivots[offset];
+ max = my_pivot;
goto next;
}

do {
offset++;
- } while ((offset < end) && (pivots[offset] < mas->index));
+ my_pivot = READ_ONCE(pivots[offset]);
+ } while ((offset < end) && (my_pivot < mas_index));

prev_min = min;
- min = pivots[offset - 1] + 1;
+ min = READ_ONCE(pivots[offset - 1]) + 1;
prev_max = max;
- if (likely(offset < end && pivots[offset]))
- max = pivots[offset];
+
+ if (likely(offset < end && my_pivot))
+ max = my_pivot;

next:
slots = ma_slots(node, type);
marvin@defiant:~/linux/kernel/torvalds3$


Attachments:
maple-cummulative-v2.diff (2.64 kB)
maple-cummulative.diff (2.63 kB)
Download all attachments

2023-10-02 21:03:15

by Liam R. Howlett

[permalink] [raw]
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace / mtree_range_walk [EXPERIMENTAL PATCH]

* Mirsad Todorovac <[email protected]> [231001 17:09]:

...

>
> They say that one patch speaks more than a thousand words.
>
> I am just running this little patch that actually silences all of the KCSAN warnings.
>
> I cannot tell if these reported data races are the actual bugs, but it is possible that
> the Ubuntu 22.04 gcc is doing some funny stuff when optimising. In Prof. McKenney's
> book I've read about the load-tearing and store-tearing. AFAICS, memory barriers should
> prevent load/store reordering, but not the compiler optimisations.
>
> Please find two versions of the patch attached.
>
> While mas->index and pivots[offset] in maple_range_walk can change concurrently,
> I am not smart enough to see whether you expect that in your algorithm or is it a potential
> bug triggered by GCC optimisations and aggressive Ryzen 9 7950X parallelism.

None of this is necessary, for sure.

I will have to look at this when I have more time to investigate. This
will likely not be soon, however.

Thanks,
Liam