Commit 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first
object iteration loop of kmemleak_scan()") fixes soft lockup problem
in kmemleak_scan() by periodically doing a cond_resched(). It does
take a reference of the current object before doing it. Unfortunately,
if the object has been deleted from the object_list, the next object
pointed to by its next pointer may no longer be valid after coming
back from cond_resched(). This can result in use-after-free and other
nasty problem.
Fix this problem by adding a del_state flag into kmemleak_object
structure to synchronize the object deletion process between
kmemleak_cond_resched() and __remove_object() to make sure that the
object remained in the object_list in the duration of the cond_resched()
call.
Fixes: 6edda04ccc7c ("mm/kmemleak: prevent soft lockup in first object iteration loop of kmemleak_scan()")
Signed-off-by: Waiman Long <[email protected]>
---
mm/kmemleak.c | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index e7cb521236bf..0ece170fc9ef 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -13,11 +13,12 @@
*
* The following locks and mutexes are used by kmemleak:
*
- * - kmemleak_lock (raw_spinlock_t): protects the object_list modifications and
- * accesses to the object_tree_root (or object_phys_tree_root). The
- * object_list is the main list holding the metadata (struct kmemleak_object)
- * for the allocated memory blocks. The object_tree_root and object_phys_tree_root
- * are red black trees used to look-up metadata based on a pointer to the
+ * - kmemleak_lock (raw_spinlock_t): protects the object_list as well as
+ * del_state modifications and accesses to the object_tree_root (or
+ * object_phys_tree_root). The object_list is the main list holding the
+ * metadata (struct kmemleak_object) for the allocated memory blocks.
+ * The object_tree_root and object_phys_tree_root are red
+ * black trees used to look-up metadata based on a pointer to the
* corresponding memory block. The object_phys_tree_root is for objects
* allocated with physical address. The kmemleak_object structures are
* added to the object_list and object_tree_root (or object_phys_tree_root)
@@ -147,6 +148,7 @@ struct kmemleak_object {
struct rcu_head rcu; /* object_list lockless traversal */
/* object usage count; object freed when use_count == 0 */
atomic_t use_count;
+ unsigned int del_state; /* deletion state */
unsigned long pointer;
size_t size;
/* pass surplus references to this pointer */
@@ -177,6 +179,11 @@ struct kmemleak_object {
/* flag set for object allocated with physical address */
#define OBJECT_PHYS (1 << 4)
+/* set when __remove_object() called */
+#define DELSTATE_REMOVED (1 << 0)
+/* set to temporarily prevent deletion from object_list */
+#define DELSTATE_NO_DELETE (1 << 1)
+
#define HEX_PREFIX " "
/* number of bytes to print per line; must be 16 or 32 */
#define HEX_ROW_SIZE 16
@@ -567,7 +574,9 @@ static void __remove_object(struct kmemleak_object *object)
rb_erase(&object->rb_node, object->flags & OBJECT_PHYS ?
&object_phys_tree_root :
&object_tree_root);
- list_del_rcu(&object->object_list);
+ if (!(object->del_state & DELSTATE_NO_DELETE))
+ list_del_rcu(&object->object_list);
+ object->del_state |= DELSTATE_REMOVED;
}
/*
@@ -633,6 +642,7 @@ static void __create_object(unsigned long ptr, size_t size,
object->count = 0; /* white color initially */
object->jiffies = jiffies;
object->checksum = 0;
+ object->del_state = 0;
/* task information */
if (in_hardirq()) {
@@ -1470,9 +1480,22 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
if (!get_object(object))
return; /* Try next object */
+ raw_spin_lock_irq(&kmemleak_lock);
+ if (object->del_state & DELSTATE_REMOVED)
+ goto unlock_put; /* Object removed */
+ object->del_state |= DELSTATE_NO_DELETE;
+ raw_spin_unlock_irq(&kmemleak_lock);
+
rcu_read_unlock();
cond_resched();
rcu_read_lock();
+
+ raw_spin_lock_irq(&kmemleak_lock);
+ if (object->del_state & DELSTATE_REMOVED)
+ list_del_rcu(&object->object_list);
+ object->del_state &= ~DELSTATE_NO_DELETE;
+unlock_put:
+ raw_spin_unlock_irq(&kmemleak_lock);
put_object(object);
}
--
2.31.1
Hi Waiman,
Thanks for your effort on trying to fix this.
On Wed, Jan 18, 2023 at 11:01:11PM -0500, Waiman Long wrote:
> @@ -567,7 +574,9 @@ static void __remove_object(struct kmemleak_object *object)
> rb_erase(&object->rb_node, object->flags & OBJECT_PHYS ?
> &object_phys_tree_root :
> &object_tree_root);
> - list_del_rcu(&object->object_list);
> + if (!(object->del_state & DELSTATE_NO_DELETE))
> + list_del_rcu(&object->object_list);
> + object->del_state |= DELSTATE_REMOVED;
> }
So IIUC, this prevents the current object being scanned from being
removed from the list during the kmemleak_cond_resched() call.
> /*
> @@ -633,6 +642,7 @@ static void __create_object(unsigned long ptr, size_t size,
> object->count = 0; /* white color initially */
> object->jiffies = jiffies;
> object->checksum = 0;
> + object->del_state = 0;
>
> /* task information */
> if (in_hardirq()) {
> @@ -1470,9 +1480,22 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
> if (!get_object(object))
> return; /* Try next object */
>
> + raw_spin_lock_irq(&kmemleak_lock);
> + if (object->del_state & DELSTATE_REMOVED)
> + goto unlock_put; /* Object removed */
> + object->del_state |= DELSTATE_NO_DELETE;
> + raw_spin_unlock_irq(&kmemleak_lock);
> +
> rcu_read_unlock();
> cond_resched();
> rcu_read_lock();
> +
> + raw_spin_lock_irq(&kmemleak_lock);
> + if (object->del_state & DELSTATE_REMOVED)
> + list_del_rcu(&object->object_list);
> + object->del_state &= ~DELSTATE_NO_DELETE;
> +unlock_put:
> + raw_spin_unlock_irq(&kmemleak_lock);
> put_object(object);
> }
I'm not sure this was the only problem. We do have the problem that the
current object may be removed from the list, solved above, but another
scenario I had in mind is the next object being released during this
brief resched period. The RCU relies on object->next->next being valid
but, with a brief rcu_read_unlock(), the object->next could be freed,
reallocated, so object->next->next invalid.
--
Catalin
On 1/20/23 14:18, Catalin Marinas wrote:
> Hi Waiman,
>
> Thanks for your effort on trying to fix this.
>
> On Wed, Jan 18, 2023 at 11:01:11PM -0500, Waiman Long wrote:
>> @@ -567,7 +574,9 @@ static void __remove_object(struct kmemleak_object *object)
>> rb_erase(&object->rb_node, object->flags & OBJECT_PHYS ?
>> &object_phys_tree_root :
>> &object_tree_root);
>> - list_del_rcu(&object->object_list);
>> + if (!(object->del_state & DELSTATE_NO_DELETE))
>> + list_del_rcu(&object->object_list);
>> + object->del_state |= DELSTATE_REMOVED;
>> }
> So IIUC, this prevents the current object being scanned from being
> removed from the list during the kmemleak_cond_resched() call.
Yes, that is the point.
>
>> /*
>> @@ -633,6 +642,7 @@ static void __create_object(unsigned long ptr, size_t size,
>> object->count = 0; /* white color initially */
>> object->jiffies = jiffies;
>> object->checksum = 0;
>> + object->del_state = 0;
>>
>> /* task information */
>> if (in_hardirq()) {
>> @@ -1470,9 +1480,22 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
>> if (!get_object(object))
>> return; /* Try next object */
>>
>> + raw_spin_lock_irq(&kmemleak_lock);
>> + if (object->del_state & DELSTATE_REMOVED)
>> + goto unlock_put; /* Object removed */
>> + object->del_state |= DELSTATE_NO_DELETE;
>> + raw_spin_unlock_irq(&kmemleak_lock);
>> +
>> rcu_read_unlock();
>> cond_resched();
>> rcu_read_lock();
>> +
>> + raw_spin_lock_irq(&kmemleak_lock);
>> + if (object->del_state & DELSTATE_REMOVED)
>> + list_del_rcu(&object->object_list);
>> + object->del_state &= ~DELSTATE_NO_DELETE;
>> +unlock_put:
>> + raw_spin_unlock_irq(&kmemleak_lock);
>> put_object(object);
>> }
> I'm not sure this was the only problem. We do have the problem that the
> current object may be removed from the list, solved above, but another
> scenario I had in mind is the next object being released during this
> brief resched period. The RCU relies on object->next->next being valid
> but, with a brief rcu_read_unlock(), the object->next could be freed,
> reallocated, so object->next->next invalid.
Looking at the following scenario,
object->next => A (removed)
A->next => B (removed)
As object->next is pointing to A, A must still be allocated and not
freed yet. Now if B is also removed, there are 2 possible case.
1) B is removed from the list after the removal of A. In that case, it
is not possible that A is allocated, but B is freed.
2) B is removed before A. A->next can't pointed to B when it is being
removed. Due to weak memory ordering, it is possible that another cpu
can see A->next still pointing to B. In that case, I believe that it is
still within the grace period where neither A or B is freed.
In fact, it is no different from a regular scanning of the object list
without ever called cond_resched().
Cheers,
Longman
On Fri, Jan 20, 2023 at 05:54:28PM -0500, Waiman Long wrote:
> On 1/20/23 14:18, Catalin Marinas wrote:
> > > /*
> > > @@ -633,6 +642,7 @@ static void __create_object(unsigned long ptr, size_t size,
> > > object->count = 0; /* white color initially */
> > > object->jiffies = jiffies;
> > > object->checksum = 0;
> > > + object->del_state = 0;
> > > /* task information */
> > > if (in_hardirq()) {
> > > @@ -1470,9 +1480,22 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
> > > if (!get_object(object))
> > > return; /* Try next object */
> > > + raw_spin_lock_irq(&kmemleak_lock);
> > > + if (object->del_state & DELSTATE_REMOVED)
> > > + goto unlock_put; /* Object removed */
> > > + object->del_state |= DELSTATE_NO_DELETE;
> > > + raw_spin_unlock_irq(&kmemleak_lock);
> > > +
> > > rcu_read_unlock();
> > > cond_resched();
> > > rcu_read_lock();
> > > +
> > > + raw_spin_lock_irq(&kmemleak_lock);
> > > + if (object->del_state & DELSTATE_REMOVED)
> > > + list_del_rcu(&object->object_list);
> > > + object->del_state &= ~DELSTATE_NO_DELETE;
> > > +unlock_put:
> > > + raw_spin_unlock_irq(&kmemleak_lock);
> > > put_object(object);
> > > }
> > I'm not sure this was the only problem. We do have the problem that the
> > current object may be removed from the list, solved above, but another
> > scenario I had in mind is the next object being released during this
> > brief resched period. The RCU relies on object->next->next being valid
> > but, with a brief rcu_read_unlock(), the object->next could be freed,
> > reallocated, so object->next->next invalid.
>
> Looking at the following scenario,
>
> object->next => A (removed)
> A->next => B (removed)
>
> As object->next is pointing to A, A must still be allocated and not freed
> yet. Now if B is also removed, there are 2 possible case.
>
> 1) B is removed from the list after the removal of A. In that case, it is
> not possible that A is allocated, but B is freed.
>
> 2) B is removed before A. A->next can't pointed to B when it is being
> removed. Due to weak memory ordering, it is possible that another cpu can
> see A->next still pointing to B. In that case, I believe that it is still
> within the grace period where neither A or B is freed.
>
> In fact, it is no different from a regular scanning of the object list
> without ever called cond_resched().
More like thinking out loud:
The lockless RCU loop relies on object->next->next being valid within
the grace period (A not freed). Due to weak memory ordering, the looping
CPU may not observe the object->next update (removal of A) by another
CPU, so it continues to loop over it. But since we do an
rcu_read_unlock() in the middle of the loop, I don't think these
assumptions are still valid, so A may be freed.
What we need is that object->next reading for the following iteration
either sees the updated object->next (B) or it sees A but the latter
still around. I think this holds with the proposed
kmemleak_cond_resched() since we now start a new grace period with
rcu_read_lock() followed by taking and releasing kmemleak_lock. The
latter would give us the memory ordering required since removing object
A from the list does take the lock.
So yeah, you are probably right, I just find it hard to get my head
around ;). I still think it would be simpler with a single kmemleak_lock
(no object->lock) but that's more involved than a simple fix.
Assuming your (and my) reasoning above is correct:
Reviewed-by: Catalin Marinas <[email protected]>
On 1/23/23 14:24, Catalin Marinas wrote:
> On Fri, Jan 20, 2023 at 05:54:28PM -0500, Waiman Long wrote:
>> On 1/20/23 14:18, Catalin Marinas wrote:
>>>> /*
>>>> @@ -633,6 +642,7 @@ static void __create_object(unsigned long ptr, size_t size,
>>>> object->count = 0; /* white color initially */
>>>> object->jiffies = jiffies;
>>>> object->checksum = 0;
>>>> + object->del_state = 0;
>>>> /* task information */
>>>> if (in_hardirq()) {
>>>> @@ -1470,9 +1480,22 @@ static void kmemleak_cond_resched(struct kmemleak_object *object)
>>>> if (!get_object(object))
>>>> return; /* Try next object */
>>>> + raw_spin_lock_irq(&kmemleak_lock);
>>>> + if (object->del_state & DELSTATE_REMOVED)
>>>> + goto unlock_put; /* Object removed */
>>>> + object->del_state |= DELSTATE_NO_DELETE;
>>>> + raw_spin_unlock_irq(&kmemleak_lock);
>>>> +
>>>> rcu_read_unlock();
>>>> cond_resched();
>>>> rcu_read_lock();
>>>> +
>>>> + raw_spin_lock_irq(&kmemleak_lock);
>>>> + if (object->del_state & DELSTATE_REMOVED)
>>>> + list_del_rcu(&object->object_list);
>>>> + object->del_state &= ~DELSTATE_NO_DELETE;
>>>> +unlock_put:
>>>> + raw_spin_unlock_irq(&kmemleak_lock);
>>>> put_object(object);
>>>> }
>>> I'm not sure this was the only problem. We do have the problem that the
>>> current object may be removed from the list, solved above, but another
>>> scenario I had in mind is the next object being released during this
>>> brief resched period. The RCU relies on object->next->next being valid
>>> but, with a brief rcu_read_unlock(), the object->next could be freed,
>>> reallocated, so object->next->next invalid.
>> Looking at the following scenario,
>>
>> object->next => A (removed)
>> A->next => B (removed)
>>
>> As object->next is pointing to A, A must still be allocated and not freed
>> yet. Now if B is also removed, there are 2 possible case.
>>
>> 1) B is removed from the list after the removal of A. In that case, it is
>> not possible that A is allocated, but B is freed.
>>
>> 2) B is removed before A. A->next can't pointed to B when it is being
>> removed. Due to weak memory ordering, it is possible that another cpu can
>> see A->next still pointing to B. In that case, I believe that it is still
>> within the grace period where neither A or B is freed.
>>
>> In fact, it is no different from a regular scanning of the object list
>> without ever called cond_resched().
> More like thinking out loud:
>
> The lockless RCU loop relies on object->next->next being valid within
> the grace period (A not freed). Due to weak memory ordering, the looping
> CPU may not observe the object->next update (removal of A) by another
> CPU, so it continues to loop over it. But since we do an
> rcu_read_unlock() in the middle of the loop, I don't think these
> assumptions are still valid, so A may be freed.
>
> What we need is that object->next reading for the following iteration
> either sees the updated object->next (B) or it sees A but the latter
> still around. I think this holds with the proposed
> kmemleak_cond_resched() since we now start a new grace period with
> rcu_read_lock() followed by taking and releasing kmemleak_lock. The
> latter would give us the memory ordering required since removing object
> A from the list does take the lock.
>
> So yeah, you are probably right, I just find it hard to get my head
> around ;). I still think it would be simpler with a single kmemleak_lock
> (no object->lock) but that's more involved than a simple fix.
>
> Assuming your (and my) reasoning above is correct:
>
> Reviewed-by: Catalin Marinas <[email protected]>
I should have mentioned the fact that taking the kmemleak_lock will post
some ordering guarantee since it is done after a new rcu_read_lock(). So
yes, even if both A and B are removed from the object_list, they should
still be around and not freed yet.
Thanks for your review.
Cheers,
Longman