__list_lru_walk_one() can hold the spin lock for a long time
if there are many entries to be isolated.
This results in a "BUG: spinlock lockup suspected" splat in the below path -
[<ffffff8eca0fb0bc>] spin_bug+0x90
[<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
[<ffffff8ecafb7798>] _raw_spin_lock+0x28
[<ffffff8eca1ae884>] list_lru_add+0x28
[<ffffff8eca1f5dac>] dput+0x1c8
[<ffffff8eca1eb46c>] path_put+0x20
[<ffffff8eca1eb73c>] terminate_walk+0x3c
[<ffffff8eca1eee58>] path_lookupat+0x100
[<ffffff8eca1f00fc>] filename_lookup+0x6c
[<ffffff8eca1f0264>] user_path_at_empty+0x54
[<ffffff8eca1e066c>] SyS_faccessat+0xd0
[<ffffff8eca084e30>] el0_svc_naked+0x24
This nlru->lock has been acquired by another CPU in this path -
[<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
[<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
[<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
[<ffffff8eca1aec34>] list_lru_walk_node+0x40
[<ffffff8eca1f6620>] shrink_dcache_sb+0x60
[<ffffff8eca1e56a8>] do_remount_sb+0xbc
[<ffffff8eca1e583c>] do_emergency_remount+0xb0
[<ffffff8eca0ba510>] process_one_work+0x228
[<ffffff8eca0bb158>] worker_thread+0x2e0
[<ffffff8eca0c040c>] kthread+0xf4
[<ffffff8eca084dd0>] ret_from_fork+0x10
Link: http://marc.info/?t=149511514800002&r=1&w=2
Fix-suggested-by: Jan Kara <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
---
mm/list_lru.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 5d8dffd..1af0709 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -249,6 +249,8 @@ restart:
default:
BUG();
}
+ if (cond_resched_lock(&nlru->lock))
+ goto restart;
}
spin_unlock(&nlru->lock);
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
On Mon 12-06-17 06:17:20, Sahitya Tummala wrote:
> __list_lru_walk_one() can hold the spin lock for longer duration
> if there are more number of entries to be isolated.
>
> This results in "BUG: spinlock lockup suspected" in the below path -
>
> [<ffffff8eca0fb0bc>] spin_bug+0x90
> [<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
> [<ffffff8ecafb7798>] _raw_spin_lock+0x28
> [<ffffff8eca1ae884>] list_lru_add+0x28
> [<ffffff8eca1f5dac>] dput+0x1c8
> [<ffffff8eca1eb46c>] path_put+0x20
> [<ffffff8eca1eb73c>] terminate_walk+0x3c
> [<ffffff8eca1eee58>] path_lookupat+0x100
> [<ffffff8eca1f00fc>] filename_lookup+0x6c
> [<ffffff8eca1f0264>] user_path_at_empty+0x54
> [<ffffff8eca1e066c>] SyS_faccessat+0xd0
> [<ffffff8eca084e30>] el0_svc_naked+0x24
>
> This nlru->lock has been acquired by another CPU in this path -
>
> [<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
> [<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
> [<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
> [<ffffff8eca1aec34>] list_lru_walk_node+0x40
> [<ffffff8eca1f6620>] shrink_dcache_sb+0x60
> [<ffffff8eca1e56a8>] do_remount_sb+0xbc
> [<ffffff8eca1e583c>] do_emergency_remount+0xb0
> [<ffffff8eca0ba510>] process_one_work+0x228
> [<ffffff8eca0bb158>] worker_thread+0x2e0
> [<ffffff8eca0c040c>] kthread+0xf4
> [<ffffff8eca084dd0>] ret_from_fork+0x10
>
> Link: http://marc.info/?t=149511514800002&r=1&w=2
> Fix-suggested-by: Jan kara <[email protected]>
> Signed-off-by: Sahitya Tummala <[email protected]>
Looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> mm/list_lru.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5d8dffd..1af0709 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -249,6 +249,8 @@ restart:
> default:
> BUG();
> }
> + if (cond_resched_lock(&nlru->lock))
> + goto restart;
> }
>
> spin_unlock(&nlru->lock);
> --
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, 12 Jun 2017 06:17:20 +0530 Sahitya Tummala <[email protected]> wrote:
> __list_lru_walk_one() can hold the spin lock for longer duration
> if there are more number of entries to be isolated.
>
> This results in "BUG: spinlock lockup suspected" in the below path -
>
> [<ffffff8eca0fb0bc>] spin_bug+0x90
> [<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
> [<ffffff8ecafb7798>] _raw_spin_lock+0x28
> [<ffffff8eca1ae884>] list_lru_add+0x28
> [<ffffff8eca1f5dac>] dput+0x1c8
> [<ffffff8eca1eb46c>] path_put+0x20
> [<ffffff8eca1eb73c>] terminate_walk+0x3c
> [<ffffff8eca1eee58>] path_lookupat+0x100
> [<ffffff8eca1f00fc>] filename_lookup+0x6c
> [<ffffff8eca1f0264>] user_path_at_empty+0x54
> [<ffffff8eca1e066c>] SyS_faccessat+0xd0
> [<ffffff8eca084e30>] el0_svc_naked+0x24
>
> This nlru->lock has been acquired by another CPU in this path -
>
> [<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
> [<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
> [<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
> [<ffffff8eca1aec34>] list_lru_walk_node+0x40
> [<ffffff8eca1f6620>] shrink_dcache_sb+0x60
> [<ffffff8eca1e56a8>] do_remount_sb+0xbc
> [<ffffff8eca1e583c>] do_emergency_remount+0xb0
> [<ffffff8eca0ba510>] process_one_work+0x228
> [<ffffff8eca0bb158>] worker_thread+0x2e0
> [<ffffff8eca0c040c>] kthread+0xf4
> [<ffffff8eca084dd0>] ret_from_fork+0x10
>
> Link: http://marc.info/?t=149511514800002&r=1&w=2
> Fix-suggested-by: Jan kara <[email protected]>
> Signed-off-by: Sahitya Tummala <[email protected]>
> ---
> mm/list_lru.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 5d8dffd..1af0709 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -249,6 +249,8 @@ restart:
> default:
> BUG();
> }
> + if (cond_resched_lock(&nlru->lock))
> + goto restart;
> }
>
> spin_unlock(&nlru->lock);
This is rather worrying.
a) Why are we spending so long holding that lock that this is occurring?
b) With this patch, we're restarting the entire scan. Are there
situations in which this loop will never terminate, or will take a
very long time? Suppose that this process is getting rescheds
blasted at it for some reason?
IOW this looks like a bit of a band-aid and a deeper analysis and
understanding might be needed.
On 6/16/2017 2:35 AM, Andrew Morton wrote:
> diff --git a/mm/list_lru.c b/mm/list_lru.c
>> index 5d8dffd..1af0709 100644
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -249,6 +249,8 @@ restart:
>> default:
>> BUG();
>> }
>> + if (cond_resched_lock(&nlru->lock))
>> + goto restart;
>> }
>>
>> spin_unlock(&nlru->lock);
> This is rather worrying.
>
> a) Why are we spending so long holding that lock that this is occurring?
At the time of the crash, I see that __list_lru_walk_one() shows the number
of entries isolated as 1774475, with nr_items still pending as 130748. On my
system, I see that for 100000 dentries it takes around 75ms for
__list_lru_walk_one() to complete. So for a total of 1900000 dentries, as
in the issue scenario, it will take up to 1425ms, which explains why the spin
lockup condition got hit on the other CPU.
It looks like __list_lru_walk_one() is expected to take longer when
more dentries are present, and I think it is a valid
scenario to have that many dentries.
> b) With this patch, we're restarting the entire scan. Are there
> situations in which this loop will never terminate, or will take a
> very long time? Suppose that this process is getting rescheds
> blasted at it for some reason?
In the above scenario, I observed that the dentry entries are removed
from the lru list all the time, i.e. LRU_REMOVED is returned from the
isolate (dentry_lru_isolate()) callback. I don't know if there is any case
where we skip several entries in the lru list and restart several times due
to this cond_resched_lock(). This can happen even with the existing code
if LRU_RETRY is returned often from the isolate callback.
> IOW this looks like a bit of a band-aid and a deeper analysis and
> understanding might be needed.
Hello,
On Thu, Jun 15, 2017 at 02:05:23PM -0700, Andrew Morton wrote:
> On Mon, 12 Jun 2017 06:17:20 +0530 Sahitya Tummala <[email protected]> wrote:
>
> > __list_lru_walk_one() can hold the spin lock for longer duration
> > if there are more number of entries to be isolated.
> >
> > This results in "BUG: spinlock lockup suspected" in the below path -
> >
> > [<ffffff8eca0fb0bc>] spin_bug+0x90
> > [<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
> > [<ffffff8ecafb7798>] _raw_spin_lock+0x28
> > [<ffffff8eca1ae884>] list_lru_add+0x28
> > [<ffffff8eca1f5dac>] dput+0x1c8
> > [<ffffff8eca1eb46c>] path_put+0x20
> > [<ffffff8eca1eb73c>] terminate_walk+0x3c
> > [<ffffff8eca1eee58>] path_lookupat+0x100
> > [<ffffff8eca1f00fc>] filename_lookup+0x6c
> > [<ffffff8eca1f0264>] user_path_at_empty+0x54
> > [<ffffff8eca1e066c>] SyS_faccessat+0xd0
> > [<ffffff8eca084e30>] el0_svc_naked+0x24
> >
> > This nlru->lock has been acquired by another CPU in this path -
> >
> > [<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
> > [<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
> > [<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
> > [<ffffff8eca1aec34>] list_lru_walk_node+0x40
> > [<ffffff8eca1f6620>] shrink_dcache_sb+0x60
> > [<ffffff8eca1e56a8>] do_remount_sb+0xbc
> > [<ffffff8eca1e583c>] do_emergency_remount+0xb0
> > [<ffffff8eca0ba510>] process_one_work+0x228
> > [<ffffff8eca0bb158>] worker_thread+0x2e0
> > [<ffffff8eca0c040c>] kthread+0xf4
> > [<ffffff8eca084dd0>] ret_from_fork+0x10
> >
> > Link: http://marc.info/?t=149511514800002&r=1&w=2
> > Fix-suggested-by: Jan kara <[email protected]>
> > Signed-off-by: Sahitya Tummala <[email protected]>
> > ---
> > mm/list_lru.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/list_lru.c b/mm/list_lru.c
> > index 5d8dffd..1af0709 100644
> > --- a/mm/list_lru.c
> > +++ b/mm/list_lru.c
> > @@ -249,6 +249,8 @@ restart:
> > default:
> > BUG();
> > }
> > + if (cond_resched_lock(&nlru->lock))
> > + goto restart;
> > }
> >
> > spin_unlock(&nlru->lock);
>
> This is rather worrying.
>
> a) Why are we spending so long holding that lock that this is occurring?
>
> b) With this patch, we're restarting the entire scan. Are there
> situations in which this loop will never terminate, or will take a
> very long time? Suppose that this process is getting rescheds
> blasted at it for some reason?
>
> IOW this looks like a bit of a band-aid and a deeper analysis and
> understanding might be needed.
The goal of list_lru_walk is removing inactive entries from the lru list
(LRU_REMOVED). Memory shrinkers may also choose to move active entries
to the tail of the lru list (LRU_ROTATED). LRU_SKIP is supposed to be
returned only to avoid a possible deadlock. So I don't see how
restarting lru walk could have adverse effects.
However, I do find this patch kinda ugly, because:
- list_lru_walk already gives you a way to avoid a lockup - just make
the callback reschedule and return LRU_RETRY every now and then, see
shadow_lru_isolate() for an example. Alternatively, you can limit the
number of entries scanned in one go (nr_to_walk) and reschedule
between calls. This is what shrink_slab() does: the number of
dentries scanned without releasing the lock is limited to 1024, see
how super_block::s_shrink is initialized.
- Someone might want to call list_lru_walk with a spin lock held, and I
don't see anything wrong in doing that. With your patch it can't be
done anymore.
That said, I think it would be better to patch shrink_dcache_sb() or
dentry_lru_isolate_shrink() instead of list_lru_walk() in order to fix
this lockup.
Hello,
On 6/17/2017 4:44 PM, Vladimir Davydov wrote:
>
> That said, I think it would be better to patch shrink_dcache_sb() or
> dentry_lru_isolate_shrink() instead of list_lru_walk() in order to fix
> this lockup.
Thanks for the review. I will enhance the patch as per your suggestion.
__list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for
a long time if there are many items in the lru list.
As per the current code, it can hold the spin lock for up to
UINT_MAX entries at a time. So if there are many items in
the lru list, then "BUG: spinlock lockup suspected" is observed in
the below path -
[<ffffff8eca0fb0bc>] spin_bug+0x90
[<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
[<ffffff8ecafb7798>] _raw_spin_lock+0x28
[<ffffff8eca1ae884>] list_lru_add+0x28
[<ffffff8eca1f5dac>] dput+0x1c8
[<ffffff8eca1eb46c>] path_put+0x20
[<ffffff8eca1eb73c>] terminate_walk+0x3c
[<ffffff8eca1eee58>] path_lookupat+0x100
[<ffffff8eca1f00fc>] filename_lookup+0x6c
[<ffffff8eca1f0264>] user_path_at_empty+0x54
[<ffffff8eca1e066c>] SyS_faccessat+0xd0
[<ffffff8eca084e30>] el0_svc_naked+0x24
This nlru->lock is acquired by another CPU in this path -
[<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
[<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
[<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
[<ffffff8eca1aec34>] list_lru_walk_node+0x40
[<ffffff8eca1f6620>] shrink_dcache_sb+0x60
[<ffffff8eca1e56a8>] do_remount_sb+0xbc
[<ffffff8eca1e583c>] do_emergency_remount+0xb0
[<ffffff8eca0ba510>] process_one_work+0x228
[<ffffff8eca0bb158>] worker_thread+0x2e0
[<ffffff8eca0c040c>] kthread+0xf4
[<ffffff8eca084dd0>] ret_from_fork+0x10
Fix this lockup by limiting the number of entries shrunk
from the lru list to 1024 at a time. Also, add cond_resched() before
processing the lru list again.
Link: http://marc.info/?t=149722864900001&r=1&w=2
Fix-suggested-by: Jan Kara <[email protected]>
Fix-suggested-by: Vladimir Davydov <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
---
v2: patch shrink_dcache_sb() instead of list_lru_walk()
---
fs/dcache.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index cddf397..c8ca150 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
LIST_HEAD(dispose);
freed = list_lru_walk(&sb->s_dentry_lru,
- dentry_lru_isolate_shrink, &dispose, UINT_MAX);
+ dentry_lru_isolate_shrink, &dispose, 1024);
this_cpu_sub(nr_dentry_unused, freed);
shrink_dentry_list(&dispose);
+ cond_resched();
} while (freed > 0);
}
EXPORT_SYMBOL(shrink_dcache_sb);
On Wed, Jun 21, 2017 at 12:09:15PM +0530, Sahitya Tummala wrote:
> __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for
> longer duration if there are more number of items in the lru list.
> As per the current code, it can hold the spin lock for upto maximum
> UINT_MAX entries at a time. So if there are more number of items in
> the lru list, then "BUG: spinlock lockup suspected" is observed in
> the below path -
>
> [<ffffff8eca0fb0bc>] spin_bug+0x90
> [<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
> [<ffffff8ecafb7798>] _raw_spin_lock+0x28
> [<ffffff8eca1ae884>] list_lru_add+0x28
> [<ffffff8eca1f5dac>] dput+0x1c8
> [<ffffff8eca1eb46c>] path_put+0x20
> [<ffffff8eca1eb73c>] terminate_walk+0x3c
> [<ffffff8eca1eee58>] path_lookupat+0x100
> [<ffffff8eca1f00fc>] filename_lookup+0x6c
> [<ffffff8eca1f0264>] user_path_at_empty+0x54
> [<ffffff8eca1e066c>] SyS_faccessat+0xd0
> [<ffffff8eca084e30>] el0_svc_naked+0x24
>
> This nlru->lock is acquired by another CPU in this path -
>
> [<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
> [<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
> [<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
> [<ffffff8eca1aec34>] list_lru_walk_node+0x40
> [<ffffff8eca1f6620>] shrink_dcache_sb+0x60
> [<ffffff8eca1e56a8>] do_remount_sb+0xbc
> [<ffffff8eca1e583c>] do_emergency_remount+0xb0
> [<ffffff8eca0ba510>] process_one_work+0x228
> [<ffffff8eca0bb158>] worker_thread+0x2e0
> [<ffffff8eca0c040c>] kthread+0xf4
> [<ffffff8eca084dd0>] ret_from_fork+0x10
>
> Fix this lockup by reducing the number of entries to be shrinked
> from the lru list to 1024 at once. Also, add cond_resched() before
> processing the lru list again.
>
> Link: http://marc.info/?t=149722864900001&r=1&w=2
> Fix-suggested-by: Jan kara <[email protected]>
> Fix-suggested-by: Vladimir Davydov <[email protected]>
> Signed-off-by: Sahitya Tummala <[email protected]>
> ---
> v2: patch shrink_dcache_sb() instead of list_lru_walk()
> ---
> fs/dcache.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index cddf397..c8ca150 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
> LIST_HEAD(dispose);
>
> freed = list_lru_walk(&sb->s_dentry_lru,
> - dentry_lru_isolate_shrink, &dispose, UINT_MAX);
> + dentry_lru_isolate_shrink, &dispose, 1024);
>
> this_cpu_sub(nr_dentry_unused, freed);
> shrink_dentry_list(&dispose);
> + cond_resched();
> } while (freed > 0);
In an extreme case, a single invocation of list_lru_walk() can skip all
1024 dentries, in which case 'freed' will be 0 forcing us to break the
loop prematurely. I think we should loop until there's at least one
dentry left on the LRU, i.e.
while (list_lru_count(&sb->s_dentry_lru) > 0)
However, even that wouldn't be quite correct, because list_lru_count()
iterates over all memory cgroups to sum list_lru_one->nr_items, which
can race with memcg offlining code migrating dentries off a dead cgroup
(see memcg_drain_all_list_lrus()). So it looks like to make this check
race-free, we need to account the number of entries on the LRU not only
per memcg, but also per node, i.e. add list_lru_node->nr_items.
Fortunately, list_lru entries can't be migrated between NUMA nodes.
> }
> EXPORT_SYMBOL(shrink_dcache_sb);
On 6/21/2017 10:01 PM, Vladimir Davydov wrote:
>
>> index cddf397..c8ca150 100644
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
>> LIST_HEAD(dispose);
>>
>> freed = list_lru_walk(&sb->s_dentry_lru,
>> - dentry_lru_isolate_shrink, &dispose, UINT_MAX);
>> + dentry_lru_isolate_shrink, &dispose, 1024);
>>
>> this_cpu_sub(nr_dentry_unused, freed);
>> shrink_dentry_list(&dispose);
>> + cond_resched();
>> } while (freed > 0);
> In an extreme case, a single invocation of list_lru_walk() can skip all
> 1024 dentries, in which case 'freed' will be 0 forcing us to break the
> loop prematurely. I think we should loop until there's at least one
> dentry left on the LRU, i.e.
>
> while (list_lru_count(&sb->s_dentry_lru) > 0)
>
> However, even that wouldn't be quite correct, because list_lru_count()
> iterates over all memory cgroups to sum list_lru_one->nr_items, which
> can race with memcg offlining code migrating dentries off a dead cgroup
> (see memcg_drain_all_list_lrus()). So it looks like to make this check
> race-free, we need to account the number of entries on the LRU not only
> per memcg, but also per node, i.e. add list_lru_node->nr_items.
> Fortunately, list_lru entries can't be migrated between NUMA nodes.
It looks like list_lru_count() is iterating per node before iterating
over all memory cgroups, as below -
unsigned long list_lru_count_node(struct list_lru *lru, int nid)
{
	long count = 0;
	int memcg_idx;

	count += __list_lru_count_one(lru, nid, -1);
	if (list_lru_memcg_aware(lru)) {
		for_each_memcg_cache_index(memcg_idx)
			count += __list_lru_count_one(lru, nid, memcg_idx);
	}
	return count;
}
The first call to __list_lru_count_one() is iterating all the items per
node, i.e. nlru->lru->nr_items.
Is my understanding correct? If not, could you please clarify how to
get the lru items per node?
On Thu, Jun 22, 2017 at 10:01:39PM +0530, Sahitya Tummala wrote:
>
>
> On 6/21/2017 10:01 PM, Vladimir Davydov wrote:
> >
> >>index cddf397..c8ca150 100644
> >>--- a/fs/dcache.c
> >>+++ b/fs/dcache.c
> >>@@ -1133,10 +1133,11 @@ void shrink_dcache_sb(struct super_block *sb)
> >> LIST_HEAD(dispose);
> >> freed = list_lru_walk(&sb->s_dentry_lru,
> >>- dentry_lru_isolate_shrink, &dispose, UINT_MAX);
> >>+ dentry_lru_isolate_shrink, &dispose, 1024);
> >> this_cpu_sub(nr_dentry_unused, freed);
> >> shrink_dentry_list(&dispose);
> >>+ cond_resched();
> >> } while (freed > 0);
> >In an extreme case, a single invocation of list_lru_walk() can skip all
> >1024 dentries, in which case 'freed' will be 0 forcing us to break the
> >loop prematurely. I think we should loop until there's at least one
> >dentry left on the LRU, i.e.
> >
> > while (list_lru_count(&sb->s_dentry_lru) > 0)
> >
> >However, even that wouldn't be quite correct, because list_lru_count()
> >iterates over all memory cgroups to sum list_lru_one->nr_items, which
> >can race with memcg offlining code migrating dentries off a dead cgroup
> >(see memcg_drain_all_list_lrus()). So it looks like to make this check
> >race-free, we need to account the number of entries on the LRU not only
> >per memcg, but also per node, i.e. add list_lru_node->nr_items.
> >Fortunately, list_lru entries can't be migrated between NUMA nodes.
> It looks like list_lru_count() is iterating per node before iterating over
> all memory
> cgroups as below -
>
> unsigned long list_lru_count_node(struct list_lru *lru, int nid)
> {
> long count = 0;
> int memcg_idx;
>
> count += __list_lru_count_one(lru, nid, -1);
> if (list_lru_memcg_aware(lru)) {
> for_each_memcg_cache_index(memcg_idx)
> count += __list_lru_count_one(lru, nid, memcg_idx);
> }
> return count;
> }
>
> The first call to __list_lru_count_one() is iterating all the items per node
> i.e, nlru->lru->nr_items.
lru->node[nid].lru.nr_items returned by __list_lru_count_one(lru, nid, -1)
only counts items accounted to the root cgroup, not the total number of
entries on the node.
> Is my understanding correct? If not, could you please clarify on how to get
> the lru items per node?
What I mean is iterating over list_lru_node->memcg_lrus to count the
number of entries on the node is racy. For example, suppose you have
three cgroups with the following values of list_lru_one->nr_items:
0 0 10
While list_lru_count_node() is at #1, cgroup #2 is offlined and its
list_lru_one is drained, i.e. its entries are migrated to the parent
cgroup, which happens to be #0, i.e. we see the following picture:
10 0 0
^^^
memcg_ids points here in list_lru_count_node()
Then the count returned by list_lru_count_node() will be 0, although
there are still 10 entries on the list.
To avoid this race, we could keep list_lru_node->lock locked while
walking over list_lru_node->memcg_lrus, but that's too heavy. I'd prefer
adding list_lru_node->nr_count, which would be equal to the total number
of list_lru entries on the node, i.e. the sum of list_lru_node->lru.nr_items
and list_lru_node->memcg_lrus->lru[]->nr_items.
list_lru_count_node() iterates over all memcgs to get
the total number of entries on the node, but it can race with
memcg_drain_all_list_lrus(), which migrates the entries from
a dead cgroup to another. This can cause list_lru_count_node()
to return an incorrect number of entries.
Fix this by keeping track of entries per node and simply returning
it in list_lru_count_node().
Signed-off-by: Sahitya Tummala <[email protected]>
---
include/linux/list_lru.h | 1 +
mm/list_lru.c | 14 ++++++--------
2 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index cb0ba9f..eff61bc 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -44,6 +44,7 @@ struct list_lru_node {
/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
struct list_lru_memcg *memcg_lrus;
#endif
+ long nr_count;
} ____cacheline_aligned_in_smp;
struct list_lru {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 234676e..d417b9f 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -117,6 +117,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
l = list_lru_from_kmem(nlru, item);
list_add_tail(item, &l->list);
l->nr_items++;
+ nlru->nr_count++;
spin_unlock(&nlru->lock);
return true;
}
@@ -136,6 +137,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
l = list_lru_from_kmem(nlru, item);
list_del_init(item);
l->nr_items--;
+ nlru->nr_count--;
spin_unlock(&nlru->lock);
return true;
}
@@ -183,15 +185,10 @@ unsigned long list_lru_count_one(struct list_lru *lru,
unsigned long list_lru_count_node(struct list_lru *lru, int nid)
{
- long count = 0;
- int memcg_idx;
+ struct list_lru_node *nlru;
- count += __list_lru_count_one(lru, nid, -1);
- if (list_lru_memcg_aware(lru)) {
- for_each_memcg_cache_index(memcg_idx)
- count += __list_lru_count_one(lru, nid, memcg_idx);
- }
- return count;
+ nlru = &lru->node[nid];
+ return nlru->nr_count;
}
EXPORT_SYMBOL_GPL(list_lru_count_node);
@@ -226,6 +223,7 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
isolated++;
+ nlru->nr_count--;
/*
* If the lru lock has been dropped, our list
* traversal is now invalid and so we have to
__list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for
a long time if there are many items in the lru list.
As per the current code, it can hold the spin lock for up to
UINT_MAX entries at a time. So if there are many items in
the lru list, then "BUG: spinlock lockup suspected" is observed in
the below path -
[<ffffff8eca0fb0bc>] spin_bug+0x90
[<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
[<ffffff8ecafb7798>] _raw_spin_lock+0x28
[<ffffff8eca1ae884>] list_lru_add+0x28
[<ffffff8eca1f5dac>] dput+0x1c8
[<ffffff8eca1eb46c>] path_put+0x20
[<ffffff8eca1eb73c>] terminate_walk+0x3c
[<ffffff8eca1eee58>] path_lookupat+0x100
[<ffffff8eca1f00fc>] filename_lookup+0x6c
[<ffffff8eca1f0264>] user_path_at_empty+0x54
[<ffffff8eca1e066c>] SyS_faccessat+0xd0
[<ffffff8eca084e30>] el0_svc_naked+0x24
This nlru->lock is acquired by another CPU in this path -
[<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
[<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
[<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
[<ffffff8eca1aec34>] list_lru_walk_node+0x40
[<ffffff8eca1f6620>] shrink_dcache_sb+0x60
[<ffffff8eca1e56a8>] do_remount_sb+0xbc
[<ffffff8eca1e583c>] do_emergency_remount+0xb0
[<ffffff8eca0ba510>] process_one_work+0x228
[<ffffff8eca0bb158>] worker_thread+0x2e0
[<ffffff8eca0c040c>] kthread+0xf4
[<ffffff8eca084dd0>] ret_from_fork+0x10
Fix this lockup by limiting the number of entries shrunk
from the lru list to 1024 at a time. Also, add cond_resched() before
processing the lru list again.
Link: http://marc.info/?t=149722864900001&r=1&w=2
Fix-suggested-by: Jan Kara <[email protected]>
Fix-suggested-by: Vladimir Davydov <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
---
v3: use list_lru_count() instead of freed in while loop to
cover an extreme case where a single invocation of list_lru_walk()
can skip all 1024 dentries, in which case 'freed' will be 0 forcing
us to break the loop prematurely.
v2: patch shrink_dcache_sb() instead of list_lru_walk()
---
fs/dcache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index a9f995f..1161390 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1133,11 +1133,12 @@ void shrink_dcache_sb(struct super_block *sb)
LIST_HEAD(dispose);
freed = list_lru_walk(&sb->s_dentry_lru,
- dentry_lru_isolate_shrink, &dispose, UINT_MAX);
+ dentry_lru_isolate_shrink, &dispose, 1024);
this_cpu_sub(nr_dentry_unused, freed);
shrink_dentry_list(&dispose);
- } while (freed > 0);
+ cond_resched();
+ } while (list_lru_count(&sb->s_dentry_lru) > 0);
}
EXPORT_SYMBOL(shrink_dcache_sb);
On Wed, Jun 28, 2017 at 11:37:23AM +0530, Sahitya Tummala wrote:
> list_lru_count_node() iterates over all memcgs to get
> the total number of entries on the node but it can race with
> memcg_drain_all_list_lrus(), which migrates the entries from
> a dead cgroup to another. This can return incorrect number of
> entries from list_lru_count_node().
>
> Fix this by keeping track of entries per node and simply return
> it in list_lru_count_node().
>
> Signed-off-by: Sahitya Tummala <[email protected]>
> ---
> include/linux/list_lru.h | 1 +
> mm/list_lru.c | 14 ++++++--------
> 2 files changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index cb0ba9f..eff61bc 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -44,6 +44,7 @@ struct list_lru_node {
> /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
> struct list_lru_memcg *memcg_lrus;
> #endif
> + long nr_count;
'nr_count' sounds awkward. I think it should be called 'nr_items'.
Other than that, looks good to me.
list_lru_count_node() iterates over all memcgs to get
the total number of entries on the node, but it can race with
memcg_drain_all_list_lrus(), which migrates the entries from
a dead cgroup to another. This can cause list_lru_count_node()
to return an incorrect number of entries.
Fix this by keeping track of entries per node and simply returning
it in list_lru_count_node().
Signed-off-by: Sahitya Tummala <[email protected]>
---
include/linux/list_lru.h | 1 +
mm/list_lru.c | 14 ++++++--------
2 files changed, 7 insertions(+), 8 deletions(-)
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index cb0ba9f..fa7fd03 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -44,6 +44,7 @@ struct list_lru_node {
/* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
struct list_lru_memcg *memcg_lrus;
#endif
+ long nr_items;
} ____cacheline_aligned_in_smp;
struct list_lru {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 234676e..7a40fa2 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -117,6 +117,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
l = list_lru_from_kmem(nlru, item);
list_add_tail(item, &l->list);
l->nr_items++;
+ nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
}
@@ -136,6 +137,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
l = list_lru_from_kmem(nlru, item);
list_del_init(item);
l->nr_items--;
+ nlru->nr_items--;
spin_unlock(&nlru->lock);
return true;
}
@@ -183,15 +185,10 @@ unsigned long list_lru_count_one(struct list_lru *lru,
unsigned long list_lru_count_node(struct list_lru *lru, int nid)
{
- long count = 0;
- int memcg_idx;
+ struct list_lru_node *nlru;
- count += __list_lru_count_one(lru, nid, -1);
- if (list_lru_memcg_aware(lru)) {
- for_each_memcg_cache_index(memcg_idx)
- count += __list_lru_count_one(lru, nid, memcg_idx);
- }
- return count;
+ nlru = &lru->node[nid];
+ return nlru->nr_items;
}
EXPORT_SYMBOL_GPL(list_lru_count_node);
@@ -226,6 +223,7 @@ unsigned long list_lru_count_node(struct list_lru *lru, int nid)
assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
isolated++;
+ nlru->nr_items--;
/*
* If the lru lock has been dropped, our list
* traversal is now invalid and so we have to
__list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for
a long time if there are many items in the lru list.
As per the current code, it can hold the spin lock for up to
UINT_MAX entries at a time. So if there are many items in
the lru list, then "BUG: spinlock lockup suspected" is observed in
the below path -
[<ffffff8eca0fb0bc>] spin_bug+0x90
[<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
[<ffffff8ecafb7798>] _raw_spin_lock+0x28
[<ffffff8eca1ae884>] list_lru_add+0x28
[<ffffff8eca1f5dac>] dput+0x1c8
[<ffffff8eca1eb46c>] path_put+0x20
[<ffffff8eca1eb73c>] terminate_walk+0x3c
[<ffffff8eca1eee58>] path_lookupat+0x100
[<ffffff8eca1f00fc>] filename_lookup+0x6c
[<ffffff8eca1f0264>] user_path_at_empty+0x54
[<ffffff8eca1e066c>] SyS_faccessat+0xd0
[<ffffff8eca084e30>] el0_svc_naked+0x24
This nlru->lock is acquired by another CPU in this path -
[<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
[<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
[<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
[<ffffff8eca1aec34>] list_lru_walk_node+0x40
[<ffffff8eca1f6620>] shrink_dcache_sb+0x60
[<ffffff8eca1e56a8>] do_remount_sb+0xbc
[<ffffff8eca1e583c>] do_emergency_remount+0xb0
[<ffffff8eca0ba510>] process_one_work+0x228
[<ffffff8eca0bb158>] worker_thread+0x2e0
[<ffffff8eca0c040c>] kthread+0xf4
[<ffffff8eca084dd0>] ret_from_fork+0x10
Fix this lockup by limiting the number of entries shrunk from the
lru list to 1024 per pass, and by calling cond_resched() before
processing the lru list again.
Link: http://marc.info/?t=149722864900001&r=1&w=2
Fix-suggested-by: Jan Kara <[email protected]>
Fix-suggested-by: Vladimir Davydov <[email protected]>
Signed-off-by: Sahitya Tummala <[email protected]>
---
fs/dcache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index a9f995f..1161390 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1133,11 +1133,12 @@ void shrink_dcache_sb(struct super_block *sb)
LIST_HEAD(dispose);
freed = list_lru_walk(&sb->s_dentry_lru,
- dentry_lru_isolate_shrink, &dispose, UINT_MAX);
+ dentry_lru_isolate_shrink, &dispose, 1024);
this_cpu_sub(nr_dentry_unused, freed);
shrink_dentry_list(&dispose);
- } while (freed > 0);
+ cond_resched();
+ } while (list_lru_count(&sb->s_dentry_lru) > 0);
}
EXPORT_SYMBOL(shrink_dcache_sb);
--
On Thu, 29 Jun 2017 09:09:35 +0530 Sahitya Tummala <[email protected]> wrote:
> __list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for
> a long duration if there are many items in the lru list. As per the
> current code, it can hold the spin lock for up to UINT_MAX entries
> at a time. So if the lru list has many items, a "BUG: spinlock
> lockup suspected" is observed in the below path -
>
> ...
>
> Fix this lockup by limiting the number of entries shrunk from the
> lru list to 1024 per pass, and by calling cond_resched() before
> processing the lru list again.
>
> ...
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1133,11 +1133,12 @@ void shrink_dcache_sb(struct super_block *sb)
> LIST_HEAD(dispose);
>
> freed = list_lru_walk(&sb->s_dentry_lru,
> - dentry_lru_isolate_shrink, &dispose, UINT_MAX);
> + dentry_lru_isolate_shrink, &dispose, 1024);
>
> this_cpu_sub(nr_dentry_unused, freed);
> shrink_dentry_list(&dispose);
> - } while (freed > 0);
> + cond_resched();
> + } while (list_lru_count(&sb->s_dentry_lru) > 0);
> }
> EXPORT_SYMBOL(shrink_dcache_sb);
I'll add a cc:stable to this one - a large dentry list is a relatively
common thing.
I'm assuming that [1/2] does not need to be backported, OK?
On 6/30/2017 4:18 AM, Andrew Morton wrote:
>
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -1133,11 +1133,12 @@ void shrink_dcache_sb(struct super_block *sb)
>> LIST_HEAD(dispose);
>>
>> freed = list_lru_walk(&sb->s_dentry_lru,
>> - dentry_lru_isolate_shrink, &dispose, UINT_MAX);
>> + dentry_lru_isolate_shrink, &dispose, 1024);
>>
>> this_cpu_sub(nr_dentry_unused, freed);
>> shrink_dentry_list(&dispose);
>> - } while (freed > 0);
>> + cond_resched();
>> + } while (list_lru_count(&sb->s_dentry_lru) > 0);
>> }
>> EXPORT_SYMBOL(shrink_dcache_sb);
> I'll add a cc:stable to this one - a large dentry list is a relatively
> common thing.
>
> I'm assuming that [1/2] does not need to be backported, OK?
I think we should include [1/2] as well along with this patch, as this
patch uses list_lru_count(), which can return an incorrect count if
[1/2] is not included.
Also, all the previous patches submitted for fixing this issue must be
dropped, i.e., "mm/list_lru.c: use cond_resched_lock() for nlru->lock"
must be dropped.
--
On Thu, Jun 29, 2017 at 09:09:35AM +0530, Sahitya Tummala wrote:
> __list_lru_walk_one() acquires the nlru spin lock (nlru->lock) for
> a long duration if there are many items in the lru list. As per the
> current code, it can hold the spin lock for up to UINT_MAX entries
> at a time. So if the lru list has many items, a "BUG: spinlock
> lockup suspected" is observed in the below path -
>
> [<ffffff8eca0fb0bc>] spin_bug+0x90
> [<ffffff8eca0fb220>] do_raw_spin_lock+0xfc
> [<ffffff8ecafb7798>] _raw_spin_lock+0x28
> [<ffffff8eca1ae884>] list_lru_add+0x28
> [<ffffff8eca1f5dac>] dput+0x1c8
> [<ffffff8eca1eb46c>] path_put+0x20
> [<ffffff8eca1eb73c>] terminate_walk+0x3c
> [<ffffff8eca1eee58>] path_lookupat+0x100
> [<ffffff8eca1f00fc>] filename_lookup+0x6c
> [<ffffff8eca1f0264>] user_path_at_empty+0x54
> [<ffffff8eca1e066c>] SyS_faccessat+0xd0
> [<ffffff8eca084e30>] el0_svc_naked+0x24
>
> This nlru->lock is acquired by another CPU in this path -
>
> [<ffffff8eca1f5fd0>] d_lru_shrink_move+0x34
> [<ffffff8eca1f6180>] dentry_lru_isolate_shrink+0x48
> [<ffffff8eca1aeafc>] __list_lru_walk_one.isra.10+0x94
> [<ffffff8eca1aec34>] list_lru_walk_node+0x40
> [<ffffff8eca1f6620>] shrink_dcache_sb+0x60
> [<ffffff8eca1e56a8>] do_remount_sb+0xbc
> [<ffffff8eca1e583c>] do_emergency_remount+0xb0
> [<ffffff8eca0ba510>] process_one_work+0x228
> [<ffffff8eca0bb158>] worker_thread+0x2e0
> [<ffffff8eca0c040c>] kthread+0xf4
> [<ffffff8eca084dd0>] ret_from_fork+0x10
>
> Fix this lockup by limiting the number of entries shrunk from the
> lru list to 1024 per pass, and by calling cond_resched() before
> processing the lru list again.
>
> Link: http://marc.info/?t=149722864900001&r=1&w=2
> Fix-suggested-by: Jan Kara <[email protected]>
> Fix-suggested-by: Vladimir Davydov <[email protected]>
> Signed-off-by: Sahitya Tummala <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>
On Thu, Jun 29, 2017 at 09:09:15AM +0530, Sahitya Tummala wrote:
> list_lru_count_node() iterates over all memcgs to get the total
> number of entries on the node, but it can race with
> memcg_drain_all_list_lrus(), which migrates entries from a dead
> cgroup to another. This can make list_lru_count_node() return an
> incorrect number of entries.
>
> Fix this by keeping track of the number of entries per node and
> simply returning it in list_lru_count_node().
>
> Signed-off-by: Sahitya Tummala <[email protected]>
Acked-by: Vladimir Davydov <[email protected]>