2021-02-09 18:37:44

by Yang Shi

Subject: [v7 PATCH 0/12] Make shrinker's nr_deferred memcg aware


Changelog
v6 --> v7:
* Expanded shrinker_info in a batch of BITS_PER_LONG per Kirill.
* Added patch 06/12 to introduce a helper for dereferencing shrinker_info
per Kirill.
* Renamed set_nr_deferred_memcg to add_nr_deferred_memcg per Kirill.
* Collected Acked-by from Kirill.
v5 --> v6:
* Rebased on top of https://lore.kernel.org/linux-mm/[email protected]/
per Kirill.
* Don't register shrinker idr with NULL and remove idr_replace() per Vlastimil.
* Move nr_deferred before map to guarantee the alignment per Vlastimil.
* Misc minor code cleanup and refactor per Kirill and Vlastimil.
* Added Acked-by from Vlastimil for patch #1, #2, #3, #5, #9 and #10.
v4 --> v5:
* Incorporated the comments from Kirill.
* Rebased to v5.11-rc5.
v3 --> v4:
* Removed "memcg_" prefix for shrinker_maps related functions per Roman.
* Use write lock instead of read lock per Kirill. Also removed Johannes's ack
since write lock is used.
* Incorporated the comments from Kirill.
* Removed RFC.
* Rebased to v5.11-rc4.
v2 --> v3:
* Moved shrinker_maps related code to vmscan.c per Dave.
* Removed memcg_shrinker_map_size. Calculated the size of map via shrinker_nr_max
per Johannes.
* Consolidated shrinker_deferred with shrinker_maps into one struct per Dave.
* Simplified the nr_deferred related code.
* Dropped the memory barrier from v2.
* Moved nr_deferred reparent code to vmscan.c per Dave.
* Added test coverage information in patch #11. Dave is concerned about the
potential regression. I didn't notice any regression in my tests, but suggestions
for more test coverage are definitely welcome. Having this patch in the -mm tree
and then the linux-next tree may also help spot regressions, so I keep it in this version.
* The code cleanup and consolidation made the series grow to 11 patches.
* Rebased onto 5.11-rc2.
v1 --> v2:
* Use shrinker->flags to store the new SHRINKER_REGISTERED flag per Roman.
* Folded patch #1 into patch #6 per Roman.
* Added memory barrier to prevent shrink_slab_memcg from seeing NULL shrinker_maps/
shrinker_deferred per Kirill.
* Removed memcg_shrinker_map_mutex. Protected shrinker_map/shrinker_deferred
allocations from expand with shrinker_rwsem per Johannes.

Recently a huge one-off slab drop was seen on some vfs metadata heavy workloads.
It turned out that the shrinker saw a huge number of accumulated nr_deferred
objects.

On our production machine, I saw an absurd number of nr_deferred, as shown in the
tracing result below:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node. Assuming all of them
are dentries (192 bytes per object), the total size of the deferred objects on
one node is ~480TB. It is definitely ridiculous.
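
The estimate above can be reproduced with a few lines of user-space C. This is only an illustrative sketch: the object count is copied from the trace, the 192-byte dentry size is the assumption stated above, and the helper names are made up for this example. With the exact count the result is about 486TB, which matches the ~480TB obtained from the rounded 2.5 trillion figure.

```c
#include <assert.h>
#include <stdint.h>

/* "objects to shrink" value from the mm_shrink_slab_start trace above. */
#define DEFERRED_OBJECTS 2531805877005ULL
/* Assumed size of one dentry, as stated in the text above. */
#define DENTRY_SIZE      192ULL

/* Bytes represented by nr_deferred objects of obj_size bytes each. */
static uint64_t deferred_bytes(uint64_t nr_deferred, uint64_t obj_size)
{
	/* Guard against 64-bit overflow in the multiplication. */
	assert(obj_size == 0 || nr_deferred <= UINT64_MAX / obj_size);
	return nr_deferred * obj_size;
}

/* Convert bytes to whole decimal terabytes (10^12 bytes). */
static uint64_t bytes_to_tb(uint64_t bytes)
{
	return bytes / 1000000000000ULL;
}
```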

I managed to reproduce this problem with a kernel build workload plus a negative
dentry generator.

First, run the kernel build test script below:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

Then run the below negative dentry generator script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

Then kswapd will shrink half of the dentry cache in just one loop, as the tracing
result below shows:

kswapd0-475 [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
kswapd0-475 [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There was a huge number of deferred objects before the shrinker was called. The behavior
does match the code, but it might not be desirable from the user's point of view.

The excessive amount of nr_deferred might be accumulated for various reasons, for example:
* GFP_NOFS allocation
* A large number of small scans (< scan_batch, 1024 for vfs metadata)

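How such a backlog builds up can be illustrated with a toy user-space model. This is a deliberately simplified sketch and not the kernel's do_shrink_slab(): the struct, the function names and the can_scan parameter are hypothetical, and the only figure taken from the text is the batch size of 1024. It shows that every pass which is not allowed to scan (for example a GFP_NOFS allocation) pushes its whole delta into nr_deferred, and that a pass smaller than one batch leaves its remainder there as well.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define SCAN_BATCH 1024UL /* batch size quoted above for vfs metadata */

/* Toy stand-in for one shrinker's deferred counter on one node. */
struct toy_shrinker {
	unsigned long nr_deferred;
};

/*
 * One simplified shrinker invocation. 'delta' is the new work requested
 * by this reclaim pass; 'can_scan' is false when the shrinker has to
 * bail out (for example in a GFP_NOFS context).
 * Returns the number of objects scanned.
 */
static unsigned long toy_shrink(struct toy_shrinker *s, unsigned long delta,
				bool can_scan)
{
	unsigned long total_scan, scanned = 0;

	assert(delta <= ULONG_MAX - s->nr_deferred);
	total_scan = s->nr_deferred + delta;

	/* Work is done in whole batches only, and only when allowed. */
	while (can_scan && total_scan - scanned >= SCAN_BATCH)
		scanned += SCAN_BATCH;

	/* Whatever was not scanned is carried over to the next call. */
	s->nr_deferred = total_scan - scanned;
	return scanned;
}

/* Backlog left after 'calls' passes that were all unable to scan. */
static unsigned long toy_backlog_after(unsigned long calls, unsigned long delta)
{
	struct toy_shrinker s = { 0 };

	for (unsigned long i = 0; i < calls; i++)
		toy_shrink(&s, delta, false);
	return s.nr_deferred;
}
```

In this model the first caller that is allowed to scan inherits the whole backlog, regardless of who created it.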
However, the LRUs of slabs are per memcg (memcg-aware shrinkers) while the deferred objects
are per shrinker. This may have some bad effects:
* Poor isolation among memcgs. Some memcgs which happen to have frequent limit
reclaim may get nr_deferred accumulated to a huge number, then other innocent
memcgs may take the fall. In our case the main workload was hit.
* Unbounded deferred objects. There is no cap for deferred objects, so the count can
grow ridiculously large, as the tracing result showed.
* Easy to get out of control. Although shrinkers take deferred objects into account,
the count can go out of control easily. One misconfigured memcg could incur an absurd
amount of deferred objects in a period of time.
* Various reclaim problems, e.g. over-reclaim, long reclaim latency, etc. There may be
hundreds of GB of slab caches for a vfs metadata heavy workload, and shrinking half of
them may take minutes. We observed latency spikes due to the prolonged reclaim.

These issues have also been discussed in https://lore.kernel.org/linux-mm/[email protected]/.
The patchset is the outcome of that discussion.

So this patchset makes nr_deferred per-memcg to tackle the problem. It does the following:
* Have memcg_shrinker_deferred per memcg per node, just like shrinker_map.
Unlike shrinker_map, it is an atomic_long_t array in which each element represents
one shrinker, even if the shrinker is not memcg aware; this simplifies the implementation.
For memcg aware shrinkers, the deferred objects are just accumulated to their own
memcg, and the shrinkers just see nr_deferred from their own memcg. Non memcg aware
shrinkers still use the global nr_deferred from struct shrinker.
* Once the memcg is offlined, its nr_deferred will be reparented to its parent along
with LRUs.
* The root memcg has a memcg_shrinker_deferred array too. It simplifies the handling of
reparenting to the root memcg.
* Cap nr_deferred to 2x of the length of the lru. The idea is borrowed from Dave Chinner's
series (https://lore.kernel.org/linux-xfs/[email protected]/)

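The design points listed above can be sketched in user-space C11. This is only an illustration under stated assumptions and not the code from the patches: struct toy_memcg_node, toy_add_deferred() and toy_reparent() are hypothetical names, the array length of 4 is arbitrary, and the cap is applied at accumulation time purely to keep the example short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NR_SHRINKERS 4 /* arbitrary; one slot per shrinker id */

/* Hypothetical per-memcg, per-node state: one counter per shrinker. */
struct toy_memcg_node {
	struct toy_memcg_node *parent;
	atomic_long nr_deferred[TOY_NR_SHRINKERS];
};

/*
 * Accumulate deferred work in this memcg only, capped at twice the
 * length of the LRU. Returns the new counter value.
 */
static long toy_add_deferred(struct toy_memcg_node *m, int id, long nr,
			     long lru_len)
{
	long cap = 2 * lru_len;
	long old, new;

	assert(id >= 0 && id < TOY_NR_SHRINKERS);
	old = atomic_load(&m->nr_deferred[id]);
	do {
		new = old + nr;
		if (new > cap)
			new = cap;
	} while (!atomic_compare_exchange_weak(&m->nr_deferred[id], &old, new));
	return new;
}

/* On offline, fold every counter into the parent and clear the child. */
static void toy_reparent(struct toy_memcg_node *child)
{
	assert(child->parent != NULL);
	for (int id = 0; id < TOY_NR_SHRINKERS; id++) {
		long nr = atomic_exchange(&child->nr_deferred[id], 0);

		atomic_fetch_add(&child->parent->nr_deferred[id], nr);
	}
}
```

Because each memcg only ever touches its own array until it is offlined, a backlog created by one memcg is not visible to reclaim in its siblings.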
The downside is each memcg has to allocate extra memory to store the nr_deferred array.
On our production environment, there are typically around 40 shrinkers, so each memcg
needs ~320 bytes. 10K memcgs would need ~3.2MB memory. It seems fine.
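
The overhead estimate can be checked with a small calculation. The sketch below assumes a 64-bit kernel where one atomic_long_t takes 8 bytes, and it treats the number of nodes as a separate factor because the array is per memcg per node; the figures above correspond to a single node. The function name is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed size of one atomic_long_t on a 64-bit kernel. */
#define TOY_ATOMIC_LONG_SIZE 8UL

/* Bytes needed for the nr_deferred arrays of all memcgs on all nodes. */
static size_t toy_nr_deferred_bytes(size_t nr_shrinkers, size_t nr_memcgs,
				    size_t nr_nodes)
{
	assert(nr_nodes > 0);
	return nr_shrinkers * TOY_ATOMIC_LONG_SIZE * nr_memcgs * nr_nodes;
}
```

With 40 shrinkers this gives 320 bytes for one memcg on one node, and 3,200,000 bytes (~3.2MB) for 10K memcgs on one node.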

We have been running the patched kernel on some hosts of our fleet (test and production) for
months, and it works very well. The monitoring data shows the working set is sustained as expected.

Yang Shi (12):
mm: vmscan: use nid from shrink_control for tracepoint
mm: vmscan: consolidate shrinker_maps handling code
mm: vmscan: use shrinker_rwsem to protect shrinker_maps allocation
mm: vmscan: remove memcg_shrinker_map_size
mm: memcontrol: rename shrinker_map to shrinker_info
mm: vmscan: add shrinker_info_protected() helper
mm: vmscan: use a new flag to indicate shrinker is registered
mm: vmscan: add per memcg shrinker nr_deferred
mm: vmscan: use per memcg nr_deferred of shrinker
mm: vmscan: don't need allocate shrinker->nr_deferred for memcg aware shrinkers
mm: memcontrol: reparent nr_deferred when memcg offline
mm: vmscan: shrink deferred objects proportional to priority

include/linux/memcontrol.h | 23 +++---
include/linux/shrinker.h | 7 +-
mm/huge_memory.c | 4 +-
mm/list_lru.c | 6 +-
mm/memcontrol.c | 130 +-------------------------------
mm/vmscan.c | 375 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------
6 files changed, 303 insertions(+), 242 deletions(-)


2021-02-09 18:42:33

by Yang Shi

Subject: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
This approach is fine with nr_deferred at the shrinker level, but the following
patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
from unregistering correctly.

Remove SHRINKER_REGISTERING since we could check if shrinker is registered
successfully by the new flag.
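
A minimal user-space sketch of the reasoning above, with hypothetical names (struct toy_shrinker_state and the toy_* helpers are not kernel code, and only the flag bit positions follow this patch): once a memcg aware shrinker keeps nr_deferred NULL, the old pointer test can no longer tell a registered shrinker from an unregistered one, while a dedicated flag bit can.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_REGISTERED  (1U << 0)
#define TOY_MEMCG_AWARE (1U << 2)

/* Hypothetical reduced view of a shrinker: flags and the per-shrinker array. */
struct toy_shrinker_state {
	unsigned int flags;
	long *nr_deferred; /* stays NULL for memcg aware shrinkers later on */
};

/* Old test: registered if the per-shrinker array was allocated. */
static bool toy_registered_old(const struct toy_shrinker_state *s)
{
	assert(s != NULL);
	return s->nr_deferred != NULL;
}

/* New test: registered if the dedicated flag bit is set. */
static bool toy_registered_new(const struct toy_shrinker_state *s)
{
	assert(s != NULL);
	return (s->flags & TOY_REGISTERED) != 0;
}

static void toy_register(struct toy_shrinker_state *s)
{
	s->flags |= TOY_REGISTERED;
}

static void toy_unregister(struct toy_shrinker_state *s)
{
	s->flags &= ~TOY_REGISTERED;
}
```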

Acked-by: Kirill Tkhai <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/shrinker.h | 7 ++++---
mm/vmscan.c | 31 +++++++++----------------------
2 files changed, 13 insertions(+), 25 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..1eac79ce57d4 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -79,13 +79,14 @@ struct shrinker {
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */

/* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
-#define SHRINKER_MEMCG_AWARE (1 << 1)
+#define SHRINKER_REGISTERED (1 << 0)
+#define SHRINKER_NUMA_AWARE (1 << 1)
+#define SHRINKER_MEMCG_AWARE (1 << 2)
/*
* It just makes sense when the shrinker is also MEMCG_AWARE for now,
* non-MEMCG_AWARE shrinker should not have this flag set.
*/
-#define SHRINKER_NONSLAB (1 << 2)
+#define SHRINKER_NONSLAB (1 << 3)

extern int prealloc_shrinker(struct shrinker *shrinker);
extern void register_shrinker_prepared(struct shrinker *shrinker);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 273efbf4d53c..a047980536cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
}
}

-/*
- * We allow subsystems to populate their shrinker-related
- * LRU lists before register_shrinker_prepared() is called
- * for the shrinker, since we don't want to impose
- * restrictions on their internal registration order.
- * In this case shrink_slab_memcg() may find corresponding
- * bit is set in the shrinkers map.
- *
- * This value is used by the function to detect registering
- * shrinkers and to skip do_shrink_slab() calls for them.
- */
-#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
-
static DEFINE_IDR(shrinker_idr);

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
@@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)

down_write(&shrinker_rwsem);
/* This may call shrinker, so it must use down_read_trylock() */
- id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
+ id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
if (id < 0)
goto unlock;

@@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
{
down_write(&shrinker_rwsem);
list_add_tail(&shrinker->list, &shrinker_list);
-#ifdef CONFIG_MEMCG
- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
- idr_replace(&shrinker_idr, shrinker, shrinker->id);
-#endif
+ shrinker->flags |= SHRINKER_REGISTERED;
up_write(&shrinker_rwsem);
}

@@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
*/
void unregister_shrinker(struct shrinker *shrinker)
{
- if (!shrinker->nr_deferred)
+ if (!(shrinker->flags & SHRINKER_REGISTERED))
return;
- if (shrinker->flags & SHRINKER_MEMCG_AWARE)
- unregister_memcg_shrinker(shrinker);
+
down_write(&shrinker_rwsem);
list_del(&shrinker->list);
+ shrinker->flags &= ~SHRINKER_REGISTERED;
up_write(&shrinker_rwsem);
+
+ if (shrinker->flags & SHRINKER_MEMCG_AWARE)
+ unregister_memcg_shrinker(shrinker);
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
@@ -693,7 +680,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;

shrinker = idr_find(&shrinker_idr, i);
- if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
+ if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
if (!shrinker)
clear_bit(i, info->map);
continue;
--
2.26.2

2021-02-10 02:05:39

by Roman Gushchin

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 09, 2021 at 09:46:41AM -0800, Yang Shi wrote:
> Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> from unregistering correctly.
>
> Remove SHRINKER_REGISTERING since we could check if shrinker is registered
> successfully by the new flag.
>
> Acked-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>
> ---
> include/linux/shrinker.h | 7 ++++---
> mm/vmscan.c | 31 +++++++++----------------------
> 2 files changed, 13 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..1eac79ce57d4 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -79,13 +79,14 @@ struct shrinker {
> #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> /* Flags */
> -#define SHRINKER_NUMA_AWARE (1 << 0)
> -#define SHRINKER_MEMCG_AWARE (1 << 1)
> +#define SHRINKER_REGISTERED (1 << 0)
> +#define SHRINKER_NUMA_AWARE (1 << 1)
> +#define SHRINKER_MEMCG_AWARE (1 << 2)
> /*
> * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> * non-MEMCG_AWARE shrinker should not have this flag set.
> */
> -#define SHRINKER_NONSLAB (1 << 2)
> +#define SHRINKER_NONSLAB (1 << 3)
>
> extern int prealloc_shrinker(struct shrinker *shrinker);
> extern void register_shrinker_prepared(struct shrinker *shrinker);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 273efbf4d53c..a047980536cf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> }
> }
>
> -/*
> - * We allow subsystems to populate their shrinker-related
> - * LRU lists before register_shrinker_prepared() is called
> - * for the shrinker, since we don't want to impose
> - * restrictions on their internal registration order.
> - * In this case shrink_slab_memcg() may find corresponding
> - * bit is set in the shrinkers map.
> - *
> - * This value is used by the function to detect registering
> - * shrinkers and to skip do_shrink_slab() calls for them.
> - */
> -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> -
> static DEFINE_IDR(shrinker_idr);
>
> static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> @@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>
> down_write(&shrinker_rwsem);
> /* This may call shrinker, so it must use down_read_trylock() */
> - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> if (id < 0)
> goto unlock;
>
> @@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> {
> down_write(&shrinker_rwsem);
> list_add_tail(&shrinker->list, &shrinker_list);
> -#ifdef CONFIG_MEMCG
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> -#endif
> + shrinker->flags |= SHRINKER_REGISTERED;
> up_write(&shrinker_rwsem);
> }
>
> @@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
> */
> void unregister_shrinker(struct shrinker *shrinker)
> {
> - if (!shrinker->nr_deferred)
> + if (!(shrinker->flags & SHRINKER_REGISTERED))
> return;
> - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> - unregister_memcg_shrinker(shrinker);
> +
> down_write(&shrinker_rwsem);
> list_del(&shrinker->list);
> + shrinker->flags &= ~SHRINKER_REGISTERED;
> up_write(&shrinker_rwsem);
> +
> + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> + unregister_memcg_shrinker(shrinker);

Because unregister_memcg_shrinker() will take and release shrinker_rwsem once again,
I wonder if it's better to move it into the locked section and change the calling
convention to require the caller to take the semaphore?

> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> }
> @@ -693,7 +680,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
>
> shrinker = idr_find(&shrinker_idr, i);
> - if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> + if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> if (!shrinker)
> clear_bit(i, info->map);
> continue;
> --
> 2.26.2
>

2021-02-10 02:17:01

by Yang Shi

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 9, 2021 at 4:39 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Feb 09, 2021 at 09:46:41AM -0800, Yang Shi wrote:
> > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > This approach is fine with nr_deferred at the shrinker level, but the following
> > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> > from unregistering correctly.
> >
> > Remove SHRINKER_REGISTERING since we could check if shrinker is registered
> > successfully by the new flag.
> >
> > Acked-by: Kirill Tkhai <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > include/linux/shrinker.h | 7 ++++---
> > mm/vmscan.c | 31 +++++++++----------------------
> > 2 files changed, 13 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..1eac79ce57d4 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -79,13 +79,14 @@ struct shrinker {
> > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> >
> > /* Flags */
> > -#define SHRINKER_NUMA_AWARE (1 << 0)
> > -#define SHRINKER_MEMCG_AWARE (1 << 1)
> > +#define SHRINKER_REGISTERED (1 << 0)
> > +#define SHRINKER_NUMA_AWARE (1 << 1)
> > +#define SHRINKER_MEMCG_AWARE (1 << 2)
> > /*
> > * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> > * non-MEMCG_AWARE shrinker should not have this flag set.
> > */
> > -#define SHRINKER_NONSLAB (1 << 2)
> > +#define SHRINKER_NONSLAB (1 << 3)
> >
> > extern int prealloc_shrinker(struct shrinker *shrinker);
> > extern void register_shrinker_prepared(struct shrinker *shrinker);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 273efbf4d53c..a047980536cf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> > }
> > }
> >
> > -/*
> > - * We allow subsystems to populate their shrinker-related
> > - * LRU lists before register_shrinker_prepared() is called
> > - * for the shrinker, since we don't want to impose
> > - * restrictions on their internal registration order.
> > - * In this case shrink_slab_memcg() may find corresponding
> > - * bit is set in the shrinkers map.
> > - *
> > - * This value is used by the function to detect registering
> > - * shrinkers and to skip do_shrink_slab() calls for them.
> > - */
> > -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> > -
> > static DEFINE_IDR(shrinker_idr);
> >
> > static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > @@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >
> > down_write(&shrinker_rwsem);
> > /* This may call shrinker, so it must use down_read_trylock() */
> > - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> > + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> > if (id < 0)
> > goto unlock;
> >
> > @@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> > {
> > down_write(&shrinker_rwsem);
> > list_add_tail(&shrinker->list, &shrinker_list);
> > -#ifdef CONFIG_MEMCG
> > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> > -#endif
> > + shrinker->flags |= SHRINKER_REGISTERED;
> > up_write(&shrinker_rwsem);
> > }
> >
> > @@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
> > */
> > void unregister_shrinker(struct shrinker *shrinker)
> > {
> > - if (!shrinker->nr_deferred)
> > + if (!(shrinker->flags & SHRINKER_REGISTERED))
> > return;
> > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > - unregister_memcg_shrinker(shrinker);
> > +
> > down_write(&shrinker_rwsem);
> > list_del(&shrinker->list);
> > + shrinker->flags &= ~SHRINKER_REGISTERED;
> > up_write(&shrinker_rwsem);
> > +
> > + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > + unregister_memcg_shrinker(shrinker);
>
> Because unregister_memcg_shrinker() will take and release shrinker_rwsem once again,
> I wonder if it's better to move it into the locked section and change the calling
> convention to require the caller to take the semaphore?

I don't think we could do that since unregister_memcg_shrinker() is
called by free_prealloced_shrinker() which is called without holding
the shrinker_rwsem by fs and workingset code.

We could add a bool parameter to indicate if the rwsem was acquired or
not, but IMHO it seems not worth it.

>
> > kfree(shrinker->nr_deferred);
> > shrinker->nr_deferred = NULL;
> > }
> > @@ -693,7 +680,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> > struct shrinker *shrinker;
> >
> > shrinker = idr_find(&shrinker_idr, i);
> > - if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> > + if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> > if (!shrinker)
> > clear_bit(i, info->map);
> > continue;
> > --
> > 2.26.2
> >

2021-02-10 02:25:39

by Roman Gushchin

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 09, 2021 at 05:12:51PM -0800, Yang Shi wrote:
> On Tue, Feb 9, 2021 at 4:39 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Tue, Feb 09, 2021 at 09:46:41AM -0800, Yang Shi wrote:
> > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > This approach is fine with nr_deferred at the shrinker level, but the following
> > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> > > from unregistering correctly.
> > >
> > > Remove SHRINKER_REGISTERING since we could check if shrinker is registered
> > > successfully by the new flag.
> > >
> > > Acked-by: Kirill Tkhai <[email protected]>
> > > Signed-off-by: Yang Shi <[email protected]>
> > > ---
> > > include/linux/shrinker.h | 7 ++++---
> > > mm/vmscan.c | 31 +++++++++----------------------
> > > 2 files changed, 13 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 0f80123650e2..1eac79ce57d4 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -79,13 +79,14 @@ struct shrinker {
> > > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> > >
> > > /* Flags */
> > > -#define SHRINKER_NUMA_AWARE (1 << 0)
> > > -#define SHRINKER_MEMCG_AWARE (1 << 1)
> > > +#define SHRINKER_REGISTERED (1 << 0)
> > > +#define SHRINKER_NUMA_AWARE (1 << 1)
> > > +#define SHRINKER_MEMCG_AWARE (1 << 2)
> > > /*
> > > * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> > > * non-MEMCG_AWARE shrinker should not have this flag set.
> > > */
> > > -#define SHRINKER_NONSLAB (1 << 2)
> > > +#define SHRINKER_NONSLAB (1 << 3)
> > >
> > > extern int prealloc_shrinker(struct shrinker *shrinker);
> > > extern void register_shrinker_prepared(struct shrinker *shrinker);
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 273efbf4d53c..a047980536cf 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> > > }
> > > }
> > >
> > > -/*
> > > - * We allow subsystems to populate their shrinker-related
> > > - * LRU lists before register_shrinker_prepared() is called
> > > - * for the shrinker, since we don't want to impose
> > > - * restrictions on their internal registration order.
> > > - * In this case shrink_slab_memcg() may find corresponding
> > > - * bit is set in the shrinkers map.
> > > - *
> > > - * This value is used by the function to detect registering
> > > - * shrinkers and to skip do_shrink_slab() calls for them.
> > > - */
> > > -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> > > -
> > > static DEFINE_IDR(shrinker_idr);
> > >
> > > static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > > @@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > >
> > > down_write(&shrinker_rwsem);
> > > /* This may call shrinker, so it must use down_read_trylock() */
> > > - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> > > + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> > > if (id < 0)
> > > goto unlock;
> > >
> > > @@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> > > {
> > > down_write(&shrinker_rwsem);
> > > list_add_tail(&shrinker->list, &shrinker_list);
> > > -#ifdef CONFIG_MEMCG
> > > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> > > -#endif
> > > + shrinker->flags |= SHRINKER_REGISTERED;
> > > up_write(&shrinker_rwsem);
> > > }
> > >
> > > @@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
> > > */
> > > void unregister_shrinker(struct shrinker *shrinker)
> > > {
> > > - if (!shrinker->nr_deferred)
> > > + if (!(shrinker->flags & SHRINKER_REGISTERED))
> > > return;
> > > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > - unregister_memcg_shrinker(shrinker);
> > > +
> > > down_write(&shrinker_rwsem);
> > > list_del(&shrinker->list);
> > > + shrinker->flags &= ~SHRINKER_REGISTERED;
> > > up_write(&shrinker_rwsem);
> > > +
> > > + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > + unregister_memcg_shrinker(shrinker);
> >
> > Because unregister_memcg_shrinker() will take and release shrinker_rwsem once again,
> > I wonder if it's better to move it into the locked section and change the calling
> > convention to require the caller to take the semaphore?
>
> I don't think we could do that since unregister_memcg_shrinker() is
> called by free_prealloced_shrinker() which is called without holding
> the shrinker_rwsem by fs and workingset code.
>
> We could add a bool parameter to indicate if the rwsem was acquired or
> not, but IMHO it seems not worth it.

Can free_preallocated_shrinker() just do

if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
down_write(&shrinker_rwsem);
unregister_memcg_shrinker(shrinker);
up_write(&shrinker_rwsem);
}

?

2021-02-10 02:27:42

by Yang Shi

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 9, 2021 at 5:34 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Feb 09, 2021 at 05:12:51PM -0800, Yang Shi wrote:
> > On Tue, Feb 9, 2021 at 4:39 PM Roman Gushchin <[email protected]> wrote:
> > >
> > > On Tue, Feb 09, 2021 at 09:46:41AM -0800, Yang Shi wrote:
> > > > Currently registered shrinker is indicated by non-NULL shrinker->nr_deferred.
> > > > This approach is fine with nr_deferred at the shrinker level, but the following
> > > > patches will move MEMCG_AWARE shrinkers' nr_deferred to memcg level, so their
> > > > shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> > > > from unregistering correctly.
> > > >
> > > > Remove SHRINKER_REGISTERING since we could check if shrinker is registered
> > > > successfully by the new flag.
> > > >
> > > > Acked-by: Kirill Tkhai <[email protected]>
> > > > Signed-off-by: Yang Shi <[email protected]>
> > > > ---
> > > > include/linux/shrinker.h | 7 ++++---
> > > > mm/vmscan.c | 31 +++++++++----------------------
> > > > 2 files changed, 13 insertions(+), 25 deletions(-)
> > > >
> > > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > > index 0f80123650e2..1eac79ce57d4 100644
> > > > --- a/include/linux/shrinker.h
> > > > +++ b/include/linux/shrinker.h
> > > > @@ -79,13 +79,14 @@ struct shrinker {
> > > > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> > > >
> > > > /* Flags */
> > > > -#define SHRINKER_NUMA_AWARE (1 << 0)
> > > > -#define SHRINKER_MEMCG_AWARE (1 << 1)
> > > > +#define SHRINKER_REGISTERED (1 << 0)
> > > > +#define SHRINKER_NUMA_AWARE (1 << 1)
> > > > +#define SHRINKER_MEMCG_AWARE (1 << 2)
> > > > /*
> > > > * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> > > > * non-MEMCG_AWARE shrinker should not have this flag set.
> > > > */
> > > > -#define SHRINKER_NONSLAB (1 << 2)
> > > > +#define SHRINKER_NONSLAB (1 << 3)
> > > >
> > > > extern int prealloc_shrinker(struct shrinker *shrinker);
> > > > extern void register_shrinker_prepared(struct shrinker *shrinker);
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 273efbf4d53c..a047980536cf 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> > > > }
> > > > }
> > > >
> > > > -/*
> > > > - * We allow subsystems to populate their shrinker-related
> > > > - * LRU lists before register_shrinker_prepared() is called
> > > > - * for the shrinker, since we don't want to impose
> > > > - * restrictions on their internal registration order.
> > > > - * In this case shrink_slab_memcg() may find corresponding
> > > > - * bit is set in the shrinkers map.
> > > > - *
> > > > - * This value is used by the function to detect registering
> > > > - * shrinkers and to skip do_shrink_slab() calls for them.
> > > > - */
> > > > -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> > > > -
> > > > static DEFINE_IDR(shrinker_idr);
> > > >
> > > > static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > > > @@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > > >
> > > > down_write(&shrinker_rwsem);
> > > > /* This may call shrinker, so it must use down_read_trylock() */
> > > > - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> > > > + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> > > > if (id < 0)
> > > > goto unlock;
> > > >
> > > > @@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> > > > {
> > > > down_write(&shrinker_rwsem);
> > > > list_add_tail(&shrinker->list, &shrinker_list);
> > > > -#ifdef CONFIG_MEMCG
> > > > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > > - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> > > > -#endif
> > > > + shrinker->flags |= SHRINKER_REGISTERED;
> > > > up_write(&shrinker_rwsem);
> > > > }
> > > >
> > > > @@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
> > > > */
> > > > void unregister_shrinker(struct shrinker *shrinker)
> > > > {
> > > > - if (!shrinker->nr_deferred)
> > > > + if (!(shrinker->flags & SHRINKER_REGISTERED))
> > > > return;
> > > > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > > - unregister_memcg_shrinker(shrinker);
> > > > +
> > > > down_write(&shrinker_rwsem);
> > > > list_del(&shrinker->list);
> > > > + shrinker->flags &= ~SHRINKER_REGISTERED;
> > > > up_write(&shrinker_rwsem);
> > > > +
> > > > + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > > > + unregister_memcg_shrinker(shrinker);
> > >
> > > Because unregister_memcg_shrinker() will take and release shrinker_rwsem once again,
> > > I wonder if it's better to move it into the locked section and change the calling
> > > convention to require the caller to take the semaphore?
> >
> > I don't think we could do that since unregister_memcg_shrinker() is
> > called by free_prealloced_shrinker() which is called without holding
> > the shrinker_rwsem by fs and workingset code.
> >
> > We could add a bool parameter to indicate if the rwsem was acquired or
> > not, but IMHO it seems not worth it.
>
> Can free_prealloced_shrinker() just do
>
> if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
> down_write(&shrinker_rwsem);
> unregister_memcg_shrinker(shrinker);
> up_write(&shrinker_rwsem);
> }
>
> ?

Aha, yes. I didn't think of it that way.
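
The calling convention discussed above (the caller takes shrinker_rwsem, and unregister_memcg_shrinker() only asserts that it is held) can be sketched in userspace as below. This is a single-threaded model under stated assumptions, not the kernel code: every sketch_* name is hypothetical, a plain bool stands in for the rwsem, and assert() stands in for a lockdep-style check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model only. */
#define SKETCH_REGISTERED   (1u << 0)
#define SKETCH_MEMCG_AWARE  (1u << 2)

struct sketch_shrinker {
	unsigned int flags;
	int id;			/* -1 once the id has been released */
};

/* A bool models "shrinker_rwsem held for write"; not a real lock. */
static bool sketch_rwsem_write_held;
static int sketch_lock_acquisitions;

static void sketch_down_write(void)
{
	/* Re-acquiring a held write lock would deadlock, so flag it. */
	assert(!sketch_rwsem_write_held);
	sketch_rwsem_write_held = true;
	sketch_lock_acquisitions++;
}

static void sketch_up_write(void)
{
	assert(sketch_rwsem_write_held);
	sketch_rwsem_write_held = false;
}

/* Caller must hold the lock; this only checks, it never locks. */
static void sketch_unregister_memcg_shrinker(struct sketch_shrinker *s)
{
	assert(sketch_rwsem_write_held);
	s->id = -1;
}

/* One lock round trip covers both the flag update and the id release. */
static void sketch_unregister_shrinker(struct sketch_shrinker *s)
{
	if (!(s->flags & SKETCH_REGISTERED))
		return;

	sketch_down_write();
	s->flags &= ~SKETCH_REGISTERED;
	if (s->flags & SKETCH_MEMCG_AWARE)
		sketch_unregister_memcg_shrinker(s);
	sketch_up_write();
}

/* The other caller takes the lock itself, as suggested in the thread. */
static void sketch_free_prealloced_shrinker(struct sketch_shrinker *s)
{
	if (s->flags & SKETCH_MEMCG_AWARE) {
		sketch_down_write();
		sketch_unregister_memcg_shrinker(s);
		sketch_up_write();
	}
}
```

With this shape, the unregister path takes the lock once instead of twice, and a caller that forgets to take it trips the assertion instead of racing silently.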

2021-02-10 07:48:59

by Yang Shi

Subject: [v7 PATCH 04/12] mm: vmscan: remove memcg_shrinker_map_size

Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the map size
can be calculated from shrinker_nr_max, so it is unnecessary to keep both.
Remove memcg_shrinker_map_size, since shrinker_nr_max is also used when iterating
over the bitmap.

Acked-by: Kirill Tkhai <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/vmscan.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e4ddaaaeffe2..641077b09e5d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -185,8 +185,10 @@ static LIST_HEAD(shrinker_list);
static DECLARE_RWSEM(shrinker_rwsem);

#ifdef CONFIG_MEMCG
+static int shrinker_nr_max;

-static int memcg_shrinker_map_size;
+#define NR_MAX_TO_SHR_MAP_SIZE(nr_max) \
+ (DIV_ROUND_UP(nr_max, BITS_PER_LONG) * sizeof(unsigned long))

static void free_shrinker_map_rcu(struct rcu_head *head)
{
@@ -247,7 +249,7 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
return 0;

down_write(&shrinker_rwsem);
- size = memcg_shrinker_map_size;
+ size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
for_each_node(nid) {
map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
if (!map) {
@@ -265,12 +267,13 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
static int expand_shrinker_maps(int new_id)
{
int size, old_size, ret = 0;
+ int new_nr_max = new_id + 1;
struct mem_cgroup *memcg;

- size = DIV_ROUND_UP(new_id + 1, BITS_PER_LONG) * sizeof(unsigned long);
- old_size = memcg_shrinker_map_size;
+ size = NR_MAX_TO_SHR_MAP_SIZE(new_nr_max);
+ old_size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
if (size <= old_size)
- return 0;
+ goto out;

if (!root_mem_cgroup)
goto out;
@@ -287,7 +290,7 @@ static int expand_shrinker_maps(int new_id)
} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);
out:
if (!ret)
- memcg_shrinker_map_size = size;
+ shrinker_nr_max = new_nr_max;

return ret;
}
@@ -320,7 +323,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)

static DEFINE_IDR(shrinker_idr);
-static int shrinker_nr_max;

static int prealloc_memcg_shrinker(struct shrinker *shrinker)
{
@@ -337,8 +339,6 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
idr_remove(&shrinker_idr, id);
goto unlock;
}
-
- shrinker_nr_max = id + 1;
}
shrinker->id = id;
ret = 0;
--
2.26.2
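
As a rough userspace illustration of the arithmetic in the patch above: the map size in bytes is derived from shrinker_nr_max by rounding up to whole longs, and shrinker_nr_max is updated even when no reallocation is needed (the effect of replacing "return 0" with "goto out"). The macros mirror the ones in the diff; sketch_nr_max, sketch_reallocs and sketch_expand() are hypothetical names for this model, and BITS_PER_LONG and DIV_ROUND_UP are redefined locally because the kernel headers are not available here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Local stand-ins for the kernel definitions. */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Same shape as the macro added by the patch: bytes needed for nr_max bits. */
#define NR_MAX_TO_SHR_MAP_SIZE(nr_max) \
	(DIV_ROUND_UP(nr_max, BITS_PER_LONG) * sizeof(unsigned long))

static int sketch_nr_max;	/* models shrinker_nr_max */
static int sketch_reallocs;	/* counts how often the map would be regrown */

/* Models the control flow of the expand path for a newly allocated id. */
static void sketch_expand(int new_id)
{
	int new_nr_max = new_id + 1;
	size_t size, old_size;

	assert(new_id >= 0);

	size = NR_MAX_TO_SHR_MAP_SIZE(new_nr_max);
	old_size = NR_MAX_TO_SHR_MAP_SIZE(sketch_nr_max);

	/* Only a crossing into a new long requires a bigger map. */
	if (size > old_size)
		sketch_reallocs++;

	/* nr_max is recorded in both cases. */
	sketch_nr_max = new_nr_max;
}
```

Ids that fall within the same long as the current maximum therefore cost no reallocation, while the iteration bound still tracks the highest id in use.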

2021-02-10 07:50:24

by Yang Shi

Subject: [v7 PATCH 05/12] mm: memcontrol: rename shrinker_map to shrinker_info

The following patch is going to add nr_deferred into shrinker_map, so the struct
will no longer hold only the map. Rename it to "shrinker_info", which also drops
the "memcg_" prefix. This should make the patch adding nr_deferred cleaner and
more readable, and make review easier.

Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Kirill Tkhai <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
include/linux/memcontrol.h | 8 ++---
mm/memcontrol.c | 6 ++--
mm/vmscan.c | 62 +++++++++++++++++++-------------------
3 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1739f17e0939..4c9253896e25 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -96,7 +96,7 @@ struct lruvec_stat {
* Bitmap of shrinker::id corresponding to memcg-aware shrinkers,
* which have elements charged to this memcg.
*/
-struct memcg_shrinker_map {
+struct shrinker_info {
struct rcu_head rcu;
unsigned long map[];
};
@@ -118,7 +118,7 @@ struct mem_cgroup_per_node {

struct mem_cgroup_reclaim_iter iter;

- struct memcg_shrinker_map __rcu *shrinker_map;
+ struct shrinker_info __rcu *shrinker_info;

struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
@@ -1581,8 +1581,8 @@ static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
return false;
}

-int alloc_shrinker_maps(struct mem_cgroup *memcg);
-void free_shrinker_maps(struct mem_cgroup *memcg);
+int alloc_shrinker_info(struct mem_cgroup *memcg);
+void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
#else
#define mem_cgroup_sockets_enabled 0
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f5c9a0d2160b..f64ad0d044d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5246,11 +5246,11 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);

/*
- * A memcg must be visible for expand_shrinker_maps()
+ * A memcg must be visible for expand_shrinker_info()
* by the time the maps are allocated. So, we allocate maps
* here, when for_each_mem_cgroup() can't skip it.
*/
- if (alloc_shrinker_maps(memcg)) {
+ if (alloc_shrinker_info(memcg)) {
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -5314,7 +5314,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);
cancel_work_sync(&memcg->high_work);
mem_cgroup_remove_from_trees(memcg);
- free_shrinker_maps(memcg);
+ free_shrinker_info(memcg);
memcg_free_kmem(memcg);
mem_cgroup_free(memcg);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 641077b09e5d..9436f9246d32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -190,20 +190,20 @@ static int shrinker_nr_max;
#define NR_MAX_TO_SHR_MAP_SIZE(nr_max) \
(DIV_ROUND_UP(nr_max, BITS_PER_LONG) * sizeof(unsigned long))

-static void free_shrinker_map_rcu(struct rcu_head *head)
+static void free_shrinker_info_rcu(struct rcu_head *head)
{
- kvfree(container_of(head, struct memcg_shrinker_map, rcu));
+ kvfree(container_of(head, struct shrinker_info, rcu));
}

-static int expand_one_shrinker_map(struct mem_cgroup *memcg,
+static int expand_one_shrinker_info(struct mem_cgroup *memcg,
int size, int old_size)
{
- struct memcg_shrinker_map *new, *old;
+ struct shrinker_info *new, *old;
int nid;

for_each_node(nid) {
old = rcu_dereference_protected(
- mem_cgroup_nodeinfo(memcg, nid)->shrinker_map, true);
+ mem_cgroup_nodeinfo(memcg, nid)->shrinker_info, true);
/* Not yet online memcg */
if (!old)
return 0;
@@ -216,17 +216,17 @@ static int expand_one_shrinker_map(struct mem_cgroup *memcg,
memset(new->map, (int)0xff, old_size);
memset((void *)new->map + old_size, 0, size - old_size);

- rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, new);
- call_rcu(&old->rcu, free_shrinker_map_rcu);
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, new);
+ call_rcu(&old->rcu, free_shrinker_info_rcu);
}

return 0;
}

-void free_shrinker_maps(struct mem_cgroup *memcg)
+void free_shrinker_info(struct mem_cgroup *memcg)
{
struct mem_cgroup_per_node *pn;
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
int nid;

if (mem_cgroup_is_root(memcg))
@@ -234,15 +234,15 @@ void free_shrinker_maps(struct mem_cgroup *memcg)

for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
- map = rcu_dereference_protected(pn->shrinker_map, true);
- kvfree(map);
- rcu_assign_pointer(pn->shrinker_map, NULL);
+ info = rcu_dereference_protected(pn->shrinker_info, true);
+ kvfree(info);
+ rcu_assign_pointer(pn->shrinker_info, NULL);
}
}

-int alloc_shrinker_maps(struct mem_cgroup *memcg)
+int alloc_shrinker_info(struct mem_cgroup *memcg)
{
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
int nid, size, ret = 0;

if (mem_cgroup_is_root(memcg))
@@ -251,20 +251,20 @@ int alloc_shrinker_maps(struct mem_cgroup *memcg)
down_write(&shrinker_rwsem);
size = NR_MAX_TO_SHR_MAP_SIZE(shrinker_nr_max);
for_each_node(nid) {
- map = kvzalloc_node(sizeof(*map) + size, GFP_KERNEL, nid);
- if (!map) {
- free_shrinker_maps(memcg);
+ info = kvzalloc_node(sizeof(*info) + size, GFP_KERNEL, nid);
+ if (!info) {
+ free_shrinker_info(memcg);
ret = -ENOMEM;
break;
}
- rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_map, map);
+ rcu_assign_pointer(memcg->nodeinfo[nid]->shrinker_info, info);
}
up_write(&shrinker_rwsem);

return ret;
}

-static int expand_shrinker_maps(int new_id)
+static int expand_shrinker_info(int new_id)
{
int size, old_size, ret = 0;
int new_nr_max = new_id + 1;
@@ -282,7 +282,7 @@ static int expand_shrinker_maps(int new_id)
do {
if (mem_cgroup_is_root(memcg))
continue;
- ret = expand_one_shrinker_map(memcg, size, old_size);
+ ret = expand_one_shrinker_info(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
goto out;
@@ -298,13 +298,13 @@ static int expand_shrinker_maps(int new_id)
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;

rcu_read_lock();
- map = rcu_dereference(memcg->nodeinfo[nid]->shrinker_map);
+ info = rcu_dereference(memcg->nodeinfo[nid]->shrinker_info);
/* Pairs with smp mb in shrink_slab() */
smp_mb__before_atomic();
- set_bit(shrinker_id, map->map);
+ set_bit(shrinker_id, info->map);
rcu_read_unlock();
}
}
@@ -335,7 +335,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
goto unlock;

if (id >= shrinker_nr_max) {
- if (expand_shrinker_maps(id)) {
+ if (expand_shrinker_info(id)) {
idr_remove(&shrinker_idr, id);
goto unlock;
}
@@ -664,7 +664,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct mem_cgroup *memcg, int priority)
{
- struct memcg_shrinker_map *map;
+ struct shrinker_info *info;
unsigned long ret, freed = 0;
int i;

@@ -674,12 +674,12 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
if (!down_read_trylock(&shrinker_rwsem))
return 0;

- map = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_map,
- true);
- if (unlikely(!map))
+ info = rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
+ true);
+ if (unlikely(!info))
goto unlock;

- for_each_set_bit(i, map->map, shrinker_nr_max) {
+ for_each_set_bit(i, info->map, shrinker_nr_max) {
struct shrink_control sc = {
.gfp_mask = gfp_mask,
.nid = nid,
@@ -690,7 +690,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
shrinker = idr_find(&shrinker_idr, i);
if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
if (!shrinker)
- clear_bit(i, map->map);
+ clear_bit(i, info->map);
continue;
}

@@ -701,7 +701,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,

ret = do_shrink_slab(&sc, shrinker, priority);
if (ret == SHRINK_EMPTY) {
- clear_bit(i, map->map);
+ clear_bit(i, info->map);
/*
* After the shrinker reported that it had no objects to
* free, but before we cleared the corresponding bit in
--
2.26.2

2021-02-10 18:37:47

by Vlastimil Babka

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On 2/9/21 6:46 PM, Yang Shi wrote:
> Currently, a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
> This approach is fine with nr_deferred at the shrinker level, but the following
> patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
> shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> from unregistering correctly.
>
> Remove SHRINKER_REGISTERING, since the new flag tells us whether a shrinker
> has been registered successfully.
>
> Acked-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

With Roman's suggestion it's fine by me too.
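
For readers following the thread, the flag-based registration state from the quoted commit message can be modelled in userspace roughly as follows. The flag names and bit positions are taken from the patch; struct model_shrinker and the model_* functions are hypothetical, and locking is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit layout as introduced by the patch. */
#define SHRINKER_REGISTERED	(1 << 0)
#define SHRINKER_NUMA_AWARE	(1 << 1)
#define SHRINKER_MEMCG_AWARE	(1 << 2)

struct model_shrinker {
	unsigned int flags;
	void *nr_deferred;	/* may stay NULL for memcg-aware shrinkers */
};

/* Registration is recorded in the flags, not inferred from nr_deferred. */
static void model_register(struct model_shrinker *s)
{
	assert(s != NULL);
	s->flags |= SHRINKER_REGISTERED;
}

/* Returns true if the shrinker was registered and has now been removed. */
static bool model_unregister(struct model_shrinker *s)
{
	if (!(s->flags & SHRINKER_REGISTERED))
		return false;

	s->flags &= ~SHRINKER_REGISTERED;
	return true;
}

/* The reclaim path skips entries that are absent or not yet registered. */
static bool model_should_shrink(const struct model_shrinker *s)
{
	return s && (s->flags & SHRINKER_REGISTERED);
}
```

The point of the model is that a memcg-aware shrinker whose nr_deferred is NULL still unregisters correctly, and that a preallocated but not yet registered shrinker is skipped without needing a sentinel pointer.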

2021-02-10 18:54:29

by Yang Shi

Subject: Re: [v7 PATCH 07/12] mm: vmscan: use a new flag to indicate shrinker is registered

On Tue, Feb 9, 2021 at 4:39 PM Roman Gushchin <[email protected]> wrote:
>
> On Tue, Feb 09, 2021 at 09:46:41AM -0800, Yang Shi wrote:
> > > Currently, a registered shrinker is indicated by a non-NULL shrinker->nr_deferred.
> > > This approach is fine with nr_deferred at the shrinker level, but the following
> > > patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their
> > > shrinker->nr_deferred would always be NULL. This would prevent the shrinkers
> > > from unregistering correctly.
> > >
> > > Remove SHRINKER_REGISTERING, since the new flag tells us whether a shrinker
> > > has been registered successfully.
> >
> > Acked-by: Kirill Tkhai <[email protected]>
> > Signed-off-by: Yang Shi <[email protected]>
> > ---
> > include/linux/shrinker.h | 7 ++++---
> > mm/vmscan.c | 31 +++++++++----------------------
> > 2 files changed, 13 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..1eac79ce57d4 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -79,13 +79,14 @@ struct shrinker {
> > #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
> >
> > /* Flags */
> > -#define SHRINKER_NUMA_AWARE (1 << 0)
> > -#define SHRINKER_MEMCG_AWARE (1 << 1)
> > +#define SHRINKER_REGISTERED (1 << 0)
> > +#define SHRINKER_NUMA_AWARE (1 << 1)
> > +#define SHRINKER_MEMCG_AWARE (1 << 2)
> > /*
> > * It just makes sense when the shrinker is also MEMCG_AWARE for now,
> > * non-MEMCG_AWARE shrinker should not have this flag set.
> > */
> > -#define SHRINKER_NONSLAB (1 << 2)
> > +#define SHRINKER_NONSLAB (1 << 3)
> >
> > extern int prealloc_shrinker(struct shrinker *shrinker);
> > extern void register_shrinker_prepared(struct shrinker *shrinker);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 273efbf4d53c..a047980536cf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -315,19 +315,6 @@ void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
> > }
> > }
> >
> > -/*
> > - * We allow subsystems to populate their shrinker-related
> > - * LRU lists before register_shrinker_prepared() is called
> > - * for the shrinker, since we don't want to impose
> > - * restrictions on their internal registration order.
> > - * In this case shrink_slab_memcg() may find corresponding
> > - * bit is set in the shrinkers map.
> > - *
> > - * This value is used by the function to detect registering
> > - * shrinkers and to skip do_shrink_slab() calls for them.
> > - */
> > -#define SHRINKER_REGISTERING ((struct shrinker *)~0UL)
> > -
> > static DEFINE_IDR(shrinker_idr);
> >
> > static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> > @@ -336,7 +323,7 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >
> > down_write(&shrinker_rwsem);
> > /* This may call shrinker, so it must use down_read_trylock() */
> > - id = idr_alloc(&shrinker_idr, SHRINKER_REGISTERING, 0, 0, GFP_KERNEL);
> > + id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> > if (id < 0)
> > goto unlock;
> >
> > @@ -499,10 +486,7 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> > {
> > down_write(&shrinker_rwsem);
> > list_add_tail(&shrinker->list, &shrinker_list);
> > -#ifdef CONFIG_MEMCG
> > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > - idr_replace(&shrinker_idr, shrinker, shrinker->id);
> > -#endif
> > + shrinker->flags |= SHRINKER_REGISTERED;
> > up_write(&shrinker_rwsem);
> > }
> >
> > @@ -522,13 +506,16 @@ EXPORT_SYMBOL(register_shrinker);
> > */
> > void unregister_shrinker(struct shrinker *shrinker)
> > {
> > - if (!shrinker->nr_deferred)
> > + if (!(shrinker->flags & SHRINKER_REGISTERED))
> > return;
> > - if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > - unregister_memcg_shrinker(shrinker);
> > +
> > down_write(&shrinker_rwsem);
> > list_del(&shrinker->list);
> > + shrinker->flags &= ~SHRINKER_REGISTERED;
> > up_write(&shrinker_rwsem);
> > +
> > + if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> > + unregister_memcg_shrinker(shrinker);
>
> Because unregister_memcg_shrinker() will take and release shrinker_rwsem once again,
> I wonder if it's better to move it into the locked section and change the calling
> convention to require the caller to take the semaphore?

BTW, I think lockdep_assert_held() in unregister_memcg_shrinker()
is good enough.

>
> > kfree(shrinker->nr_deferred);
> > shrinker->nr_deferred = NULL;
> > }
> > @@ -693,7 +680,7 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> > struct shrinker *shrinker;
> >
> > shrinker = idr_find(&shrinker_idr, i);
> > - if (unlikely(!shrinker || shrinker == SHRINKER_REGISTERING)) {
> > + if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> > if (!shrinker)
> > clear_bit(i, info->map);
> > continue;
> > --
> > 2.26.2
> >