v4:
- Drop v3 patch 1 as well as the modification to mm/percpu.c, as the percpu
  vmstat update isn't frequent enough to be worth caching.
- Add a new patch 1 to move mod_objcg_state() to memcontrol.c instead.
- Combine v3 patches 4 & 5 into a single patch (patch 3).
- Add a new patch 4 to cache both reclaimable & unreclaimable vmstat updates.
- Add a new patch 5 to improve refill_obj_stock() performance.
v3:
- Add missing "inline" qualifier to the alternate mod_obj_stock_state()
in patch 3.
- Remove redundant current_obj_stock() call in patch 5.
v2:
- Fix bug found by test robot in patch 5.
- Update cover letter and commit logs.
With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call to kmem_cache_alloc()
and kmem_cache_free().
Workloads that perform a lot of kmemcache allocations and de-allocations
may therefore see a performance regression, as illustrated in [1] and [2].
A simple kernel module that performs a repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() calls on either a small 32-byte
object or a big 4k object at module init time is used for benchmarking. The
test was run on a Cascade Lake server with turbo-boosting disabled to
reduce run-to-run variation.
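A minimal sketch of such a test module is shown below; the module, cache
and parameter names are made up for illustration and may differ from the
module actually used:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/ktime.h>

#define NR_LOOPS        100000000UL

static unsigned long objsize = 32;      /* 32 for the small, 4096 for the big object */
module_param(objsize, ulong, 0444);

static int __init slab_bench_init(void)
{
        struct kmem_cache *cachep;
        ktime_t start;
        unsigned long i;
        void *obj;

        /* SLAB_ACCOUNT makes allocations from this cache memcg-accounted */
        cachep = kmem_cache_create("slab_bench", objsize, 0, SLAB_ACCOUNT, NULL);
        if (!cachep)
                return -ENOMEM;

        start = ktime_get();
        for (i = 0; i < NR_LOOPS; i++) {
                obj = kmem_cache_alloc(cachep, GFP_KERNEL);
                if (!obj)
                        break;
                kmem_cache_free(cachep, obj);
        }
        pr_info("slab_bench: %lu alloc/free pairs of %lu bytes in %lld ms\n",
                i, objsize, ktime_to_ms(ktime_sub(ktime_get(), start)));

        kmem_cache_destroy(cachep);
        return 0;
}
module_init(slab_bench_init);

static void __exit slab_bench_exit(void) { }
module_exit(slab_bench_exit);

MODULE_LICENSE("GPL");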
With memory accounting disabled, the run time was 2.848s for the small
object and 2.890s for the big object. With memory accounting enabled, the
run times after applying various subsets of the patchset were:
Applied patches   Run time   Accounting overhead   %age 1   %age 2
---------------   --------   -------------------   ------   ------
Small 32-byte object:
     None          10.570s         7.722s          100.0%   271.1%
      1-2           8.560s         5.712s           74.0%   200.6%
      1-3           6.592s         3.744s           48.5%   131.5%
      1-4           7.154s         4.306s           55.8%   151.2%
      1-5           7.192s         4.344s           56.3%   152.5%
Large 4k object:
     None          20.612s        17.722s          100.0%   613.2%
      1-2          20.354s        17.464s           98.5%   604.3%
      1-3          19.395s        16.505s           93.1%   571.1%
      1-4          19.094s        16.204s           91.4%   560.7%
      1-5          13.576s        10.686s           60.3%   369.8%
N.B. %age 1 = overhead/unpatched overhead
     %age 2 = overhead/accounting-disabled run time
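For example, for the small 32-byte object with patches 1-2 applied:
overhead = 8.560s - 2.848s = 5.712s, %age 1 = 5.712/7.722 = 74.0% and
%age 2 = 5.712/2.848 = 200.6%.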
The small object test exercises mainly the object stock charging and
vmstat update code paths. The large object test also exercises the
refill_obj_stock() and __memcg_kmem_charge()/__memcg_kmem_uncharge()
code paths.
The vmstat data stock caching helps in the small object test,
but not so much on the large object test. Similarly, eliminating
irq_disable/irq_enable helps in the small object test and less so in
the large object test. Caching both reclaimable and non-reclaimable
vmstat data actually regresses performance a bit in this particular
small object test.
The final patch to optimize refill_obj_stock() has negligible impact
on the small object test as this code path isn't being exercised. The
large object test, however, sees a pretty good performance improvement
with this patch.
[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/
Waiman Long (5):
mm/memcg: Move mod_objcg_state() to memcontrol.c
mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
mm/memcg: Optimize user context object stock access
mm/memcg: Save both reclaimable & unreclaimable bytes in object stock
mm/memcg: Improve refill_obj_stock() performance
mm/memcontrol.c | 199 +++++++++++++++++++++++++++++++++++++++++-------
mm/slab.h | 16 +---
2 files changed, 175 insertions(+), 40 deletions(-)
--
2.18.1
The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
so that further optimization can be done to it in later patches without
exposing unnecessary details to other mm components.
Signed-off-by: Waiman Long <[email protected]>
---
mm/memcontrol.c | 13 +++++++++++++
mm/slab.h | 16 ++--------------
2 files changed, 15 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..dc9032f28f2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
css_put(&memcg->css);
}
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+ struct mem_cgroup *memcg;
+ struct lruvec *lruvec = NULL;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ mod_memcg_lruvec_state(lruvec, idx, nr);
+ rcu_read_unlock();
+}
+
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
struct memcg_stock_pcp *stock;
diff --git a/mm/slab.h b/mm/slab.h
index 076582f58f68..ae8b85875426 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -239,6 +239,8 @@ static inline bool kmem_cache_debug_flags(struct kmem_cache *s, slab_flags_t fla
#ifdef CONFIG_MEMCG_KMEM
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
gfp_t gfp, bool new_page);
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr);
static inline void memcg_free_page_obj_cgroups(struct page *page)
{
@@ -283,20 +285,6 @@ static inline bool memcg_slab_pre_alloc_hook(struct kmem_cache *s,
return true;
}
-static inline void mod_objcg_state(struct obj_cgroup *objcg,
- struct pglist_data *pgdat,
- enum node_stat_item idx, int nr)
-{
- struct mem_cgroup *memcg;
- struct lruvec *lruvec;
-
- rcu_read_lock();
- memcg = obj_cgroup_memcg(objcg);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mod_memcg_lruvec_state(lruvec, idx, nr);
- rcu_read_unlock();
-}
-
static inline void memcg_slab_post_alloc_hook(struct kmem_cache *s,
struct obj_cgroup *objcg,
gfp_t flags, size_t size,
--
2.18.1
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.
The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reduce it, this patch
caches the vmstat data in the memcg_stock_pcp structure as well, until
a page size worth of updates has accumulated or another piece of cached
data changes. Caching the vmstat data in the per-cpu stock replaces two
writes to non-hot cachelines (for the memcg-specific and the
memcg-lruvec-specific vmstat data) with a write to a hot local stock
cacheline.
On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, about 20% (634400 out of 3243830) of the mod_objcg_state()
calls after initial boot led to an actual call to __mod_objcg_state().
When doing a parallel kernel build, the figure was about 17% (24329265
out of 142512465). So caching the vmstat data reduces the number of calls
to __mod_objcg_state() by more than 80%.
Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 61 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc9032f28f2e..693453f95d99 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+ struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+ int vmstat_idx;
+ int vmstat_bytes;
#endif
struct work_struct work;
@@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
css_put(&memcg->css);
}
-void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
- enum node_stat_item idx, int nr)
+static inline void __mod_objcg_state(struct obj_cgroup *objcg,
+ struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
{
struct mem_cgroup *memcg;
struct lruvec *lruvec = NULL;
@@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
- mod_memcg_lruvec_state(lruvec, idx, nr);
+ __mod_memcg_lruvec_state(lruvec, idx, nr);
rcu_read_unlock();
}
+void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
+ enum node_stat_item idx, int nr)
+{
+ struct memcg_stock_pcp *stock;
+ unsigned long flags;
+
+ local_irq_save(flags);
+ stock = this_cpu_ptr(&memcg_stock);
+
+ /*
+ * Save vmstat data in stock and skip vmstat array update unless
+ * accumulating over a page of vmstat data or when pgdat or idx
+ * changes.
+ */
+ if (stock->cached_objcg != objcg) {
+ /* Output the current data as is */
+ } else if (!stock->vmstat_bytes) {
+ /* Save the current data */
+ stock->vmstat_bytes = nr;
+ stock->vmstat_idx = idx;
+ stock->cached_pgdat = pgdat;
+ nr = 0;
+ } else if ((stock->cached_pgdat != pgdat) ||
+ (stock->vmstat_idx != idx)) {
+ /* Output the cached data & save the current data */
+ swap(nr, stock->vmstat_bytes);
+ swap(idx, stock->vmstat_idx);
+ swap(pgdat, stock->cached_pgdat);
+ } else {
+ stock->vmstat_bytes += nr;
+ if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
+ nr = stock->vmstat_bytes;
+ stock->vmstat_bytes = 0;
+ } else {
+ nr = 0;
+ }
+ }
+ if (nr)
+ __mod_objcg_state(objcg, pgdat, idx, nr);
+
+ local_irq_restore(flags);
+}
+
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
{
struct memcg_stock_pcp *stock;
@@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
stock->nr_bytes = 0;
}
+ /*
+ * Flush the vmstat data in current stock
+ */
+ if (stock->vmstat_bytes) {
+ __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
+ stock->vmstat_bytes);
+ stock->cached_pgdat = NULL;
+ stock->vmstat_bytes = 0;
+ stock->vmstat_idx = 0;
+ }
+
obj_cgroup_put(old);
stock->cached_objcg = NULL;
}
--
2.18.1
There are two issues with the current refill_obj_stock() code. First of
all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
atomically flush out the remaining bytes to obj_cgroup, clear cached_objcg
and do an obj_cgroup_put(). It is likely that the same obj_cgroup will
be used again, which leads to another call to drain_obj_stock() and
obj_cgroup_get() as well as atomically retrieving the available bytes
from obj_cgroup. That is costly. Instead, we should just uncharge the
excess pages, reduce the stock bytes and be done with it. The
drain_obj_stock() function should only be called when obj_cgroup changes.
Secondly, when charging an object whose size is not less than a page in
obj_cgroup_charge(), it is possible that the remaining bytes to be
refilled to the stock will overflow a page and cause refill_obj_stock()
to uncharge 1 page. To avoid the additional uncharge in this case, a
new overfill flag is added to refill_obj_stock() which will be set when
it is called from obj_cgroup_charge().
Signed-off-by: Waiman Long <[email protected]>
---
mm/memcontrol.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6dd18f6d8a8..d13961352eef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
return false;
}
-static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
+static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
+ bool overfill)
{
unsigned long flags;
struct obj_stock *stock = get_obj_stock(&flags);
+ unsigned int nr_pages = 0;
if (stock->cached_objcg != objcg) { /* reset if necessary */
- drain_obj_stock(stock);
+ if (stock->cached_objcg)
+ drain_obj_stock(stock);
obj_cgroup_get(objcg);
stock->cached_objcg = objcg;
stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
}
stock->nr_bytes += nr_bytes;
- if (stock->nr_bytes > PAGE_SIZE)
- drain_obj_stock(stock);
+ if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
+ nr_pages = stock->nr_bytes >> PAGE_SHIFT;
+ stock->nr_bytes &= (PAGE_SIZE - 1);
+ }
put_obj_stock(flags);
+
+ if (nr_pages) {
+ rcu_read_lock();
+ __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
+ rcu_read_unlock();
+ }
}
int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
if (!ret && nr_bytes)
- refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
+ refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
css_put(&memcg->css);
return ret;
@@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
{
- refill_obj_stock(objcg, size);
+ refill_obj_stock(objcg, size, false);
}
#endif /* CONFIG_MEMCG_KMEM */
--
2.18.1
Currently, the object stock structure caches either reclaimable vmstat
bytes or unreclaimable vmstat bytes, but not both. The hit rate can be
improved if both types of vmstat data can be cached, especially on a
single-node system.
This patch supports the caching of both types of vmstat data, though
at the expense of a slightly increased complexity in the caching code.
For large objects (>= PAGE_SIZE), the vmstat array is updated directly
without going through the stock caching step.
On a 2-socket Cascade Lake server with instrumentation enabled, the
miss rates are shown in the table below.
Initial bootup:
Kernel         __mod_objcg_state    mod_objcg_state    %age
------         -----------------    ---------------    ----
Before patch         634400             3243830        19.6%
After patch          419810             3182424        13.2%
Parallel kernel build:
Kernel         __mod_objcg_state    mod_objcg_state    %age
------         -----------------    ---------------    ----
Before patch       24329265           142512465        17.1%
After patch        24051721           142445825        16.9%
There was a decrease in the miss rate after initial system bootup.
However, the miss rate for the parallel kernel build remained about the
same, probably because most of the touched kmemcache objects were
reclaimable inodes and dentries.
Signed-off-by: Waiman Long <[email protected]>
---
mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
1 file changed, 51 insertions(+), 28 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c13502eab282..a6dd18f6d8a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2212,8 +2212,8 @@ struct obj_stock {
struct obj_cgroup *cached_objcg;
struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
- int vmstat_idx;
- int vmstat_bytes;
+ int reclaimable_bytes; /* NR_SLAB_RECLAIMABLE_B */
+ int unreclaimable_bytes; /* NR_SLAB_UNRECLAIMABLE_B */
#else
int dummy[0];
#endif
@@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
enum node_stat_item idx, int nr)
{
unsigned long flags;
- struct obj_stock *stock = get_obj_stock(&flags);
+ struct obj_stock *stock;
+ int *bytes, *alt_bytes, alt_idx;
+
+ /*
+ * Directly update vmstat array for big object.
+ */
+ if (unlikely(abs(nr) >= PAGE_SIZE))
+ goto update_vmstat;
+
+ stock = get_obj_stock(&flags);
+ if (idx == NR_SLAB_RECLAIMABLE_B) {
+ bytes = &stock->reclaimable_bytes;
+ alt_bytes = &stock->unreclaimable_bytes;
+ alt_idx = NR_SLAB_UNRECLAIMABLE_B;
+ } else {
+ bytes = &stock->unreclaimable_bytes;
+ alt_bytes = &stock->reclaimable_bytes;
+ alt_idx = NR_SLAB_RECLAIMABLE_B;
+ }
/*
- * Save vmstat data in stock and skip vmstat array update unless
- * accumulating over a page of vmstat data or when pgdat or idx
+ * Try to save vmstat data in stock and skip vmstat array update
+ * unless accumulating over a page of vmstat data or when pgdat
* changes.
*/
if (stock->cached_objcg != objcg) {
/* Output the current data as is */
- } else if (!stock->vmstat_bytes) {
- /* Save the current data */
- stock->vmstat_bytes = nr;
- stock->vmstat_idx = idx;
- stock->cached_pgdat = pgdat;
- nr = 0;
- } else if ((stock->cached_pgdat != pgdat) ||
- (stock->vmstat_idx != idx)) {
- /* Output the cached data & save the current data */
- swap(nr, stock->vmstat_bytes);
- swap(idx, stock->vmstat_idx);
+ } else if (stock->cached_pgdat != pgdat) {
+ /* Save the current data and output cached data, if any */
+ swap(nr, *bytes);
swap(pgdat, stock->cached_pgdat);
+ if (*alt_bytes) {
+ __mod_objcg_state(objcg, pgdat, alt_idx,
+ *alt_bytes);
+ *alt_bytes = 0;
+ }
} else {
- stock->vmstat_bytes += nr;
- if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
- nr = stock->vmstat_bytes;
- stock->vmstat_bytes = 0;
+ *bytes += nr;
+ if (abs(*bytes) > PAGE_SIZE) {
+ nr = *bytes;
+ *bytes = 0;
} else {
nr = 0;
}
}
- if (nr)
- __mod_objcg_state(objcg, pgdat, idx, nr);
-
put_obj_stock(flags);
+ if (!nr)
+ return;
+update_vmstat:
+ __mod_objcg_state(objcg, pgdat, idx, nr);
}
static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
@@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
/*
* Flush the vmstat data in current stock
*/
- if (stock->vmstat_bytes) {
- __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
- stock->vmstat_bytes);
+ if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
+ int bytes;
+
+ if ((bytes = stock->reclaimable_bytes))
+ __mod_objcg_state(old, stock->cached_pgdat,
+ NR_SLAB_RECLAIMABLE_B, bytes);
+ if ((bytes = stock->unreclaimable_bytes))
+ __mod_objcg_state(old, stock->cached_pgdat,
+ NR_SLAB_UNRECLAIMABLE_B, bytes);
+
stock->cached_pgdat = NULL;
- stock->vmstat_bytes = 0;
- stock->vmstat_idx = 0;
+ stock->reclaimable_bytes = 0;
+ stock->unreclaimable_bytes = 0;
}
obj_cgroup_put(old);
--
2.18.1
On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <[email protected]> wrote:
>
> There are two issues with the current refill_obj_stock() code. First of
> all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
> atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
> and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
> be used again which leads to another call to drain_obj_stock() and
> obj_cgroup_get() as well as atomically retrieve the available byte from
> obj_cgroup. That is costly. Instead, we should just uncharge the excess
> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> function should only be called when obj_cgroup changes.
>
> Secondly, when charging an object of size not less than a page in
> obj_cgroup_charge(), it is possible that the remaining bytes to be
> refilled to the stock will overflow a page and cause refill_obj_stock()
> to uncharge 1 page. To avoid the additional uncharge in this case,
> a new overfill flag is added to refill_obj_stock() which will be set
> when called from obj_cgroup_charge().
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> mm/memcontrol.c | 23 +++++++++++++++++------
> 1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6dd18f6d8a8..d13961352eef 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> return false;
> }
>
> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> + bool overfill)
> {
> unsigned long flags;
> struct obj_stock *stock = get_obj_stock(&flags);
> + unsigned int nr_pages = 0;
>
> if (stock->cached_objcg != objcg) { /* reset if necessary */
> - drain_obj_stock(stock);
> + if (stock->cached_objcg)
> + drain_obj_stock(stock);
> obj_cgroup_get(objcg);
> stock->cached_objcg = objcg;
> stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> }
> stock->nr_bytes += nr_bytes;
>
> - if (stock->nr_bytes > PAGE_SIZE)
> - drain_obj_stock(stock);
> + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> + nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> + stock->nr_bytes &= (PAGE_SIZE - 1);
> + }
>
> put_obj_stock(flags);
> +
> + if (nr_pages) {
> + rcu_read_lock();
> + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> + rcu_read_unlock();
> + }
It is not safe to call __memcg_kmem_uncharge() under the rcu lock
and without holding a reference to the memcg. More details can be
found at the following link.
https://lore.kernel.org/linux-mm/[email protected]/
In the above patchset, we introduce obj_cgroup_uncharge_pages() to
uncharge some pages from an object cgroup. You can use this safe
API.
Thanks.
> }
>
> int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> @@ -3410,7 +3421,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
> ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> if (!ret && nr_bytes)
> - refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes, true);
>
> css_put(&memcg->css);
> return ret;
> @@ -3418,7 +3429,7 @@ int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
>
> void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
> {
> - refill_obj_stock(objcg, size);
> + refill_obj_stock(objcg, size, false);
> }
>
> #endif /* CONFIG_MEMCG_KMEM */
> --
> 2.18.1
>
On Sun, Apr 18, 2021 at 11:07 PM Muchun Song <[email protected]> wrote:
>
> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <[email protected]> wrote:
> >
> > There are two issues with the current refill_obj_stock() code. First of
> > all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
> > atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
> > and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
> > be used again which leads to another call to drain_obj_stock() and
> > obj_cgroup_get() as well as atomically retrieve the available byte from
> > obj_cgroup. That is costly. Instead, we should just uncharge the excess
> > pages, reduce the stock bytes and be done with it. The drain_obj_stock()
> > function should only be called when obj_cgroup changes.
> >
> > Secondly, when charging an object of size not less than a page in
> > obj_cgroup_charge(), it is possible that the remaining bytes to be
> > refilled to the stock will overflow a page and cause refill_obj_stock()
> > to uncharge 1 page. To avoid the additional uncharge in this case,
> > a new overfill flag is added to refill_obj_stock() which will be set
> > when called from obj_cgroup_charge().
> >
> > Signed-off-by: Waiman Long <[email protected]>
> > ---
> > mm/memcontrol.c | 23 +++++++++++++++++------
> > 1 file changed, 17 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a6dd18f6d8a8..d13961352eef 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
> > return false;
> > }
> >
> > -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> > +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > + bool overfill)
> > {
> > unsigned long flags;
> > struct obj_stock *stock = get_obj_stock(&flags);
> > + unsigned int nr_pages = 0;
> >
> > if (stock->cached_objcg != objcg) { /* reset if necessary */
> > - drain_obj_stock(stock);
> > + if (stock->cached_objcg)
> > + drain_obj_stock(stock);
> > obj_cgroup_get(objcg);
> > stock->cached_objcg = objcg;
> > stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> > }
> > stock->nr_bytes += nr_bytes;
> >
> > - if (stock->nr_bytes > PAGE_SIZE)
> > - drain_obj_stock(stock);
> > + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
> > + nr_pages = stock->nr_bytes >> PAGE_SHIFT;
> > + stock->nr_bytes &= (PAGE_SIZE - 1);
> > + }
> >
> > put_obj_stock(flags);
> > +
> > + if (nr_pages) {
> > + rcu_read_lock();
> > + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
> > + rcu_read_unlock();
> > + }
>
> It is not safe to call __memcg_kmem_uncharge() under rcu lock
> and without holding a reference to memcg. More details can refer
> to the following link.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> In the above patchset, we introduce obj_cgroup_uncharge_pages to
> uncharge some pages from object cgroup. You can use this safe
> API.
>
I would recommend just rebasing the patch series over the latest mm tree.
On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> mm/memcontrol.c | 13 +++++++++++++
> mm/slab.h | 16 ++--------------
> 2 files changed, 15 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e064ac0d850a..dc9032f28f2e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> css_put(&memcg->css);
> }
>
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> + enum node_stat_item idx, int nr)
> +{
> + struct mem_cgroup *memcg;
> + struct lruvec *lruvec = NULL;
> +
> + rcu_read_lock();
> + memcg = obj_cgroup_memcg(objcg);
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + mod_memcg_lruvec_state(lruvec, idx, nr);
> + rcu_read_unlock();
> +}
It would be more naturally placed next to the others, e.g. below
__mod_lruvec_kmem_state().
But no deal breaker if there isn't another revision.
Acked-by: Johannes Weiner <[email protected]>
On 4/19/21 11:14 AM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
>> so that further optimization can be done to it in later patches without
>> exposing unnecessary details to other mm components.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> mm/memcontrol.c | 13 +++++++++++++
>> mm/slab.h | 16 ++--------------
>> 2 files changed, 15 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e064ac0d850a..dc9032f28f2e 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>> css_put(&memcg->css);
>> }
>>
>> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> + enum node_stat_item idx, int nr)
>> +{
>> + struct mem_cgroup *memcg;
>> + struct lruvec *lruvec = NULL;
>> +
>> + rcu_read_lock();
>> + memcg = obj_cgroup_memcg(objcg);
>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> + mod_memcg_lruvec_state(lruvec, idx, nr);
>> + rcu_read_unlock();
>> +}
> It would be more naturally placed next to the others, e.g. below
> __mod_lruvec_kmem_state().
>
> But no deal breaker if there isn't another revision.
>
> Acked-by: Johannes Weiner <[email protected]>
>
Yes, there will be another revision rebasing the patch series on
linux-next. I will move the function then.
Cheers,
Longman
On 4/19/21 2:06 AM, Muchun Song wrote:
> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <[email protected]> wrote:
>> There are two issues with the current refill_obj_stock() code. First of
>> all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
>> atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
>> and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
>> be used again which leads to another call to drain_obj_stock() and
>> obj_cgroup_get() as well as atomically retrieve the available byte from
>> obj_cgroup. That is costly. Instead, we should just uncharge the excess
>> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
>> function should only be called when obj_cgroup changes.
>>
>> Secondly, when charging an object of size not less than a page in
>> obj_cgroup_charge(), it is possible that the remaining bytes to be
>> refilled to the stock will overflow a page and cause refill_obj_stock()
>> to uncharge 1 page. To avoid the additional uncharge in this case,
>> a new overfill flag is added to refill_obj_stock() which will be set
>> when called from obj_cgroup_charge().
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> mm/memcontrol.c | 23 +++++++++++++++++------
>> 1 file changed, 17 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a6dd18f6d8a8..d13961352eef 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>> return false;
>> }
>>
>> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>> + bool overfill)
>> {
>> unsigned long flags;
>> struct obj_stock *stock = get_obj_stock(&flags);
>> + unsigned int nr_pages = 0;
>>
>> if (stock->cached_objcg != objcg) { /* reset if necessary */
>> - drain_obj_stock(stock);
>> + if (stock->cached_objcg)
>> + drain_obj_stock(stock);
>> obj_cgroup_get(objcg);
>> stock->cached_objcg = objcg;
>> stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>> }
>> stock->nr_bytes += nr_bytes;
>>
>> - if (stock->nr_bytes > PAGE_SIZE)
>> - drain_obj_stock(stock);
>> + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
>> + nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>> + stock->nr_bytes &= (PAGE_SIZE - 1);
>> + }
>>
>> put_obj_stock(flags);
>> +
>> + if (nr_pages) {
>> + rcu_read_lock();
>> + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
>> + rcu_read_unlock();
>> + }
> It is not safe to call __memcg_kmem_uncharge() under rcu lock
> and without holding a reference to memcg. More details can refer
> to the following link.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> In the above patchset, we introduce obj_cgroup_uncharge_pages to
> uncharge some pages from object cgroup. You can use this safe
> API.
Thanks for the comment. Will update my patch to call
obj_cgroup_uncharge_pages().
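For illustration, assuming the obj_cgroup_uncharge_pages(objcg, nr_pages)
helper from the linked series, the tail of refill_obj_stock() would then
likely end up looking something like:
        put_obj_stock(flags);
        if (nr_pages)
                obj_cgroup_uncharge_pages(objcg, nr_pages);
i.e. without the open-coded rcu_read_lock()/__memcg_kmem_uncharge() sequence.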
Cheers,
Longman
On 4/19/21 11:21 AM, Waiman Long wrote:
> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>> The mod_objcg_state() function is moved from mm/slab.h to
>>> mm/memcontrol.c
>>> so that further optimization can be done to it in later patches without
>>> exposing unnecessary details to other mm components.
>>>
>>> Signed-off-by: Waiman Long <[email protected]>
>>> ---
>>> mm/memcontrol.c | 13 +++++++++++++
>>> mm/slab.h | 16 ++--------------
>>> 2 files changed, 15 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index e064ac0d850a..dc9032f28f2e 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct page
>>> *page, int order)
>>> css_put(&memcg->css);
>>> }
>>> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data
>>> *pgdat,
>>> + enum node_stat_item idx, int nr)
>>> +{
>>> + struct mem_cgroup *memcg;
>>> + struct lruvec *lruvec = NULL;
>>> +
>>> + rcu_read_lock();
>>> + memcg = obj_cgroup_memcg(objcg);
>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>> + mod_memcg_lruvec_state(lruvec, idx, nr);
>>> + rcu_read_unlock();
>>> +}
>> It would be more naturally placed next to the others, e.g. below
>> __mod_lruvec_kmem_state().
>>
>> But no deal breaker if there isn't another revision.
>>
>> Acked-by: Johannes Weiner <[email protected]>
>>
> Yes, there will be another revision by rebasing patch series on the
> linux-next. I will move the function then.
OK, it turns out that mod_objcg_state() is only defined if
CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block
with the other obj_stock functions. I think I will keep it there.
Thanks,
Longman
On 4/19/21 1:13 PM, Johannes Weiner wrote:
> On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
>> On 4/19/21 11:21 AM, Waiman Long wrote:
>>> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>>>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>>>> The mod_objcg_state() function is moved from mm/slab.h to
>>>>> mm/memcontrol.c
>>>>> so that further optimization can be done to it in later patches without
>>>>> exposing unnecessary details to other mm components.
>>>>>
>>>>> Signed-off-by: Waiman Long <[email protected]>
>>>>> ---
>>>>> mm/memcontrol.c | 13 +++++++++++++
>>>>> mm/slab.h | 16 ++--------------
>>>>> 2 files changed, 15 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>> index e064ac0d850a..dc9032f28f2e 100644
>>>>> --- a/mm/memcontrol.c
>>>>> +++ b/mm/memcontrol.c
>>>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
>>>>> page *page, int order)
>>>>> css_put(&memcg->css);
>>>>> }
>>>>> +void mod_objcg_state(struct obj_cgroup *objcg, struct
>>>>> pglist_data *pgdat,
>>>>> + enum node_stat_item idx, int nr)
>>>>> +{
>>>>> + struct mem_cgroup *memcg;
>>>>> + struct lruvec *lruvec = NULL;
>>>>> +
>>>>> + rcu_read_lock();
>>>>> + memcg = obj_cgroup_memcg(objcg);
>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>> + mod_memcg_lruvec_state(lruvec, idx, nr);
>>>>> + rcu_read_unlock();
>>>>> +}
>>>> It would be more naturally placed next to the others, e.g. below
>>>> __mod_lruvec_kmem_state().
>>>>
>>>> But no deal breaker if there isn't another revision.
>>>>
>>>> Acked-by: Johannes Weiner <[email protected]>
>>>>
>>> Yes, there will be another revision by rebasing patch series on the
>>> linux-next. I will move the function then.
>> OK, it turns out that mod_objcg_state() is only defined if
>> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
>> the other obj_stock functions. I think I will keep it there.
> The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
> configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
> even that doesn't make sense because while slob doesn't support slab
> object tracking, we can still do all the other stuff we do under
> KMEM. I have a patch in the works to remove the symbol and ifdefs.
>
> With that in mind, it's better to group the functions based on what
> they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
> remove another ifdef later than it is to reorder the functions.
>
OK, I will make the move then. Thanks for the explanation.
Cheers,
Longman
On 4/19/21 1:19 PM, Waiman Long wrote:
> On 4/19/21 1:13 PM, Johannes Weiner wrote:
>> On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
>>> On 4/19/21 11:21 AM, Waiman Long wrote:
>>>> On 4/19/21 11:14 AM, Johannes Weiner wrote:
>>>>> On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
>>>>>> The mod_objcg_state() function is moved from mm/slab.h to
>>>>>> mm/memcontrol.c
>>>>>> so that further optimization can be done to it in later patches
>>>>>> without
>>>>>> exposing unnecessary details to other mm components.
>>>>>>
>>>>>> Signed-off-by: Waiman Long <[email protected]>
>>>>>> ---
>>>>>> mm/memcontrol.c | 13 +++++++++++++
>>>>>> mm/slab.h | 16 ++--------------
>>>>>> 2 files changed, 15 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>>>> index e064ac0d850a..dc9032f28f2e 100644
>>>>>> --- a/mm/memcontrol.c
>>>>>> +++ b/mm/memcontrol.c
>>>>>> @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
>>>>>> page *page, int order)
>>>>>> css_put(&memcg->css);
>>>>>> }
>>>>>> +void mod_objcg_state(struct obj_cgroup *objcg, struct
>>>>>> pglist_data *pgdat,
>>>>>> + enum node_stat_item idx, int nr)
>>>>>> +{
>>>>>> + struct mem_cgroup *memcg;
>>>>>> + struct lruvec *lruvec = NULL;
>>>>>> +
>>>>>> + rcu_read_lock();
>>>>>> + memcg = obj_cgroup_memcg(objcg);
>>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>> + mod_memcg_lruvec_state(lruvec, idx, nr);
>>>>>> + rcu_read_unlock();
>>>>>> +}
>>>>> It would be more naturally placed next to the others, e.g. below
>>>>> __mod_lruvec_kmem_state().
>>>>>
>>>>> But no deal breaker if there isn't another revision.
>>>>>
>>>>> Acked-by: Johannes Weiner <[email protected]>
>>>>>
>>>> Yes, there will be another revision by rebasing patch series on the
>>>> linux-next. I will move the function then.
>>> OK, it turns out that mod_objcg_state() is only defined if
>>> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM
>>> block with
>>> the other obj_stock functions. I think I will keep it there.
>> The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
>> configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
>> even that doesn't make sense because while slob doesn't support slab
>> object tracking, we can still do all the other stuff we do under
>> KMEM. I have a patch in the works to remove the symbol and ifdefs.
>>
>> With that in mind, it's better to group the functions based on what
>> they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
>> remove another ifdef later than it is to reorder the functions.
>>
> OK, I will make the move then. Thanks for the explanation.
BTW, have you ever thought of moving the cgroup-v1 specific functions
out into a separate memcontrol-v1.c file just like
kernel/cgroup/cgroup-v1.c?
I thought of that before, but memcontrol.c is a frequently changed file
and so a bit hard to do.
Cheers,
Longman
On 4/19/21 11:00 AM, Shakeel Butt wrote:
> On Sun, Apr 18, 2021 at 11:07 PM Muchun Song <[email protected]> wrote:
>> On Mon, Apr 19, 2021 at 8:01 AM Waiman Long <[email protected]> wrote:
>>> There are two issues with the current refill_obj_stock() code. First of
>>> all, when nr_bytes reaches over PAGE_SIZE, it calls drain_obj_stock() to
>>> atomically flush out remaining bytes to obj_cgroup, clear cached_objcg
>>> and do a obj_cgroup_put(). It is likely that the same obj_cgroup will
>>> be used again which leads to another call to drain_obj_stock() and
>>> obj_cgroup_get() as well as atomically retrieve the available byte from
>>> obj_cgroup. That is costly. Instead, we should just uncharge the excess
>>> pages, reduce the stock bytes and be done with it. The drain_obj_stock()
>>> function should only be called when obj_cgroup changes.
>>>
>>> Secondly, when charging an object of size not less than a page in
>>> obj_cgroup_charge(), it is possible that the remaining bytes to be
>>> refilled to the stock will overflow a page and cause refill_obj_stock()
>>> to uncharge 1 page. To avoid the additional uncharge in this case,
>>> a new overfill flag is added to refill_obj_stock() which will be set
>>> when called from obj_cgroup_charge().
>>>
>>> Signed-off-by: Waiman Long <[email protected]>
>>> ---
>>> mm/memcontrol.c | 23 +++++++++++++++++------
>>> 1 file changed, 17 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index a6dd18f6d8a8..d13961352eef 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3357,23 +3357,34 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
>>> return false;
>>> }
>>>
>>> -static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>>> +static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>>> + bool overfill)
>>> {
>>> unsigned long flags;
>>> struct obj_stock *stock = get_obj_stock(&flags);
>>> + unsigned int nr_pages = 0;
>>>
>>> if (stock->cached_objcg != objcg) { /* reset if necessary */
>>> - drain_obj_stock(stock);
>>> + if (stock->cached_objcg)
>>> + drain_obj_stock(stock);
>>> obj_cgroup_get(objcg);
>>> stock->cached_objcg = objcg;
>>> stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
>>> }
>>> stock->nr_bytes += nr_bytes;
>>>
>>> - if (stock->nr_bytes > PAGE_SIZE)
>>> - drain_obj_stock(stock);
>>> + if (!overfill && (stock->nr_bytes > PAGE_SIZE)) {
>>> + nr_pages = stock->nr_bytes >> PAGE_SHIFT;
>>> + stock->nr_bytes &= (PAGE_SIZE - 1);
>>> + }
>>>
>>> put_obj_stock(flags);
>>> +
>>> + if (nr_pages) {
>>> + rcu_read_lock();
>>> + __memcg_kmem_uncharge(obj_cgroup_memcg(objcg), nr_pages);
>>> + rcu_read_unlock();
>>> + }
>> It is not safe to call __memcg_kmem_uncharge() under rcu lock
>> and without holding a reference to memcg. More details can refer
>> to the following link.
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> In the above patchset, we introduce obj_cgroup_uncharge_pages to
>> uncharge some pages from object cgroup. You can use this safe
>> API.
>>
> I would recommend just rebase the patch series over the latest mm tree.
>
I see, I will rebase it to the latest mm tree.
Thanks,
Longman
On Sun, Apr 18, 2021 at 5:00 PM Waiman Long <[email protected]> wrote:
>
> The mod_objcg_state() function is moved from mm/slab.h to mm/memcontrol.c
> so that further optimization can be done to it in later patches without
> exposing unnecessary details to other mm components.
>
> Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:
> Before the new slab memory controller with per object byte charging,
> charging and vmstat data update happen only when new slab pages are
> allocated or freed. Now they are done with every kmem_cache_alloc()
> and kmem_cache_free(). This causes additional overhead for workloads
> that generate a lot of alloc and free calls.
>
> The memcg_stock_pcp is used to cache byte charge for a specific
> obj_cgroup to reduce that overhead. To further reducing it, this patch
> makes the vmstat data cached in the memcg_stock_pcp structure as well
> until it accumulates a page size worth of update or when other cached
> data change. Caching the vmstat data in the per-cpu stock eliminates two
> writes to non-hot cachelines for memcg specific as well as memcg-lruvecs
> specific vmstat data by a write to a hot local stock cacheline.
>
> On a 2-socket Cascade Lake server with instrumentation enabled and this
> patch applied, it was found that about 20% (634400 out of 3243830)
> of the time when mod_objcg_state() is called leads to an actual call
> to __mod_objcg_state() after initial boot. When doing parallel kernel
> build, the figure was about 17% (24329265 out of 142512465). So caching
> the vmstat data reduces the number of calls to __mod_objcg_state()
> by more than 80%.
>
> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>
> ---
> mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 61 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index dc9032f28f2e..693453f95d99 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
>
> #ifdef CONFIG_MEMCG_KMEM
> struct obj_cgroup *cached_objcg;
> + struct pglist_data *cached_pgdat;
> unsigned int nr_bytes;
> + int vmstat_idx;
> + int vmstat_bytes;
> #endif
>
> struct work_struct work;
> @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
> css_put(&memcg->css);
> }
>
> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> - enum node_stat_item idx, int nr)
> +static inline void __mod_objcg_state(struct obj_cgroup *objcg,
> + struct pglist_data *pgdat,
> + enum node_stat_item idx, int nr)
This naming is dangerous, as the __mod_foo naming scheme we use
everywhere else suggests it's the same function as mod_foo() just with
preemption/irqs disabled.
> @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> rcu_read_lock();
> memcg = obj_cgroup_memcg(objcg);
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
> - mod_memcg_lruvec_state(lruvec, idx, nr);
> + __mod_memcg_lruvec_state(lruvec, idx, nr);
> rcu_read_unlock();
> }
>
> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> + enum node_stat_item idx, int nr)
> +{
> + struct memcg_stock_pcp *stock;
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + stock = this_cpu_ptr(&memcg_stock);
> +
> + /*
> + * Save vmstat data in stock and skip vmstat array update unless
> + * accumulating over a page of vmstat data or when pgdat or idx
> + * changes.
> + */
> + if (stock->cached_objcg != objcg) {
> + /* Output the current data as is */
When you get here with the wrong objcg and hit the cold path, it's
usually immediately followed by an uncharge -> refill_obj_stock() that
will then flush and reset cached_objcg.
Instead of doing two cold paths, why not flush the old objcg right
away and set the new so that refill_obj_stock() can use the fast path?
> + } else if (!stock->vmstat_bytes) {
> + /* Save the current data */
> + stock->vmstat_bytes = nr;
> + stock->vmstat_idx = idx;
> + stock->cached_pgdat = pgdat;
> + nr = 0;
> + } else if ((stock->cached_pgdat != pgdat) ||
> + (stock->vmstat_idx != idx)) {
> + /* Output the cached data & save the current data */
> + swap(nr, stock->vmstat_bytes);
> + swap(idx, stock->vmstat_idx);
> + swap(pgdat, stock->cached_pgdat);
Is this optimization worth doing?
You later split vmstat_bytes and idx doesn't change anymore.
How often does the pgdat change? This is a per-cpu cache after all,
and the numa node a given cpu allocates from tends to not change that
often. Even with interleaving mode, which I think is pretty rare, the
interleaving happens at the slab/page level, not the object level, and
the cache isn't bigger than a page anyway.
> + } else {
> + stock->vmstat_bytes += nr;
> + if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> + nr = stock->vmstat_bytes;
> + stock->vmstat_bytes = 0;
> + } else {
> + nr = 0;
> + }
...and this is the regular overflow handling done by the objcg and
memcg charge stock as well.
How about this?
        if (stock->cached_objcg != objcg ||
            stock->cached_pgdat != pgdat ||
            stock->vmstat_idx != idx) {
                drain_obj_stock(stock);
                obj_cgroup_get(objcg);
                stock->cached_objcg = objcg;
                stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
                stock->vmstat_idx = idx;
        }
        stock->vmstat_bytes += nr_bytes;
        if (abs(stock->vmstat_bytes) > PAGE_SIZE)
                drain_obj_stock(stock);
(Maybe we could be clever, here since the charge and stat caches are
the same size: don't flush an oversized charge cache from
refill_obj_stock in the charge path, but leave it to the
mod_objcg_state() that follows; likewise don't flush an undersized
vmstat stock from mod_objcg_state() in the uncharge path, but leave it
to the refill_obj_stock() that follows. Could get a bit complicated...)
> @@ -3213,6 +3260,17 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
> stock->nr_bytes = 0;
> }
>
> + /*
> + * Flush the vmstat data in current stock
> + */
> + if (stock->vmstat_bytes) {
> + __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> + stock->vmstat_bytes);
... then inline __mod_objcg_state() here into the only caller, and
there won't be any need to come up with a better name.
> + stock->cached_pgdat = NULL;
> + stock->vmstat_bytes = 0;
> + stock->vmstat_idx = 0;
> + }
> +
> obj_cgroup_put(old);
> stock->cached_objcg = NULL;
On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:
> Currently, the object stock structure caches either reclaimable vmstat
> bytes or unreclaimable vmstat bytes in its object stock structure. The
> hit rate can be improved if both types of vmstat data can be cached
> especially for single-node system.
>
> This patch supports the cacheing of both type of vmstat data, though
> at the expense of a slightly increased complexity in the caching code.
> For large object (>= PAGE_SIZE), vmstat array is done directly without
> going through the stock caching step.
>
> On a 2-socket Cascade Lake server with instrumentation enabled, the
> miss rates are shown in the table below.
>
> Initial bootup:
>
> Kernel __mod_objcg_state mod_objcg_state %age
> ------ ----------------- --------------- ----
> Before patch 634400 3243830 19.6%
> After patch 419810 3182424 13.2%
>
> Parallel kernel build:
>
> Kernel __mod_objcg_state mod_objcg_state %age
> ------ ----------------- --------------- ----
> Before patch 24329265 142512465 17.1%
> After patch 24051721 142445825 16.9%
>
> There was a decrease of miss rate after initial system bootup. However,
> the miss rate for parallel kernel build remained about the same probably
> because most of the touched kmemcache objects were reclaimable inodes
> and dentries.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
> 1 file changed, 51 insertions(+), 28 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c13502eab282..a6dd18f6d8a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2212,8 +2212,8 @@ struct obj_stock {
> struct obj_cgroup *cached_objcg;
> struct pglist_data *cached_pgdat;
> unsigned int nr_bytes;
> - int vmstat_idx;
> - int vmstat_bytes;
> + int reclaimable_bytes; /* NR_SLAB_RECLAIMABLE_B */
> + int unreclaimable_bytes; /* NR_SLAB_UNRECLAIMABLE_B */
How about
        int nr_slab_reclaimable_b;
        int nr_slab_unreclaimable_b;
so you don't need the comments?
> #else
> int dummy[0];
> #endif
> @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
> enum node_stat_item idx, int nr)
> {
> unsigned long flags;
> - struct obj_stock *stock = get_obj_stock(&flags);
> + struct obj_stock *stock;
> + int *bytes, *alt_bytes, alt_idx;
> +
> + /*
> + * Directly update vmstat array for big object.
> + */
> + if (unlikely(abs(nr) >= PAGE_SIZE))
> + goto update_vmstat;
This looks like an optimization independent of the vmstat item split?
> + stock = get_obj_stock(&flags);
> + if (idx == NR_SLAB_RECLAIMABLE_B) {
> + bytes = &stock->reclaimable_bytes;
> + alt_bytes = &stock->unreclaimable_bytes;
> + alt_idx = NR_SLAB_UNRECLAIMABLE_B;
> + } else {
> + bytes = &stock->unreclaimable_bytes;
> + alt_bytes = &stock->reclaimable_bytes;
> + alt_idx = NR_SLAB_RECLAIMABLE_B;
> + }
>
> /*
> - * Save vmstat data in stock and skip vmstat array update unless
> - * accumulating over a page of vmstat data or when pgdat or idx
> + * Try to save vmstat data in stock and skip vmstat array update
> + * unless accumulating over a page of vmstat data or when pgdat
> * changes.
> */
> if (stock->cached_objcg != objcg) {
> /* Output the current data as is */
> - } else if (!stock->vmstat_bytes) {
> - /* Save the current data */
> - stock->vmstat_bytes = nr;
> - stock->vmstat_idx = idx;
> - stock->cached_pgdat = pgdat;
> - nr = 0;
> - } else if ((stock->cached_pgdat != pgdat) ||
> - (stock->vmstat_idx != idx)) {
> - /* Output the cached data & save the current data */
> - swap(nr, stock->vmstat_bytes);
> - swap(idx, stock->vmstat_idx);
> + } else if (stock->cached_pgdat != pgdat) {
> + /* Save the current data and output cached data, if any */
> + swap(nr, *bytes);
> swap(pgdat, stock->cached_pgdat);
> + if (*alt_bytes) {
> + __mod_objcg_state(objcg, pgdat, alt_idx,
> + *alt_bytes);
> + *alt_bytes = 0;
> + }
As per the other email, I really don't think optimizing the pgdat
switch (in a percpu cache) is worth this level of complexity.
> } else {
> - stock->vmstat_bytes += nr;
> - if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
> - nr = stock->vmstat_bytes;
> - stock->vmstat_bytes = 0;
> + *bytes += nr;
> + if (abs(*bytes) > PAGE_SIZE) {
> + nr = *bytes;
> + *bytes = 0;
> } else {
> nr = 0;
> }
> }
> - if (nr)
> - __mod_objcg_state(objcg, pgdat, idx, nr);
> -
> put_obj_stock(flags);
> + if (!nr)
> + return;
> +update_vmstat:
> + __mod_objcg_state(objcg, pgdat, idx, nr);
> }
>
> static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
> @@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
> /*
> * Flush the vmstat data in current stock
> */
> - if (stock->vmstat_bytes) {
> - __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
> - stock->vmstat_bytes);
> + if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
> + int bytes;
> +
> + if ((bytes = stock->reclaimable_bytes))
> + __mod_objcg_state(old, stock->cached_pgdat,
> + NR_SLAB_RECLAIMABLE_B, bytes);
> + if ((bytes = stock->unreclaimable_bytes))
> + __mod_objcg_state(old, stock->cached_pgdat,
> + NR_SLAB_UNRECLAIMABLE_B, bytes);
The int bytes indirection isn't necessary. It's easier to read even
with the extra lines required to repeat the long stock member names,
because that is quite a common pattern (if (stuff) frob(stuff)).
__mod_objcg_state() also each time does rcu_read_lock() toggling and a
memcg lookup that could be batched, which I think is further proof
that it should just be inlined here.
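For instance, the flush in drain_obj_stock() could be open-coded along
these lines (just a sketch reusing the helpers already quoted above, with
the memcg/lruvec locals declared in the function):
        rcu_read_lock();
        memcg = obj_cgroup_memcg(old);
        lruvec = mem_cgroup_lruvec(memcg, stock->cached_pgdat);
        if (stock->reclaimable_bytes)
                __mod_memcg_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B,
                                         stock->reclaimable_bytes);
        if (stock->unreclaimable_bytes)
                __mod_memcg_lruvec_state(lruvec, NR_SLAB_UNRECLAIMABLE_B,
                                         stock->unreclaimable_bytes);
        rcu_read_unlock();
so the rcu section and the memcg -> lruvec lookup are shared by both
counters.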
On Mon, Apr 19, 2021 at 12:18:29PM -0400, Waiman Long wrote:
> On 4/19/21 11:21 AM, Waiman Long wrote:
> > On 4/19/21 11:14 AM, Johannes Weiner wrote:
> > > On Sun, Apr 18, 2021 at 08:00:28PM -0400, Waiman Long wrote:
> > > > The mod_objcg_state() function is moved from mm/slab.h to
> > > > mm/memcontrol.c
> > > > so that further optimization can be done to it in later patches without
> > > > exposing unnecessary details to other mm components.
> > > >
> > > > Signed-off-by: Waiman Long <[email protected]>
> > > > ---
> > > >   mm/memcontrol.c | 13 +++++++++++++
> > > >   mm/slab.h       | 16 ++--------------
> > > >   2 files changed, 15 insertions(+), 14 deletions(-)
> > > >
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index e064ac0d850a..dc9032f28f2e 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -3150,6 +3150,19 @@ void __memcg_kmem_uncharge_page(struct
> > > > page *page, int order)
> > > >       css_put(&memcg->css);
> > > >   }
> > > >   +void mod_objcg_state(struct obj_cgroup *objcg, struct
> > > > pglist_data *pgdat,
> > > > +             enum node_stat_item idx, int nr)
> > > > +{
> > > > +    struct mem_cgroup *memcg;
> > > > +    struct lruvec *lruvec = NULL;
> > > > +
> > > > +    rcu_read_lock();
> > > > +    memcg = obj_cgroup_memcg(objcg);
> > > > +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > > +    mod_memcg_lruvec_state(lruvec, idx, nr);
> > > > +    rcu_read_unlock();
> > > > +}
> > > It would be more naturally placed next to the others, e.g. below
> > > __mod_lruvec_kmem_state().
> > >
> > > But no deal breaker if there isn't another revision.
> > >
> > > Acked-by: Johannes Weiner <[email protected]>
> > >
> > Yes, there will be another revision by rebasing patch series on the
> > linux-next. I will move the function then.
>
> OK, it turns out that mod_objcg_state() is only defined if
> CONFIG_MEMCG_KMEM. That was why I put it in the CONFIG_MEMCG_KMEM block with
> the other obj_stock functions. I think I will keep it there.
The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
even that doesn't make sense because while slob doesn't support slab
object tracking, we can still do all the other stuff we do under
KMEM. I have a patch in the works to remove the symbol and ifdefs.
With that in mind, it's better to group the functions based on what
they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
remove another ifdef later than it is to reorder the functions.
On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
> On 4/19/21 1:19 PM, Waiman Long wrote:
> > On 4/19/21 1:13 PM, Johannes Weiner wrote:
> > > The CONFIG_MEMCG_KMEM has become sort of useless now. It used to be
> > > configurable, but now it just means CONFIG_MEMCG && !CONFIG_SLOB. And
> > > even that doesn't make sense because while slob doesn't support slab
> > > object tracking, we can still do all the other stuff we do under
> > > KMEM. I have a patch in the works to remove the symbol and ifdefs.
> > >
> > > With that in mind, it's better to group the functions based on what
> > > they do rather than based on CONFIG_MEMCG_KMEM. It's easier to just
> > > remove another ifdef later than it is to reorder the functions.
> > >
> > OK, I will make the move then. Thanks for the explanation.
Thanks!
> BTW, have you ever thought of moving the cgroup-v1 specific functions out
> into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?
>
> I thought of that before, but memcontrol.c is a frequently changed file and
> so a bit hard to do.
I haven't looked too deeply at it so far, but I think it would make
sense to try.
There are indeed many of the entry paths from the MM code that are
shared between cgroup1 and cgroup2, with smaller branches here and
there to adjust behavior. Those would throw conflicts, but those we
should probably keep in the main memcontrol.c for readability anyway.
But there is also plenty of code that is exclusively about cgroup1,
and which actually doesn't change much in a long time. Moving that
elsewhere shouldn't create difficult conflicts - maybe a few line
offset warnings or fuzz in the diff context of unrelated changes:
- the soft limit tree and soft limit reclaim
- the threshold and oom event notification stuff
- the charge moving code
- remaining v1 interface files, as well as their helper functions
From a quick scan, this adds up to ~2,500 lines of old code with no
actual dependencies from the common code or from v2, and which could
be moved out of the way without disrupting ongoing development much.
On 4/19/21 5:11 PM, Johannes Weiner wrote:
>
>> BTW, have you ever thought of moving the cgroup-v1 specific functions out
>> into a separate memcontrol-v1.c file just like kernel/cgroup/cgroup-v1.c?
>>
>> I thought of that before, but memcontrol.c is a frequently changed file and
>> so a bit hard to do.
> I haven't looked too deeply at it so far, but I think it would make
> sense to try.
>
> There are indeed many of the entry paths from the MM code that are
> shared between cgroup1 and cgroup2, with smaller branches here and
> there to adjust behavior. Those would throw conflicts, but those we
> should probably keep in the main memcontrol.c for readability anyway.
>
> But there is also plenty of code that is exclusively about cgroup1,
> and which actually doesn't change much in a long time. Moving that
> elsewhere shouldn't create difficult conflicts - maybe a few line
> offset warnings or fuzz in the diff context of unrelated changes:
>
> - the soft limit tree and soft limit reclaim
>
> - the threshold and oom event notification stuff
>
> - the charge moving code
>
> - remaining v1 interface files, as well as their helper functions
>
> From a quick scan, this adds up to ~2,500 lines of old code with no
> actual dependencies from the common code or from v2, and which could
> be moved out of the way without disrupting ongoing development much.
>
Right.
Currently memcontrol.c has over 7000 lines of code and keeps growing.
That makes it harder to read, navigate and update. If we can cut out
2000 lines or more from memcontrol.c, it will become much more manageable.
Cheers,
Longman
On 4/19/21 12:38 PM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:29PM -0400, Waiman Long wrote:
>> Before the new slab memory controller with per-object byte charging,
>> charging and vmstat data updates happened only when new slab pages
>> were allocated or freed. Now they are done with every kmem_cache_alloc()
>> and kmem_cache_free(). This causes additional overhead for workloads
>> that generate a lot of alloc and free calls.
>>
>> The memcg_stock_pcp is used to cache byte charges for a specific
>> obj_cgroup to reduce that overhead. To reduce it further, this patch
>> caches the vmstat data in the memcg_stock_pcp structure as well, until
>> a page size worth of updates has accumulated or other cached data
>> change. Caching the vmstat data in the per-cpu stock replaces two
>> writes to non-hot cachelines (the memcg-specific and the
>> memcg-lruvec-specific vmstat data) with a write to a hot local stock
>> cacheline.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled and this
>> patch applied, only about 20% (634400 out of 3243830) of the
>> mod_objcg_state() calls after initial boot led to an actual call to
>> __mod_objcg_state(). When doing a parallel kernel build, the figure
>> was about 17% (24329265 out of 142512465). So caching the vmstat data
>> reduces the number of calls to __mod_objcg_state() by more than 80%.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> Reviewed-by: Shakeel Butt <[email protected]>
>> ---
>> mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 61 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index dc9032f28f2e..693453f95d99 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2213,7 +2213,10 @@ struct memcg_stock_pcp {
>>
>> #ifdef CONFIG_MEMCG_KMEM
>> struct obj_cgroup *cached_objcg;
>> + struct pglist_data *cached_pgdat;
>> unsigned int nr_bytes;
>> + int vmstat_idx;
>> + int vmstat_bytes;
>> #endif
>>
>> struct work_struct work;
>> @@ -3150,8 +3153,9 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
>> css_put(&memcg->css);
>> }
>>
>> -void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> - enum node_stat_item idx, int nr)
>> +static inline void __mod_objcg_state(struct obj_cgroup *objcg,
>> + struct pglist_data *pgdat,
>> + enum node_stat_item idx, int nr)
> This naming is dangerous, as the __mod_foo naming scheme we use
> everywhere else suggests it's the same function as mod_foo() just with
> preemption/irqs disabled.
>
I will change its name to, say, mod_objcg_mlstate() to indicate that it
is something different. Actually, it is hard to come up with a good name
which is not too long.
>> @@ -3159,10 +3163,53 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> rcu_read_lock();
>> memcg = obj_cgroup_memcg(objcg);
>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> - mod_memcg_lruvec_state(lruvec, idx, nr);
>> + __mod_memcg_lruvec_state(lruvec, idx, nr);
>> rcu_read_unlock();
>> }
>>
>> +void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> + enum node_stat_item idx, int nr)
>> +{
>> + struct memcg_stock_pcp *stock;
>> + unsigned long flags;
>> +
>> + local_irq_save(flags);
>> + stock = this_cpu_ptr(&memcg_stock);
>> +
>> + /*
>> + * Save vmstat data in stock and skip vmstat array update unless
>> + * accumulating over a page of vmstat data or when pgdat or idx
>> + * changes.
>> + */
>> + if (stock->cached_objcg != objcg) {
>> + /* Output the current data as is */
> When you get here with the wrong objcg and hit the cold path, it's
> usually immediately followed by an uncharge -> refill_obj_stock() that
> will then flush and reset cached_objcg.
>
> Instead of doing two cold paths, why not flush the old objcg right
> away and set the new so that refill_obj_stock() can use the fast path?
That is a good idea. Will do that.
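Something like this at the top of mod_objcg_state(), before touching the
cached vmstat data (completely untested, just to show the idea):

	stock = this_cpu_ptr(&memcg_stock);
	if (stock->cached_objcg != objcg) {
		/*
		 * Flush the old objcg's cached data right away and switch
		 * the stock over, so that the uncharge that typically
		 * follows can take the refill_obj_stock() fast path.
		 */
		drain_obj_stock(stock);
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
	}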
>
>> + } else if (!stock->vmstat_bytes) {
>> + /* Save the current data */
>> + stock->vmstat_bytes = nr;
>> + stock->vmstat_idx = idx;
>> + stock->cached_pgdat = pgdat;
>> + nr = 0;
>> + } else if ((stock->cached_pgdat != pgdat) ||
>> + (stock->vmstat_idx != idx)) {
>> + /* Output the cached data & save the current data */
>> + swap(nr, stock->vmstat_bytes);
>> + swap(idx, stock->vmstat_idx);
>> + swap(pgdat, stock->cached_pgdat);
> Is this optimization worth doing?
>
> You later split vmstat_bytes and idx doesn't change anymore.
I am going to merge patch 2 and patch 4 to avoid the confusion.
>
> How often does the pgdat change? This is a per-cpu cache after all,
> and the numa node a given cpu allocates from tends to not change that
> often. Even with interleaving mode, which I think is pretty rare, the
> interleaving happens at the slab/page level, not the object level, and
> the cache isn't bigger than a page anyway.
The testing done on a 2-socket system indicated that pgdat changes
roughly 10-20% of the time. So it does happen, especially on the kfree()
path, I think. I have tried caching vmstat updates for the local node
only, but I got more misses with that. So I am just going to change
pgdat and flush out the existing data for now.
>
>> + } else {
>> + stock->vmstat_bytes += nr;
>> + if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
>> + nr = stock->vmstat_bytes;
>> + stock->vmstat_bytes = 0;
>> + } else {
>> + nr = 0;
>> + }
> ..and this is the regular overflow handling done by the objcg and
> memcg charge stock as well.
>
> How about this?
>
> if (stock->cached_objcg != objcg ||
> stock->cached_pgdat != pgdat ||
> stock->vmstat_idx != idx) {
> drain_obj_stock(stock);
> obj_cgroup_get(objcg);
> stock->cached_objcg = objcg;
> stock->nr_bytes = atomic_xchg(&objcg->nr_charged_bytes, 0);
> stock->vmstat_idx = idx;
> }
> stock->vmstat_bytes += nr_bytes;
>
> 	if (abs(stock->vmstat_bytes) > PAGE_SIZE)
> drain_obj_stock(stock);
>
> (Maybe we could be clever, here since the charge and stat caches are
> the same size: don't flush an oversized charge cache from
> refill_obj_stock in the charge path, but leave it to the
> mod_objcg_state() that follows; likewise don't flush an undersized
> vmstat stock from mod_objcg_state() in the uncharge path, but leave it
> to the refill_obj_stock() that follows. Could get a bit complicated...)
If you look at patch 5, I am trying to avoid doing drain_obj_stock()
unless the objcg changes. I am going to do the same here.
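IOW, when only pgdat or idx changes, just push out the cached vmstat
bytes instead of draining the whole stock. Roughly (untested sketch,
using the helper from this patch which will be renamed):

	stock = this_cpu_ptr(&memcg_stock);
	if (stock->cached_objcg != objcg) {
		/* objcg changed: drain everything and switch over */
		drain_obj_stock(stock);
		obj_cgroup_get(objcg);
		stock->cached_objcg = objcg;
		stock->cached_pgdat = pgdat;
		stock->vmstat_idx = idx;
	} else if (stock->cached_pgdat != pgdat || stock->vmstat_idx != idx) {
		/* only the cached vmstat data needs to be flushed */
		if (stock->vmstat_bytes)
			__mod_objcg_state(objcg, stock->cached_pgdat,
					  stock->vmstat_idx,
					  stock->vmstat_bytes);
		stock->vmstat_bytes = 0;
		stock->cached_pgdat = pgdat;
		stock->vmstat_idx = idx;
	}

	stock->vmstat_bytes += nr;
	if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
		__mod_objcg_state(objcg, pgdat, idx, stock->vmstat_bytes);
		stock->vmstat_bytes = 0;
	}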
Cheers,
Longman
On Mon 19-04-21 17:11:37, Johannes Weiner wrote:
> On Mon, Apr 19, 2021 at 01:26:29PM -0400, Waiman Long wrote:
[...]
> - the soft limit tree and soft limit reclaim
>
> - the threshold and oom event notification stuff
>
> - the charge moving code
>
> - remaining v1 interface files, as well as their helper functions
>
> From a quick scan, this adds up to ~2,500 lines of old code with no
> actual dependencies from the common code or from v2, and which could
> be moved out of the way without disrupting ongoing development much.
Moving those into its own file makes sense to me as well. If the code is
not conditional (e.g. like swap accounting and some others) then moving
it would make memecontrol.c easier to navigate through.
--
Michal Hocko
SUSE Labs
On 4/19/21 12:55 PM, Johannes Weiner wrote:
> On Sun, Apr 18, 2021 at 08:00:31PM -0400, Waiman Long wrote:
>> Currently, the object stock structure caches either reclaimable vmstat
>> bytes or unreclaimable vmstat bytes, but not both at the same time. The
>> hit rate can be improved if both types of vmstat data are cached,
>> especially on a single-node system.
>>
>> This patch supports the caching of both types of vmstat data, though
>> at the expense of slightly increased complexity in the caching code.
>> For large objects (>= PAGE_SIZE), the vmstat array is updated directly
>> without going through the stock caching step.
>>
>> On a 2-socket Cascade Lake server with instrumentation enabled, the
>> miss rates are shown in the table below.
>>
>> Initial bootup:
>>
>> Kernel __mod_objcg_state mod_objcg_state %age
>> ------ ----------------- --------------- ----
>> Before patch 634400 3243830 19.6%
>> After patch 419810 3182424 13.2%
>>
>> Parallel kernel build:
>>
>> Kernel __mod_objcg_state mod_objcg_state %age
>> ------ ----------------- --------------- ----
>> Before patch 24329265 142512465 17.1%
>> After patch 24051721 142445825 16.9%
>>
>> There was a decrease in the miss rate after initial system bootup.
>> However, the miss rate for the parallel kernel build remained about the
>> same, probably because most of the touched kmemcache objects were
>> reclaimable inodes and dentries.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> mm/memcontrol.c | 79 +++++++++++++++++++++++++++++++------------------
>> 1 file changed, 51 insertions(+), 28 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c13502eab282..a6dd18f6d8a8 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2212,8 +2212,8 @@ struct obj_stock {
>> struct obj_cgroup *cached_objcg;
>> struct pglist_data *cached_pgdat;
>> unsigned int nr_bytes;
>> - int vmstat_idx;
>> - int vmstat_bytes;
>> + int reclaimable_bytes; /* NR_SLAB_RECLAIMABLE_B */
>> + int unreclaimable_bytes; /* NR_SLAB_UNRECLAIMABLE_B */
> How about
>
> int nr_slab_reclaimable_b;
> int nr_slab_unreclaimable_b;
>
> so you don't need the comments?
Sure, will make the change.
>> #else
>> int dummy[0];
>> #endif
>> @@ -3217,40 +3217,56 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
>> enum node_stat_item idx, int nr)
>> {
>> unsigned long flags;
>> - struct obj_stock *stock = get_obj_stock(&flags);
>> + struct obj_stock *stock;
>> + int *bytes, *alt_bytes, alt_idx;
>> +
>> + /*
>> + * Directly update vmstat array for big object.
>> + */
>> + if (unlikely(abs(nr) >= PAGE_SIZE))
>> + goto update_vmstat;
> This looks like an optimization independent of the vmstat item split?
It may not be that helpful. I am going to take it out in the next version.
>
>> + stock = get_obj_stock(&flags);
>> + if (idx == NR_SLAB_RECLAIMABLE_B) {
>> + bytes = &stock->reclaimable_bytes;
>> + alt_bytes = &stock->unreclaimable_bytes;
>> + alt_idx = NR_SLAB_UNRECLAIMABLE_B;
>> + } else {
>> + bytes = &stock->unreclaimable_bytes;
>> + alt_bytes = &stock->reclaimable_bytes;
>> + alt_idx = NR_SLAB_RECLAIMABLE_B;
>> + }
>>
>> /*
>> - * Save vmstat data in stock and skip vmstat array update unless
>> - * accumulating over a page of vmstat data or when pgdat or idx
>> + * Try to save vmstat data in stock and skip vmstat array update
>> + * unless accumulating over a page of vmstat data or when pgdat
>> * changes.
>> */
>> if (stock->cached_objcg != objcg) {
>> /* Output the current data as is */
>> - } else if (!stock->vmstat_bytes) {
>> - /* Save the current data */
>> - stock->vmstat_bytes = nr;
>> - stock->vmstat_idx = idx;
>> - stock->cached_pgdat = pgdat;
>> - nr = 0;
>> - } else if ((stock->cached_pgdat != pgdat) ||
>> - (stock->vmstat_idx != idx)) {
>> - /* Output the cached data & save the current data */
>> - swap(nr, stock->vmstat_bytes);
>> - swap(idx, stock->vmstat_idx);
>> + } else if (stock->cached_pgdat != pgdat) {
>> + /* Save the current data and output cached data, if any */
>> + swap(nr, *bytes);
>> swap(pgdat, stock->cached_pgdat);
>> + if (*alt_bytes) {
>> + __mod_objcg_state(objcg, pgdat, alt_idx,
>> + *alt_bytes);
>> + *alt_bytes = 0;
>> + }
> As per the other email, I really don't think optimizing the pgdat
> switch (in a percpu cache) is worth this level of complexity.
I am going to simplify it in the next version.
>
>> } else {
>> - stock->vmstat_bytes += nr;
>> - if (abs(stock->vmstat_bytes) > PAGE_SIZE) {
>> - nr = stock->vmstat_bytes;
>> - stock->vmstat_bytes = 0;
>> + *bytes += nr;
>> + if (abs(*bytes) > PAGE_SIZE) {
>> + nr = *bytes;
>> + *bytes = 0;
>> } else {
>> nr = 0;
>> }
>> }
>> - if (nr)
>> - __mod_objcg_state(objcg, pgdat, idx, nr);
>> -
>> put_obj_stock(flags);
>> + if (!nr)
>> + return;
>> +update_vmstat:
>> + __mod_objcg_state(objcg, pgdat, idx, nr);
>> }
>>
>> static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
>> @@ -3303,12 +3319,19 @@ static void drain_obj_stock(struct obj_stock *stock)
>> /*
>> * Flush the vmstat data in current stock
>> */
>> - if (stock->vmstat_bytes) {
>> - __mod_objcg_state(old, stock->cached_pgdat, stock->vmstat_idx,
>> - stock->vmstat_bytes);
>> + if (stock->reclaimable_bytes || stock->unreclaimable_bytes) {
>> + int bytes;
>> +
>> + if ((bytes = stock->reclaimable_bytes))
>> + __mod_objcg_state(old, stock->cached_pgdat,
>> + NR_SLAB_RECLAIMABLE_B, bytes);
>> + if ((bytes = stock->unreclaimable_bytes))
>> + __mod_objcg_state(old, stock->cached_pgdat,
>> + NR_SLAB_UNRECLAIMABLE_B, bytes);
> The int bytes indirection isn't necessary. It's easier to read even
> with the extra lines required to repeat the long stock member names,
> because that is quite a common pattern (if (stuff) frob(stuff)).
OK, I will eliminate the bytes variable.
>
> __mod_objcg_state() also each time does rcu_read_lock() toggling and a
> memcg lookup that could be batched, which I think is further proof
> that it should just be inlined here.
>
I am also thinking that eliminating unnecessary
rcu_read_lock()/rcu_read_unlock() pairs may help performance a bit.
However, that will be done in another patch after I have done more
performance testing, so I am not going to bother with it in this series.
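For reference, the batched flush in drain_obj_stock() could eventually
look something like this (untested sketch, using the nr_slab_*_b field
names suggested above; "old" is the cached objcg being drained):

	if (stock->nr_slab_reclaimable_b || stock->nr_slab_unreclaimable_b) {
		struct mem_cgroup *memcg;
		struct lruvec *lruvec;

		/* One RCU section and one lruvec lookup for both items */
		rcu_read_lock();
		memcg = obj_cgroup_memcg(old);
		lruvec = mem_cgroup_lruvec(memcg, stock->cached_pgdat);
		if (stock->nr_slab_reclaimable_b) {
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_RECLAIMABLE_B,
						 stock->nr_slab_reclaimable_b);
			stock->nr_slab_reclaimable_b = 0;
		}
		if (stock->nr_slab_unreclaimable_b) {
			__mod_memcg_lruvec_state(lruvec, NR_SLAB_UNRECLAIMABLE_B,
						 stock->nr_slab_unreclaimable_b);
			stock->nr_slab_unreclaimable_b = 0;
		}
		rcu_read_unlock();
	}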
Cheers,
Longman