2024-04-09 11:09:01

by Jesper Dangaard Brouer

Subject: Re: Advice on cgroup rstat lock

Let's move this discussion upstream.

On 22/03/2024 19.32, Yosry Ahmed wrote:
> [..]
>>> There was a couple of series that made all calls to
>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>> problem that we used to face (except in our case, we saw hard lockups
>>> in extreme scenarios):
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> I've only done the 6.6 backport, and these were in 6.5/6.6.

Given that I have these in my 6.6 kernel, you are basically saying I
should be able to avoid the IRQ-disable for the lock, right?

My main problem with the global cgroup_rstat_lock [3] is that it
disables IRQs and (thereby also) BH/softirq (spin_lock_irq). This
causes production issues elsewhere, e.g. we are seeing network softirq
"not-able-to-run" latency issues (debugged via softirq_net_latency.bt [5]).

[3]
https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
[5]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt


>> And between 6.1 to 6.6 we did observe an improvement in this area.
>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan progress)
>>
>> I've had a chance to get running in prod for 6.6 backport.
>> As you can see in attached grafana heatmap pictures, we do observe an
>> improved/reduced softirq wait time.
>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
>> observed. As you can see, I still have other areas to improve/fix.
>
> I am not very familiar with such heatmaps, but I am glad there is an
> improvement with 6.6 and the backports. Let me know if there is
> anything I could do to help with your effort.

The heatmaps give me an overview, but I needed a debugging tool, so I
developed some bpftrace scripts [1][2] that I'm running in production
to measure how long we hold the cgroup rstat lock (results below).
Adding ACME and Daniel, as I hope there is an easier way to measure
lock hold time and contention. Notice the tricky release/yield in
cgroup_rstat_flush_locked() [4].
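
For reference, the "play nice" yield inside cgroup_rstat_flush_locked()
looks roughly like the sketch below (paraphrased from the v6.6/v6.9-rc
code, so treat details as approximate); it is this drop-and-retake of
the IRQ-disabling lock inside the per-CPU loop that makes the lock hold
time non-trivial to measure:

/* Paraphrased sketch of kernel/cgroup/rstat.c, not verbatim */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* ... pop and flush this CPU's list of updated cgroups ... */

		/* play nice and yield if necessary */
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
			spin_unlock_irq(&cgroup_rstat_lock);
			if (!cond_resched())
				cpu_relax();
			spin_lock_irq(&cgroup_rstat_lock);
		}
	}
}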

My production results on 6.6 with the backported patches (listed below
my signature) vs our normal 6.6 kernel, measured with script [2]. The
`@lock_time_hist_ns` histogram shows how long the lock was held with
IRQs disabled (taking into account that the lock can be released inside
the loop [4]).

Patched kernel:

21:49:02 time elapsed: 43200 sec
@lock_time_hist_ns:
[2K, 4K)              61 |                                                    |
[4K, 8K)             734 |                                                    |
[8K, 16K)         121500 |@@@@@@@@@@@@@@@@                                    |
[16K, 32K)        385714 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)        145600 |@@@@@@@@@@@@@@@@@@@                                 |
[64K, 128K)       156873 |@@@@@@@@@@@@@@@@@@@@@                               |
[128K, 256K)      261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[256K, 512K)      291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[512K, 1M)        101859 |@@@@@@@@@@@@@                                       |
[1M, 2M)           19866 |@@                                                  |
[2M, 4M)           10146 |@                                                   |
[4M, 8M)           30633 |@@@@                                                |
[8M, 16M)          40365 |@@@@@                                               |
[16M, 32M)         21650 |@@                                                  |
[32M, 64M)          5842 |                                                    |
[64M, 128M)            8 |                                                    |

And normal 6.6 kernel:

21:48:32 time elapsed: 43200 sec
@lock_time_hist_ns:
[1K, 2K)              25 |                                                    |
[2K, 4K)            1146 |                                                    |
[4K, 8K)           59397 |@@@@                                                |
[8K, 16K)         571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[16K, 32K)        542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[32K, 64K)        202810 |@@@@@@@@@@@@@                                       |
[64K, 128K)       134564 |@@@@@@@@@                                           |
[128K, 256K)       72870 |@@@@@                                               |
[256K, 512K)       56914 |@@@                                                 |
[512K, 1M)         83140 |@@@@@                                               |
[1M, 2M)          170514 |@@@@@@@@@@@                                         |
[2M, 4M)          396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[4M, 8M)          755537 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)         231222 |@@@@@@@@@@@@@@@                                     |
[16M, 32M)         76370 |@@@@@                                               |
[32M, 64M)          1043 |                                                    |
[64M, 128M)           12 |                                                    |


For the unpatched kernel, the 4 ms to 8 ms bucket holds more events
than any other bucket.
For the patched kernel, we clearly see a significant reduction of
events in the 4 ms to 64 ms area, but some events remain there. I'm
very happy to see that these patches improve the situation. But for
network processing I'm not happy to see events in the 16 ms to 128 ms
area. If we can just avoid disabling IRQs/softirq for the lock, I
would be happy.

How far can we go... could cgroup_rstat_lock be converted to a mutex?

--Jesper

[1]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency.bt
[2]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt
[3]
https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
[4]
https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L226
cgroup_rstat_flush_locked
[5]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt



Backported to 6.6

List of the **main** patches that address the issue - to backport for 6.6:
- 508bed884767 mm: memcg: change flush_next_time to flush_last_time
- v6.8-rc1~180^2~205
- e0bf1dc859fd mm: memcg: move vmstats structs definition above
flushing code
- v6.8-rc1~180^2~204
- 8d59d2214c23 mm: memcg: make stats flushing threshold per-memcg
- v6.8-rc1~180^2~203
- b00684722262 mm: workingset: move the stats flush into
workingset_test_recent()
- v6.8-rc1~180^2~202
- 7d7ef0a4686a mm: memcg: restore subtree stats flushing
- v6.8-rc1~180^2~201

And extra (thanks Longman)

- e76d28bdf9ba ("cgroup/rstat: Reduce cpu_lock hold time in
cgroup_rstat_flush_locked()")
- v6.8-rc1~182^2~8

And the list of patches that contain **fixes** for the backports above:
- 9cee7e8ef3e3 mm: memcg: optimize parent iteration in
memcg_rstat_updated()
- v6.8-rc4~3^2~12
- 13ef7424577f mm: memcg: don't periodically flush stats when memcg is
disabled
- v6.8-rc5-69-g13ef7424577f




2024-04-09 15:51:14

by Waiman Long

Subject: Re: Advice on cgroup rstat lock

On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
> Let move this discussion upstream.
>
> On 22/03/2024 19.32, Yosry Ahmed wrote:
>> [..]
>>>> There was a couple of series that made all calls to
>>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>>> problem that we used to face (except in our case, we saw hard lockups
>>>> in extreme scenarios):
>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>
>>> I've only done the 6.6 backport, and these were in 6.5/6.6.
>
> Given I have these in my 6.6 kernel. You are basically saying I should
> be able to avoid IRQ-disable for the lock, right?
>
> My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
> and (thereby also) BH/softirq (spin_lock_irq).  This cause production
> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
> latency issues (debug via softirq_net_latency.bt [5]).
>
>   [3]
> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
>   [5]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
>
>>> And between 6.1 to 6.6 we did observe an improvement in this area.
>>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
>>> progress)
>>>
>>> I've had a chance to get running in prod for 6.6 backport.
>>> As you can see in attached grafana heatmap pictures, we do observe an
>>> improved/reduced softirq wait time.
>>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
>>> observed.  As you can see, I still have other areas to improve/fix.
>>
>> I am not very familiar with such heatmaps, but I am glad there is an
>> improvement with 6.6 and the backports. Let me know if there is
>> anything I could do to help with your effort.
>
> The heatmaps give me an overview, but I needed a debugging tool, so I
> developed some bpftrace scripts [1][2] I'm running on production.
> To measure how long time we hold the cgroup rstat lock (results below).
> Adding ACME and Daniel as I hope there is an easier way to measure lock
> hold time and congestion. Notice tricky release/yield in
> cgroup_rstat_flush_locked[4].
>
> My production results on 6.6 with backported patches (below signature)
> vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
> shows how long time the lock+IRQs were disabled (taking into account it
> can be released in the loop [4]).
>
> Patched kernel:
>
> 21:49:02  time elapsed: 43200 sec
> @lock_time_hist_ns:
> [2K, 4K)              61 |      |
> [4K, 8K)             734 |      |
> [8K, 16K)         121500 |@@@@@@@@@@@@@@@@      |
> [16K, 32K)        385714
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K)        145600 |@@@@@@@@@@@@@@@@@@@      |
> [64K, 128K)       156873 |@@@@@@@@@@@@@@@@@@@@@      |
> [128K, 256K)      261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256K, 512K)      291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [512K, 1M)        101859 |@@@@@@@@@@@@@      |
> [1M, 2M)           19866 |@@      |
> [2M, 4M)           10146 |@      |
> [4M, 8M)           30633 |@@@@      |
> [8M, 16M)          40365 |@@@@@      |
> [16M, 32M)         21650 |@@      |
> [32M, 64M)          5842 |      |
> [64M, 128M)            8 |      |
>
> And normal 6.6 kernel:
>
> 21:48:32  time elapsed: 43200 sec
> @lock_time_hist_ns:
> [1K, 2K)              25 |      |
> [2K, 4K)            1146 |      |
> [4K, 8K)           59397 |@@@@      |
> [8K, 16K)         571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [16K, 32K)        542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [32K, 64K)        202810 |@@@@@@@@@@@@@      |
> [64K, 128K)       134564 |@@@@@@@@@      |
> [128K, 256K)       72870 |@@@@@      |
> [256K, 512K)       56914 |@@@      |
> [512K, 1M)         83140 |@@@@@      |
> [1M, 2M)          170514 |@@@@@@@@@@@      |
> [2M, 4M)          396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
> [4M, 8M)          755537
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8M, 16M)         231222 |@@@@@@@@@@@@@@@      |
> [16M, 32M)         76370 |@@@@@      |
> [32M, 64M)          1043 |      |
> [64M, 128M)           12 |      |
>
>
> For the unpatched kernel we see more events in 4ms to 8ms bucket than
> any other bucket.
> For patched kernel, we clearly see a significant reduction of events in
> the 4 ms to 64 ms area, but we still have some events in this area.  I'm
> very happy to see these patches improves the situation.  But for network
> processing I'm not happy to see events in area 16ms to 128ms area.  If
> we can just avoid disabling IRQs/softirq for the lock, I would be happy.
>
> How far can we go... could cgroup_rstat_lock be converted to a mutex?

The cgroup_rstat_lock was originally a mutex. It was converted to a
spinlock in commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
with a spinlock"). IRQs were disabled to allow calling from atomic
context. Since commit 0a2dc6ac3329 ("cgroup: remove
cgroup_rstat_flush_atomic()"), the rstat API is no longer called from
atomic context. Theoretically, we could change it back to a mutex or
stop disabling interrupts. That will require that the API cannot be
called from atomic context going forward.
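
A minimal sketch of what changing it back could look like (hypothetical
and untested; it assumes every remaining caller of cgroup_rstat_flush()
is in sleepable context, and the spin_needbreak()-based yield inside
cgroup_rstat_flush_locked() would need a different contention check):

/* Hypothetical sketch only: cgroup_rstat_lock as a mutex again */
static DEFINE_MUTEX(cgroup_rstat_lock);

void cgroup_rstat_flush(struct cgroup *cgrp)
{
	might_sleep();	/* flushing from atomic context would now be a bug */

	mutex_lock(&cgroup_rstat_lock);
	cgroup_rstat_flush_locked(cgrp);
	mutex_unlock(&cgroup_rstat_lock);
}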

Cheers,
Longman



2024-04-09 17:11:50

by Yosry Ahmed

Subject: Re: Advice on cgroup rstat lock

On Tue, Apr 9, 2024 at 8:37 AM Waiman Long <[email protected]> wrote:
>
> On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
> > Let move this discussion upstream.
> >
> > On 22/03/2024 19.32, Yosry Ahmed wrote:
> >> [..]
> >>>> There was a couple of series that made all calls to
> >>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
> >>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
> >>>> problem that we used to face (except in our case, we saw hard lockups
> >>>> in extreme scenarios):
> >>>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>>
> >>>> https://lore.kernel.org/lkml/[email protected]/
> >>>>
> >>>
> >>> I've only done the 6.6 backport, and these were in 6.5/6.6.
> >
> > Given I have these in my 6.6 kernel. You are basically saying I should
> > be able to avoid IRQ-disable for the lock, right?
> >
> > My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
> > and (thereby also) BH/softirq (spin_lock_irq). This cause production
> > issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
> > latency issues (debug via softirq_net_latency.bt [5]).
> >
> > [3]
> > https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
> > [5]
> > https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
> >
> >
> >>> And between 6.1 to 6.6 we did observe an improvement in this area.
> >>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
> >>> progress)
> >>>
> >>> I've had a chance to get running in prod for 6.6 backport.
> >>> As you can see in attached grafana heatmap pictures, we do observe an
> >>> improved/reduced softirq wait time.
> >>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
> >>> observed. As you can see, I still have other areas to improve/fix.
> >>
> >> I am not very familiar with such heatmaps, but I am glad there is an
> >> improvement with 6.6 and the backports. Let me know if there is
> >> anything I could do to help with your effort.
> >
> > The heatmaps give me an overview, but I needed a debugging tool, so I
> > developed some bpftrace scripts [1][2] I'm running on production.
> > To measure how long time we hold the cgroup rstat lock (results below).
> > Adding ACME and Daniel as I hope there is an easier way to measure lock
> > hold time and congestion. Notice tricky release/yield in
> > cgroup_rstat_flush_locked[4].
> >
> > My production results on 6.6 with backported patches (below signature)
> > vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
> > shows how long time the lock+IRQs were disabled (taking into account it
> > can be released in the loop [4]).
> >
> > Patched kernel:
> >
> > 21:49:02 time elapsed: 43200 sec
> > @lock_time_hist_ns:
> > [2K, 4K) 61 | |
> > [4K, 8K) 734 | |
> > [8K, 16K) 121500 |@@@@@@@@@@@@@@@@ |
> > [16K, 32K) 385714
> > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [32K, 64K) 145600 |@@@@@@@@@@@@@@@@@@@ |
> > [64K, 128K) 156873 |@@@@@@@@@@@@@@@@@@@@@ |
> > [128K, 256K) 261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [256K, 512K) 291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [512K, 1M) 101859 |@@@@@@@@@@@@@ |
> > [1M, 2M) 19866 |@@ |
> > [2M, 4M) 10146 |@ |
> > [4M, 8M) 30633 |@@@@ |
> > [8M, 16M) 40365 |@@@@@ |
> > [16M, 32M) 21650 |@@ |
> > [32M, 64M) 5842 | |
> > [64M, 128M) 8 | |
> >
> > And normal 6.6 kernel:
> >
> > 21:48:32 time elapsed: 43200 sec
> > @lock_time_hist_ns:
> > [1K, 2K) 25 | |
> > [2K, 4K) 1146 | |
> > [4K, 8K) 59397 |@@@@ |
> > [8K, 16K) 571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [16K, 32K) 542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [32K, 64K) 202810 |@@@@@@@@@@@@@ |
> > [64K, 128K) 134564 |@@@@@@@@@ |
> > [128K, 256K) 72870 |@@@@@ |
> > [256K, 512K) 56914 |@@@ |
> > [512K, 1M) 83140 |@@@@@ |
> > [1M, 2M) 170514 |@@@@@@@@@@@ |
> > [2M, 4M) 396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> > [4M, 8M) 755537
> > |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> > [8M, 16M) 231222 |@@@@@@@@@@@@@@@ |
> > [16M, 32M) 76370 |@@@@@ |
> > [32M, 64M) 1043 | |
> > [64M, 128M) 12 | |
> >
> >
> > For the unpatched kernel we see more events in 4ms to 8ms bucket than
> > any other bucket.
> > For patched kernel, we clearly see a significant reduction of events in
> > the 4 ms to 64 ms area, but we still have some events in this area. I'm
> > very happy to see these patches improves the situation. But for network
> > processing I'm not happy to see events in area 16ms to 128ms area. If
> > we can just avoid disabling IRQs/softirq for the lock, I would be happy.
> >
> > How far can we go... could cgroup_rstat_lock be converted to a mutex?
>
> The cgroup_rstat_lock was originally a mutex. It was converted to a
> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
> a spinlock"). Irq was disabled to enable calling from atomic context.
> Since commit 0a2dc6ac3329 ("cgroup: remove
> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
> atomic context anymore. Theoretically, we could change it back to a
> mutex or not disabling interrupt. That will require that the API cannot
> be called from atomic context going forward.

I think we should avoid flushing from atomic contexts going forward
anyway tbh. It's just too much work to do with IRQs disabled, and we
observed hard lockups before in worst case scenarios.

I think one problem that was discussed before is that flushing is
exercised from multiple contexts and could have very high concurrency
(e.g. from reclaim when the system is under memory pressure). With a
mutex, the flusher could sleep with the mutex held and block other
threads for a while.

I vaguely recall experimenting locally with changing that lock into a
mutex and not liking the results, but I can't remember much more. I
could be misremembering though.

Currently, the lock is dropped in cgroup_rstat_flush_locked() between
CPU iterations if rescheduling is needed or the lock is being
contended (i.e. spin_needbreak() returns true). I had always wondered
if it's possible to introduce a similar primitive for IRQs? We could
also drop the lock (and re-enable IRQs) if IRQs are pending then.

2024-04-09 17:27:58

by Waiman Long

Subject: Re: Advice on cgroup rstat lock


On 4/9/24 12:45, Yosry Ahmed wrote:
> On Tue, Apr 9, 2024 at 8:37 AM Waiman Long <[email protected]> wrote:
>> On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
>>> Let move this discussion upstream.
>>>
>>> On 22/03/2024 19.32, Yosry Ahmed wrote:
>>>> [..]
>>>>>> There was a couple of series that made all calls to
>>>>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>>>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>>>>> problem that we used to face (except in our case, we saw hard lockups
>>>>>> in extreme scenarios):
>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>> I've only done the 6.6 backport, and these were in 6.5/6.6.
>>> Given I have these in my 6.6 kernel. You are basically saying I should
>>> be able to avoid IRQ-disable for the lock, right?
>>>
>>> My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
>>> and (thereby also) BH/softirq (spin_lock_irq). This cause production
>>> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
>>> latency issues (debug via softirq_net_latency.bt [5]).
>>>
>>> [3]
>>> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
>>> [5]
>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>>>
>>>
>>>>> And between 6.1 to 6.6 we did observe an improvement in this area.
>>>>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
>>>>> progress)
>>>>>
>>>>> I've had a chance to get running in prod for 6.6 backport.
>>>>> As you can see in attached grafana heatmap pictures, we do observe an
>>>>> improved/reduced softirq wait time.
>>>>> These softirq "not-able-to-run" outliers is *one* of the prod issues we
>>>>> observed. As you can see, I still have other areas to improve/fix.
>>>> I am not very familiar with such heatmaps, but I am glad there is an
>>>> improvement with 6.6 and the backports. Let me know if there is
>>>> anything I could do to help with your effort.
>>> The heatmaps give me an overview, but I needed a debugging tool, so I
>>> developed some bpftrace scripts [1][2] I'm running on production.
>>> To measure how long time we hold the cgroup rstat lock (results below).
>>> Adding ACME and Daniel as I hope there is an easier way to measure lock
>>> hold time and congestion. Notice tricky release/yield in
>>> cgroup_rstat_flush_locked[4].
>>>
>>> My production results on 6.6 with backported patches (below signature)
>>> vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
>>> shows how long time the lock+IRQs were disabled (taking into account it
>>> can be released in the loop [4]).
>>>
>>> Patched kernel:
>>>
>>> 21:49:02 time elapsed: 43200 sec
>>> @lock_time_hist_ns:
>>> [2K, 4K) 61 | |
>>> [4K, 8K) 734 | |
>>> [8K, 16K) 121500 |@@@@@@@@@@@@@@@@ |
>>> [16K, 32K) 385714
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [32K, 64K) 145600 |@@@@@@@@@@@@@@@@@@@ |
>>> [64K, 128K) 156873 |@@@@@@@@@@@@@@@@@@@@@ |
>>> [128K, 256K) 261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [256K, 512K) 291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [512K, 1M) 101859 |@@@@@@@@@@@@@ |
>>> [1M, 2M) 19866 |@@ |
>>> [2M, 4M) 10146 |@ |
>>> [4M, 8M) 30633 |@@@@ |
>>> [8M, 16M) 40365 |@@@@@ |
>>> [16M, 32M) 21650 |@@ |
>>> [32M, 64M) 5842 | |
>>> [64M, 128M) 8 | |
>>>
>>> And normal 6.6 kernel:
>>>
>>> 21:48:32 time elapsed: 43200 sec
>>> @lock_time_hist_ns:
>>> [1K, 2K) 25 | |
>>> [2K, 4K) 1146 | |
>>> [4K, 8K) 59397 |@@@@ |
>>> [8K, 16K) 571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [16K, 32K) 542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [32K, 64K) 202810 |@@@@@@@@@@@@@ |
>>> [64K, 128K) 134564 |@@@@@@@@@ |
>>> [128K, 256K) 72870 |@@@@@ |
>>> [256K, 512K) 56914 |@@@ |
>>> [512K, 1M) 83140 |@@@@@ |
>>> [1M, 2M) 170514 |@@@@@@@@@@@ |
>>> [2M, 4M) 396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>> [4M, 8M) 755537
>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>> [8M, 16M) 231222 |@@@@@@@@@@@@@@@ |
>>> [16M, 32M) 76370 |@@@@@ |
>>> [32M, 64M) 1043 | |
>>> [64M, 128M) 12 | |
>>>
>>>
>>> For the unpatched kernel we see more events in 4ms to 8ms bucket than
>>> any other bucket.
>>> For patched kernel, we clearly see a significant reduction of events in
>>> the 4 ms to 64 ms area, but we still have some events in this area. I'm
>>> very happy to see these patches improves the situation. But for network
>>> processing I'm not happy to see events in area 16ms to 128ms area. If
>>> we can just avoid disabling IRQs/softirq for the lock, I would be happy.
>>>
>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>> The cgroup_rstat_lock was originally a mutex. It was converted to a
>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
>> a spinlock"). Irq was disabled to enable calling from atomic context.
>> Since commit 0a2dc6ac3329 ("cgroup: remove
>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
>> atomic context anymore. Theoretically, we could change it back to a
>> mutex or not disabling interrupt. That will require that the API cannot
>> be called from atomic context going forward.
> I think we should avoid flushing from atomic contexts going forward
> anyway tbh. It's just too much work to do with IRQs disabled, and we
> observed hard lockups before in worst case scenarios.
>
> I think one problem that was discussed before is that flushing is
> exercised from multiple contexts and could have very high concurrency
> (e.g. from reclaim when the system is under memory pressure). With a
> mutex, the flusher could sleep with the mutex held and block other
> threads for a while.
>
> I vaguely recall experimenting locally with changing that lock into a
> mutex and not liking the results, but I can't remember much more. I
> could be misremembering though.
>
> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> CPU iterations if rescheduling is needed or the lock is being
> contended (i.e. spin_needbreak() returns true). I had always wondered
> if it's possible to introduce a similar primitive for IRQs? We could
> also drop the lock (and re-enable IRQs) if IRQs are pending then.

I am not sure if there is a way to check if a hardirq is pending, but we
do have a local_softirq_pending() helper.
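
A hypothetical (untested) variant of the "play nice" check in
cgroup_rstat_flush_locked() using that helper could look like:

		/* Hypothetical tweak: also yield when this CPU has softirqs pending */
		if (need_resched() || spin_needbreak(&cgroup_rstat_lock) ||
		    local_softirq_pending()) {
			spin_unlock_irq(&cgroup_rstat_lock);
			if (!cond_resched())
				cpu_relax();
			spin_lock_irq(&cgroup_rstat_lock);
		}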

Regards,
Longman


2024-04-11 10:18:05

by Jesper Dangaard Brouer

Subject: Re: Advice on cgroup rstat lock



On 09/04/2024 18.59, Waiman Long wrote:
>
> On 4/9/24 12:45, Yosry Ahmed wrote:
>> On Tue, Apr 9, 2024 at 8:37 AM Waiman Long <[email protected]> wrote:
>>> On 4/9/24 07:08, Jesper Dangaard Brouer wrote:
>>>> Let move this discussion upstream.
>>>>
>>>> On 22/03/2024 19.32, Yosry Ahmed wrote:
>>>>> [..]
>>>>>>> There was a couple of series that made all calls to
>>>>>>> cgroup_rstat_flush() sleepable, which allows the lock to be dropped
>>>>>>> (and IRQs enabled) in between CPU iterations. This fixed a similar
>>>>>>> problem that we used to face (except in our case, we saw hard
>>>>>>> lockups
>>>>>>> in extreme scenarios):
>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>
>>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>>
>>>>>> I've only done the 6.6 backport, and these were in 6.5/6.6.
>>>> Given I have these in my 6.6 kernel. You are basically saying I should
>>>> be able to avoid IRQ-disable for the lock, right?
>>>>
>>>> My main problem with the global cgroup_rstat_lock[3] is it disables
>>>> IRQs
>>>> and (thereby also) BH/softirq (spin_lock_irq).  This cause production
>>>> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
>>>> latency issues (debug via softirq_net_latency.bt [5]).
>>>>
>>>>    [3]
>>>> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
>>>>    [5]
>>>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>>>>
>>>>
>>>>>> And between 6.1 to 6.6 we did observe an improvement in this area.
>>>>>> (Maybe I don't have to do the 6.1 backport if the 6.6 release plan
>>>>>> progress)
>>>>>>
>>>>>> I've had a chance to get running in prod for 6.6 backport.
>>>>>> As you can see in attached grafana heatmap pictures, we do observe an
>>>>>> improved/reduced softirq wait time.
>>>>>> These softirq "not-able-to-run" outliers is *one* of the prod
>>>>>> issues we
>>>>>> observed.  As you can see, I still have other areas to improve/fix.
>>>>> I am not very familiar with such heatmaps, but I am glad there is an
>>>>> improvement with 6.6 and the backports. Let me know if there is
>>>>> anything I could do to help with your effort.
>>>> The heatmaps give me an overview, but I needed a debugging tool, so I
>>>> developed some bpftrace scripts [1][2] I'm running on production.
>>>> To measure how long time we hold the cgroup rstat lock (results below).
>>>> Adding ACME and Daniel as I hope there is an easier way to measure lock
>>>> hold time and congestion. Notice tricky release/yield in
>>>> cgroup_rstat_flush_locked[4].
>>>>
>>>> My production results on 6.6 with backported patches (below signature)
>>>> vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
>>>> shows how long time the lock+IRQs were disabled (taking into account it
>>>> can be released in the loop [4]).
>>>>
>>>> Patched kernel:
>>>>
>>>> 21:49:02  time elapsed: 43200 sec
>>>> @lock_time_hist_ns:
>>>> [2K, 4K)              61 |      |
>>>> [4K, 8K)             734 |      |
>>>> [8K, 16K)         121500 |@@@@@@@@@@@@@@@@      |
>>>> [16K, 32K)        385714
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [32K, 64K)        145600 |@@@@@@@@@@@@@@@@@@@      |
>>>> [64K, 128K)       156873 |@@@@@@@@@@@@@@@@@@@@@      |
>>>> [128K, 256K)      261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>>>> [256K, 512K)      291986
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>>>> [512K, 1M)        101859 |@@@@@@@@@@@@@      |
>>>> [1M, 2M)           19866 |@@      |
>>>> [2M, 4M)           10146 |@      |
>>>> [4M, 8M)           30633 |@@@@      |
>>>> [8M, 16M)          40365 |@@@@@      |
>>>> [16M, 32M)         21650 |@@      |
>>>> [32M, 64M)          5842 |      |
>>>> [64M, 128M)            8 |      |
>>>>
>>>> And normal 6.6 kernel:
>>>>
>>>> 21:48:32  time elapsed: 43200 sec
>>>> @lock_time_hist_ns:
>>>> [1K, 2K)              25 |      |
>>>> [2K, 4K)            1146 |      |
>>>> [4K, 8K)           59397 |@@@@      |
>>>> [8K, 16K)         571528
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>>>> [16K, 32K)        542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>>>> [32K, 64K)        202810 |@@@@@@@@@@@@@      |
>>>> [64K, 128K)       134564 |@@@@@@@@@      |
>>>> [128K, 256K)       72870 |@@@@@      |
>>>> [256K, 512K)       56914 |@@@      |
>>>> [512K, 1M)         83140 |@@@@@      |
>>>> [1M, 2M)          170514 |@@@@@@@@@@@      |
>>>> [2M, 4M)          396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
>>>> [4M, 8M)          755537
>>>> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [8M, 16M)         231222 |@@@@@@@@@@@@@@@      |
>>>> [16M, 32M)         76370 |@@@@@      |
>>>> [32M, 64M)          1043 |      |
>>>> [64M, 128M)           12 |      |
>>>>
>>>>
>>>> For the unpatched kernel we see more events in 4ms to 8ms bucket than
>>>> any other bucket.
>>>> For patched kernel, we clearly see a significant reduction of events in
>>>> the 4 ms to 64 ms area, but we still have some events in this area.
>>>> I'm
>>>> very happy to see these patches improves the situation.  But for
>>>> network
>>>> processing I'm not happy to see events in area 16ms to 128ms area.  If
>>>> we can just avoid disabling IRQs/softirq for the lock, I would be
>>>> happy.
>>>>
>>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>>>
>>> The cgroup_rstat_lock was originally a mutex. It was converted to a
>>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
>>> a spinlock"). Irq was disabled to enable calling from atomic context.
>>> Since commit 0a2dc6ac3329 ("cgroup: remove
>>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
>>> atomic context anymore. Theoretically, we could change it back to a
>>> mutex or not disabling interrupt. That will require that the API cannot
>>> be called from atomic context going forward.
>>>
>> I think we should avoid flushing from atomic contexts going forward
>> anyway tbh. It's just too much work to do with IRQs disabled, and we
>> observed hard lockups before in worst case scenarios.
>>

I appreciate the historic commits as documentation for how the code
evolved. It sounds like we agree that the IRQ-disable can be lifted,
at least between the three of us.

>> I think one problem that was discussed before is that flushing is
>> exercised from multiple contexts and could have very high concurrency
>> (e.g. from reclaim when the system is under memory pressure). With a
>> mutex, the flusher could sleep with the mutex held and block other
>> threads for a while.
>>

Fair point, so in a first iteration we keep the spin_lock but don't do
the IRQ-disable. I already have an upstream devel kernel doing this in
my test lab, but I need to test it in prod to see the effects. Can you
recommend a test I should run in my test lab?
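
To be concrete, the devel change is essentially something like the
sketch below (not the exact patch), plus the matching
spin_unlock()/spin_lock() in the yield path of
cgroup_rstat_flush_locked():

/* Sketch: keep cgroup_rstat_lock a spinlock, but stop disabling IRQs */
void cgroup_rstat_flush(struct cgroup *cgrp)
{
	might_sleep();

	spin_lock(&cgroup_rstat_lock);		/* was: spin_lock_irq() */
	cgroup_rstat_flush_locked(cgrp);
	spin_unlock(&cgroup_rstat_lock);	/* was: spin_unlock_irq() */
}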

I'm also looking at adding some instrumentation, as my bpftrace
script [2] needs to be adjusted for every binary build.
I'm still hoping ACME will give me an easier approach to measuring lock
wait and hold time (without having to instrument *all* locks in the
system).


[2]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt


>> I vaguely recall experimenting locally with changing that lock into a
>> mutex and not liking the results, but I can't remember much more. I
>> could be misremembering though.
>>
>> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
>> CPU iterations if rescheduling is needed or the lock is being
>> contended (i.e. spin_needbreak() returns true). I had always wondered
>> if it's possible to introduce a similar primitive for IRQs? We could
>> also drop the lock (and re-enable IRQs) if IRQs are pending then.
>
> I am not sure if there is a way to check if a hardirq is pending, but we
> do have a local_softirq_pending() helper.

The local_softirq_pending() check might work well for me, as this is
exactly our prod problem: CPU-local pending softirqs are getting starved.

In production, another problematic (but rarely occurring) issue is when
several CPUs contend on this lock. Yosry's recent work/patches have
already reduced the chances of this happening (thanks), BUT it still
can and does happen.
A simple solution would be to do a spin_trylock() in
cgroup_rstat_flush() and exit if we cannot get the lock, because we
know someone else will do the work; see the sketch below.
I expect someone to complain here, as cgroup_rstat_flush() takes a
cgroup argument, so I might starve updates on some other cgroup. I
wonder if I can simply check whether cgroup->rstat_flush_next is not
NULL to determine if this cgroup is the one currently being processed?
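
Something like this sketch (hypothetical, not a real patch) is what I
have in mind:

/* Hypothetical sketch of the trylock idea */
void cgroup_rstat_flush(struct cgroup *cgrp)
{
	might_sleep();

	/*
	 * If another CPU holds the lock, it is already flushing; accept
	 * slightly stale stats instead of piling up on the global lock.
	 * Open question: is that OK when @cgrp is not the cgroup
	 * currently being flushed by the lock holder?
	 */
	if (!spin_trylock_irq(&cgroup_rstat_lock))
		return;

	cgroup_rstat_flush_locked(cgrp);
	spin_unlock_irq(&cgroup_rstat_lock);
}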

--Jesper

2024-04-11 11:08:32

by Arnaldo Carvalho de Melo

Subject: Re: Advice on cgroup rstat lock

On Tue, Apr 09, 2024 at 01:08:40PM +0200, Jesper Dangaard Brouer wrote:
> Let move this discussion upstream.
>
> On 22/03/2024 19.32, Yosry Ahmed wrote:
> > [..]
> > > > There was a couple of series that made all calls to
> > > > cgroup_rstat_flush() sleepable, which allows the lock to be dropped
> > > > (and IRQs enabled) in between CPU iterations. This fixed a similar
> > > > problem that we used to face (except in our case, we saw hard lockups
> > > > in extreme scenarios):
> > > > https://lore.kernel.org/linux-mm/[email protected]/
> > > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > I've only done the 6.6 backport, and these were in 6.5/6.6.
>
> Given I have these in my 6.6 kernel. You are basically saying I should
> be able to avoid IRQ-disable for the lock, right?
>
> My main problem with the global cgroup_rstat_lock[3] is it disables IRQs
> and (thereby also) BH/softirq (spin_lock_irq). This cause production
> issues elsewhere, e.g. we are seeing network softirq "not-able-to-run"
> latency issues (debug via softirq_net_latency.bt [5]).
>
> [3]
> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
> [5] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
>
> > > And between 6.1 to 6.6 we did observe an improvement in this area.
> > > (Maybe I don't have to do the 6.1 backport if the 6.6 release plan progress)
> > >
> > > I've had a chance to get running in prod for 6.6 backport.
> > > As you can see in attached grafana heatmap pictures, we do observe an
> > > improved/reduced softirq wait time.
> > > These softirq "not-able-to-run" outliers is *one* of the prod issues we
> > > observed. As you can see, I still have other areas to improve/fix.
> >
> > I am not very familiar with such heatmaps, but I am glad there is an
> > improvement with 6.6 and the backports. Let me know if there is
> > anything I could do to help with your effort.
>
> The heatmaps give me an overview, but I needed a debugging tool, so I
> developed some bpftrace scripts [1][2] I'm running on production.
> To measure how long time we hold the cgroup rstat lock (results below).
> Adding ACME and Daniel as I hope there is an easier way to measure lock
> hold time and congestion. Notice tricky release/yield in
> cgroup_rstat_flush_locked[4].

Have you tried:

root@number:~# echo 1 > /proc/sys/vm/drop_caches
root@number:~# perf lock contention -b find / > /dev/null
 contended   total wait     max wait     avg wait         type   caller

         8      16.32 s       7.08 s       2.04 s     spinlock   tick_do_update_jiffies64+0x25
         2       1.58 s       1.58 s    787.88 ms     spinlock   raw_spin_rq_lock_nested+0x1c
        19    165.77 us     24.93 us      8.72 us      rwsem:R   __btrfs_tree_read_lock+0x1b
        17    103.15 us     16.31 us      6.07 us      rwsem:R   __btrfs_tree_read_lock+0x1b
         3     21.45 us      7.88 us      7.15 us      rwsem:R   __btrfs_tree_read_lock+0x1b
         1     10.62 us     10.62 us     10.62 us     spinlock   raw_spin_rq_lock_nested+0x1c
         1      5.57 us      5.57 us      5.57 us      rwsem:R   __btrfs_tree_read_lock+0x1b
         1      5.49 us      5.49 us      5.49 us     spinlock   tick_do_update_jiffies64+0x25
root@number:~# perf lock contention -b find / > /dev/null
 contended   total wait     max wait     avg wait         type   caller

         1      5.91 us      5.91 us      5.91 us      rwsem:R   __btrfs_tree_read_lock+0x1b
root@number:~#

?

There are other modes of operation:

root@number:~# perf lock contention --help

Usage: perf lock contention [<options>]

    -a, --all-cpus        System-wide collection from all CPUs
    -b, --use-bpf         use BPF program to collect lock contention stats
    -C, --cpu <cpu>       List of cpus to monitor
    -E, --entries <n>     display this many functions
    -F, --field <contended,wait_total,wait_max,avg_wait>
                          output fields (contended / wait_total / wait_max / wait_min / avg_wait)
    -G, --cgroup-filter <CGROUPS>
                          Filter specific cgroups
    -k, --key <wait_total>
                          key for sorting (contended / wait_total / wait_max / wait_min / avg_wait)
    -l, --lock-addr       show lock stats by address
    -L, --lock-filter <ADDRS/NAMES>
                          Filter specific address/symbol of locks
    -M, --map-nr-entries <num>
                          Max number of BPF map entries
    -o, --lock-owner      show lock owners instead of waiters
    -p, --pid <pid>       Trace on existing process id
    -S, --callstack-filter <NAMES>
                          Filter specific function in the callstack
    -t, --threads         show per-thread lock stats
    -x, --field-separator <separator>
                          print result in CSV format with custom separator
    -Y, --type-filter <FLAGS>
                          Filter specific type of locks
        --lock-cgroup     show lock stats by cgroup
        --max-stack <num>
                          Set the maximum stack depth when collecting lock contention, Default: 8
        --stack-skip <n>  Set the number of stack depth to skip when finding a lock caller, Default: 4
        --tid <tid>       Trace on existing thread id (exclusive to --pid)

root@number:~#

Looking at:

git log tools/perf/util/bpf_skel/lock_contention.bpf.c tools/perf/builtin-lock.c

Will show you more examples and details about its implementation that
may help in tailoring it to your needs.

- Arnaldo

> My production results on 6.6 with backported patches (below signature)
> vs a our normal 6.6 kernel, with script [2]. The `@lock_time_hist_ns`
> shows how long time the lock+IRQs were disabled (taking into account it
> can be released in the loop [4]).
>
> Patched kernel:
>
> 21:49:02 time elapsed: 43200 sec
> @lock_time_hist_ns:
> [2K, 4K) 61 | |
> [4K, 8K) 734 | |
> [8K, 16K) 121500 |@@@@@@@@@@@@@@@@ |
> [16K, 32K) 385714
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [32K, 64K) 145600 |@@@@@@@@@@@@@@@@@@@ |
> [64K, 128K) 156873 |@@@@@@@@@@@@@@@@@@@@@ |
> [128K, 256K) 261027 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [256K, 512K) 291986 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [512K, 1M) 101859 |@@@@@@@@@@@@@ |
> [1M, 2M) 19866 |@@ |
> [2M, 4M) 10146 |@ |
> [4M, 8M) 30633 |@@@@ |
> [8M, 16M) 40365 |@@@@@ |
> [16M, 32M) 21650 |@@ |
> [32M, 64M) 5842 | |
> [64M, 128M) 8 | |
>
> And normal 6.6 kernel:
>
> 21:48:32 time elapsed: 43200 sec
> @lock_time_hist_ns:
> [1K, 2K) 25 | |
> [2K, 4K) 1146 | |
> [4K, 8K) 59397 |@@@@ |
> [8K, 16K) 571528 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [16K, 32K) 542648 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [32K, 64K) 202810 |@@@@@@@@@@@@@ |
> [64K, 128K) 134564 |@@@@@@@@@ |
> [128K, 256K) 72870 |@@@@@ |
> [256K, 512K) 56914 |@@@ |
> [512K, 1M) 83140 |@@@@@ |
> [1M, 2M) 170514 |@@@@@@@@@@@ |
> [2M, 4M) 396304 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [4M, 8M) 755537
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8M, 16M) 231222 |@@@@@@@@@@@@@@@ |
> [16M, 32M) 76370 |@@@@@ |
> [32M, 64M) 1043 | |
> [64M, 128M) 12 | |
>
>
> For the unpatched kernel we see more events in 4ms to 8ms bucket than
> any other bucket.
> For patched kernel, we clearly see a significant reduction of events in
> the 4 ms to 64 ms area, but we still have some events in this area. I'm
> very happy to see these patches improves the situation. But for network
> processing I'm not happy to see events in area 16ms to 128ms area. If
> we can just avoid disabling IRQs/softirq for the lock, I would be happy.
>
> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>
> --Jesper
>
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency.bt
> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt
> [3]
> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L10
> [4]
> https://elixir.bootlin.com/linux/v6.9-rc3/source/kernel/cgroup/rstat.c#L226
> cgroup_rstat_flush_locked
> [5] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/softirq_net_latency.bt
>
>
>
> Backported to 6.6
>
> List of **main** patches address issue - to backport for 6.6:
> - 508bed884767 mm: memcg: change flush_next_time to flush_last_time
> - v6.8-rc1~180^2~205
> - e0bf1dc859fd mm: memcg: move vmstats structs definition above flushing
> code
> - v6.8-rc1~180^2~204
> - 8d59d2214c23 mm: memcg: make stats flushing threshold per-memcg
> - v6.8-rc1~180^2~203
> - b00684722262 mm: workingset: move the stats flush into
> workingset_test_recent()
> - v6.8-rc1~180^2~202
> - 7d7ef0a4686a mm: memcg: restore subtree stats flushing
> - v6.8-rc1~180^2~201
>
> And extra (thanks Longman)
>
> - e76d28bdf9ba ("cgroup/rstat: Reduce cpu_lock hold time in
> cgroup_rstat_flush_locked()")
> - v6.8-rc1~182^2~8
>
> And list of patches that contain **fixes** for backports above:
> - 9cee7e8ef3e3 mm: memcg: optimize parent iteration in
> memcg_rstat_updated()
> - v6.8-rc4~3^2~12
> - 13ef7424577f mm: memcg: don't periodically flush stats when memcg is
> disabled
> - v6.8-rc5-69-g13ef7424577f
>

2024-04-11 18:17:25

by Yosry Ahmed

Subject: Re: Advice on cgroup rstat lock

[..]
> >>>>
> >>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
> >>>
> >>> The cgroup_rstat_lock was originally a mutex. It was converted to a
> >>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
> >>> a spinlock"). Irq was disabled to enable calling from atomic context.
> >>> Since commit 0a2dc6ac3329 ("cgroup: remove
> >>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
> >>> atomic context anymore. Theoretically, we could change it back to a
> >>> mutex or not disabling interrupt. That will require that the API cannot
> >>> be called from atomic context going forward.
> >>>
> >> I think we should avoid flushing from atomic contexts going forward
> >> anyway tbh. It's just too much work to do with IRQs disabled, and we
> >> observed hard lockups before in worst case scenarios.
> >>
>
> Appreciate the historic commits as documentation for how the code
> evolved. Sounds like we agree that the IRQ-disable can be lifted,
> at-least between the three of us.

It can be lifted, but whether it should be or not is a different
story. I tried keeping it as a spinlock without disabling IRQs before
and Tejun pointed out possible problems, see below.

>
> >> I think one problem that was discussed before is that flushing is
> >> exercised from multiple contexts and could have very high concurrency
> >> (e.g. from reclaim when the system is under memory pressure). With a
> >> mutex, the flusher could sleep with the mutex held and block other
> >> threads for a while.
> >>
>
> Fair point, so in first iteration we keep the spin_lock but don't do the
> IRQ disable.

I tried doing that before, and Tejun had some objections:
https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/

My read of that thread is that Tejun would prefer we look into
converting cgroup_rstat_lock into a mutex again, or more aggressively
drop the lock on CPU boundaries. Perhaps we can unconditionally drop
the lock on each CPU boundary, but I am worried that contending the
lock too often may be an issue, which is why I suggested dropping the
lock if there are pending IRQs instead -- but I am not sure how to do
that :)

> I already have a upstream devel kernel doing this in my
> testlab, but I need to test this in prod to see the effects. Can you
> recommend a test I should run in my testlab?

I don't know of any existing test/benchmark. What I used to do is run
a synthetic test with a lot of concurrent reclaim activity (some in
the same cgroups, some in different ones) to stress in-kernel
flushers, and a synthetic test with a lot of concurrent userspace
reads.

I would mainly look into the time it took for concurrent reclaim
operations to complete and the userspace reads latency histograms. I
don't have the scripts I used now unfortunately, but I can help with
more details if needed.

>
> I'm also looking at adding some instrumentation, as my bpftrace
> script[2] need to be adjusted to every binary build.
> Still hoping ACME will give me an easier approach to measuring lock wait
> and hold time? (without having to instrument *all* lock in system).
>
>
> [2]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt
>
>
> >> I vaguely recall experimenting locally with changing that lock into a
> >> mutex and not liking the results, but I can't remember much more. I
> >> could be misremembering though.
> >>
> >> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> >> CPU iterations if rescheduling is needed or the lock is being
> >> contended (i.e. spin_needbreak() returns true). I had always wondered
> >> if it's possible to introduce a similar primitive for IRQs? We could
> >> also drop the lock (and re-enable IRQs) if IRQs are pending then.
> >
> > I am not sure if there is a way to check if a hardirq is pending, but we
> > do have a local_softirq_pending() helper.
>
> The local_softirq_pending() might work well for me, as this is our prod
> problem, that CPU local pending softirq's are getting starved.

If my understanding is correct, softirqs are usually scheduled by
IRQs, which means that local_softirq_pending() may return false if
there are pending IRQs (that will schedule softirqs). Is this correct?

>
> In production another problematic (but rarely occurring issue) is when
> several CPUs contend on this lock. Yosry's recent work/patches have
> already reduced the chances of this happening (thanks), BUT it still can
> and does happen.
> A simple solution to this, would be to do a spin_trylock() in
> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
> know someone else will do the work.

I am not sure I understand what you mean specifically with the checks
below, but I generally don't like this (as you predicted :) ).

On the memcg side, we used to have similar logic when we used to
always flush the entire tree. This leaded to flushing being
indeterministic. You would occasionally get stale stats because of the
contention, which resulted in some inconsistencies (e.g. performing
proactive reclaim successfully then reading the stats that do not
reflect that).

Now that we dropped the logic to always flush the entire tree, it is
even more difficult because concurrent flushes could be in completely
irrelevant subtrees.

If we were to introduce some smart logic to figure out that the
subtree we are trying to flush is already being flushed, I think we
would need to wait for that ongoing flush to complete instead of just
returning (e.g. using completions). But I think such implementations
to find overlapping flushes and wait for them may be too complicated.

> I expect someone to complain here, as cgroup_rstat_flush() takes a
> cgroup argument, so I might starve updates on some other cgroup. I
> wonder if I can simply check if cgroup->rstat_flush_next is not NULL, to
> determine if this cgroup is the one currently being processed?

2024-04-12 19:26:47

by Jesper Dangaard Brouer

Subject: Re: Advice on cgroup rstat lock



On 11/04/2024 19.22, Yosry Ahmed wrote:
> [..]
>>>>>>
>>>>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>> >>>
>>>>> The cgroup_rstat_lock was originally a mutex. It was converted to a
>>>>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
>>>>> a spinlock"). Irq was disabled to enable calling from atomic context.
>>>>> Since commit 0a2dc6ac3329 ("cgroup: remove
>>>>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
>>>>> atomic context anymore. Theoretically, we could change it back to a
>>>>> mutex or not disabling interrupt. That will require that the API cannot
>>>>> be called from atomic context going forward.
>> >>>
>>>> I think we should avoid flushing from atomic contexts going forward
>>>> anyway tbh. It's just too much work to do with IRQs disabled, and we
>>>> observed hard lockups before in worst case scenarios.
>>>>
>>
>> Appreciate the historic commits as documentation for how the code
>> evolved. Sounds like we agree that the IRQ-disable can be lifted,
>> at-least between the three of us.
>
> It can be lifted, but whether it should be or not is a different
> story. I tried keeping it as a spinlock without disabling IRQs before
> and Tejun pointed out possible problems, see below.
>

IMHO it *MUST* be lifted, as disabling IRQs here is hurting other parts
of the system and actual production systems.

The "offending" IRQ-spin_lock commit (0fa294fb1985) is from 2018, and
GitHub noticed in 2019 (via blog[1]) and at Red Hat I backported[2]
patches (which I now understand) only mitigate the issues. Our prod
systems are on 6.1 and 6.6 where we still clearly see the issue
occurring. Also Daniel's "rtla timerlat" tool for catching systems
latency issues have "cgroup_rstat_flush_locked" as the poster child [3][4].


[1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1795049
[3] https://bristot.me/linux-scheduling-latency-debug-and-analysis/
[4] Documentation/tools/rtla/rtla-timerlat-top.rst

>>
>>>> I think one problem that was discussed before is that flushing is
>>>> exercised from multiple contexts and could have very high concurrency
>>>> (e.g. from reclaim when the system is under memory pressure). With a
>>>> mutex, the flusher could sleep with the mutex held and block other
>>>> threads for a while.
>>>>
>>
>> Fair point, so in first iteration we keep the spin_lock but don't do the
>> IRQ disable.
>
> I tried doing that before, and Tejun had some objections:
> https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
>
> My read of that thread is that Tejun would prefer we look into
> converting cgroup_rsat_lock into a mutex again, or more aggressively
> drop the lock on CPU boundaries. Perhaps we can unconditionally drop
> the lock on each CPU boundary, but I am worried that contending the
> lock too often may be an issue, which is why I suggested dropping the
> lock if there are pending IRQs instead -- but I am not sure how to do
> that :)
>

Like Tejun, I share the concern that keeping this a spinlock can
increase the chance of several CPUs contending on this lock (which is
also a production issue we see). This is why I suggested to "exit" if
(1) we see that the lock has been taken by somebody else, or if (2)
stats were flushed recently.

For (2), memcg has a mem_cgroup_flush_stats_ratelimited() system,
combined with memcg_vmstats_needs_flush(), which limits the pressure on
the global lock (cgroup_rstat_lock).
*BUT* other users of cgroup_rstat_flush(), like reading io.stat
(blk-cgroup.c) and cpu.stat, don't have such a system to limit pressure
on the global lock. Furthermore, userspace can easily trigger this by
reading those stat files. And normal userspace stats tools (like
cadvisor, nomad, systemd) spawn threads reading io.stat, cpu.stat and
memory.stat, likely without realizing that on the kernel side they
share the same global lock...

I'm working on a code solution/proposal for "ratelimiting" global lock
access when reading io.stat and cpu.stat.
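
The rough shape I have in mind is something like the sketch below
(purely illustrative; the helper name and the 50 ms period are made up,
and the real proposal will differ):

/*
 * Illustrative sketch only: time-based ratelimit for stat-file readers,
 * loosely modeled on mem_cgroup_flush_stats_ratelimited().
 */
static unsigned long rstat_reader_last_flush;	/* jiffies of last reader flush */

static void cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
{
	/* Flushed recently (within ~50 ms): accept slightly stale stats */
	if (time_before(jiffies, READ_ONCE(rstat_reader_last_flush) +
				 msecs_to_jiffies(50)))
		return;

	cgroup_rstat_flush(cgrp);
	WRITE_ONCE(rstat_reader_last_flush, jiffies);
}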


>> I already have a upstream devel kernel doing this in my
>> testlab, but I need to test this in prod to see the effects. Can you
>> recommend a test I should run in my testlab?
>
> I don't know of any existing test/benchmark. What I used to do is run
> a synthetic test with a lot of concurrent reclaim activity (some in
> the same cgroups, some in different ones) to stress in-kernel
> flushers, and a synthetic test with a lot of concurrent userspace
> reads.
>
> I would mainly look into the time it took for concurrent reclaim
> operations to complete and the userspace reads latency histograms. I
> don't have the scripts I used now unfortunately, but I can help with
> more details if needed.
>
>>
>> I'm also looking at adding some instrumentation, as my bpftrace
>> script[2] need to be adjusted to every binary build.
>> Still hoping ACME will give me an easier approach to measuring lock wait
>> and hold time? (without having to instrument *all* lock in system).
>>
>>
>> [2]
>> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_latency_steroids.bt
>>
>>
>>>> I vaguely recall experimenting locally with changing that lock into a
>>>> mutex and not liking the results, but I can't remember much more. I
>>>> could be misremembering though.
>>>>
>>>> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
>>>> CPU iterations if rescheduling is needed or the lock is being
>>>> contended (i.e. spin_needbreak() returns true). I had always wondered
>>>> if it's possible to introduce a similar primitive for IRQs? We could
>>>> also drop the lock (and re-enable IRQs) if IRQs are pending then.
>>>
>>> I am not sure if there is a way to check if a hardirq is pending, but we
>>> do have a local_softirq_pending() helper.
>>
>> The local_softirq_pending() might work well for me, as this is our prod
>> problem, that CPU local pending softirq's are getting starved.
>
> If my understanding is correct, softirqs are usually scheduled by
> IRQs, which means that local_softirq_pending() may return false if
> there are pending IRQs (that will schedule softirqs). Is this correct?
>

Yes, the networking hard IRQ will raise a softirq, but software often
raises softirqs as well.
I see where you are going with this... the cgroup_rstat_flush_locked()
"play nice" check happens with the IRQ-disabling lock held, so you
speculate that a hardirq handler will not be able to run and raise the
softirq, and thus local_softirq_pending() will not trigger while the
lock is held.


>>
>> In production another problematic (but rarely occurring issue) is when
>> several CPUs contend on this lock. Yosry's recent work/patches have
>> already reduced the chances of this happening (thanks), BUT it still can
>> and does happen.
>> A simple solution to this, would be to do a spin_trylock() in
>> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
>> know someone else will do the work.
>
> I am not sure I understand what you mean specifically with the checks
> below, but I generally don't like this (as you predicted :) ).
>
> On the memcg side, we used to have similar logic when we used to
> always flush the entire tree. This leaded to flushing being
> indeterministic. You would occasionally get stale stats because of the
> contention, which resulted in some inconsistencies (e.g. performing
> proactive reclaim successfully then reading the stats that do not
> reflect that).
>
> Now that we dropped the logic to always flush the entire tree, it is
> even more difficult because concurrent flushes could be in completely
> irrelevant subtrees.
>
> If we were to introduce some smart logic to figure out that the
> subtree we are trying to flush is already being flushed, I think we
> would need to wait for that ongoing flush to complete instead of just
> returning (e.g. using completions). But I think such implementations
> to find overlapping flushes and wait for them may be too compicated.
>

We will see if you hate my current code approach ;-)

>> I expect someone to complain here, as cgroup_rstat_flush() takes a
>> cgroup argument, so I might starve updates on some other cgroup. I
>> wonder if I can simply check if cgroup->rstat_flush_next is not NULL, to
>> determine if this cgroup is the one currently being processed?

--Jesper

2024-04-12 19:52:51

by Yosry Ahmed

Subject: Re: Advice on cgroup rstat lock

On Fri, Apr 12, 2024 at 12:26 PM Jesper Dangaard Brouer <[email protected]> wrote:
>
>
>
> On 11/04/2024 19.22, Yosry Ahmed wrote:
> > [..]
> >>>>>>
> >>>>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
> >> >>>
> >>>>> The cgroup_rstat_lock was originally a mutex. It was converted to a
> >>>>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
> >>>>> a spinlock"). Irq was disabled to enable calling from atomic context.
> >>>>> Since commit 0a2dc6ac3329 ("cgroup: remove
> >>>>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
> >>>>> atomic context anymore. Theoretically, we could change it back to a
> >>>>> mutex or not disabling interrupt. That will require that the API cannot
> >>>>> be called from atomic context going forward.
> >> >>>
> >>>> I think we should avoid flushing from atomic contexts going forward
> >>>> anyway tbh. It's just too much work to do with IRQs disabled, and we
> >>>> observed hard lockups before in worst case scenarios.
> >>>>
> >>
> >> Appreciate the historic commits as documentation for how the code
> >> evolved. Sounds like we agree that the IRQ-disable can be lifted,
> >> at-least between the three of us.
> >
> > It can be lifted, but whether it should be or not is a different
> > story. I tried keeping it as a spinlock without disabling IRQs before
> > and Tejun pointed out possible problems, see below.
> >
>
> IMHO it *MUST* be lifted, as disabling IRQs here is hurting other parts
> of the system and actual production systems.
>
> The "offending" IRQ-spin_lock commit (0fa294fb1985) is from 2018, and
> GitHub noticed in 2019 (via blog[1]) and at Red Hat I backported[2]
> patches (which I now understand) only mitigate the issues. Our prod
> systems are on 6.1 and 6.6 where we still clearly see the issue
> occurring. Also Daniel's "rtla timerlat" tool for catching systems
> latency issues have "cgroup_rstat_flush_locked" as the poster child [3][4].

We have been bitten by the IRQ-spinlock before, so I cannot disagree,
although for us removing atomic flushes and allowing the lock to be
dropped between CPU flushes seems to be good enough (for now).

>
>
> [1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1795049
> [3] https://bristot.me/linux-scheduling-latency-debug-and-analysis/
> [4] Documentation/tools/rtla/rtla-timerlat-top.rst
>
> >>
> >>>> I think one problem that was discussed before is that flushing is
> >>>> exercised from multiple contexts and could have very high concurrency
> >>>> (e.g. from reclaim when the system is under memory pressure). With a
> >>>> mutex, the flusher could sleep with the mutex held and block other
> >>>> threads for a while.
> >>>>
> >>
> >> Fair point, so in first iteration we keep the spin_lock but don't do the
> >> IRQ disable.
> >
> > I tried doing that before, and Tejun had some objections:
> > https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
> >
> > My read of that thread is that Tejun would prefer we look into
> > converting cgroup_rsat_lock into a mutex again, or more aggressively
> > drop the lock on CPU boundaries. Perhaps we can unconditionally drop
> > the lock on each CPU boundary, but I am worried that contending the
> > lock too often may be an issue, which is why I suggested dropping the
> > lock if there are pending IRQs instead -- but I am not sure how to do
> > that :)
> >
>
> Like Tejun, I share the concern that keeping this a spinlock will
> can increase the chance of several CPUs contend on this lock (which is
> also a production issue we see). This is why I suggested to "exit" if
> (1) we see the lock have been taken by somebody else, or if (2) stats
> were flushed recently.

When you say "exit", do you mean abort the whole thing, or just don't
spin for the lock but wait for the ongoing flush?

>
> For (2), memcg have a mem_cgroup_flush_stats_ratelimited() system
> combined with memcg_vmstats_needs_flush(), which limits the pressure on
> the global lock (cgroup_rstat_lock).
> *BUT* other users of cgroup_rstat_flush() like when reading io.stat
> (blk-cgroup.c) and cpu.stat, don't have such a system to limit pressure
> on global lock. Further more, userspace can easily trigger this via
> reading those stat files. And normal userspace stats tools (like
> cadvisor, nomad, systemd) spawn threads reading io.stat, cpu.stat and
> memory.stat, likely without realizing that kernel side they share same
> global lock...
>
> I'm working on a code solution/proposal for "ratelimiting" global lock
> access when reading io.stat and cpu.stat.

I personally don't like mem_cgroup_flush_stats_ratelimited() very
much, because it is time-based (unlike memcg_vmstats_needs_flush()),
and a lot of changes can happen in a very short amount of time.
However, it seems like for some workloads it's a necessary evil :/
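
For readers following along, the two schemes boil down to roughly this
(paraphrased from mm/memcontrol.c; the exact form differs a bit between
kernel versions):

  /* Magnitude-based: only flush once enough updates have accumulated. */
  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
  {
      return atomic64_read(&vmstats->stats_updates) >
             MEMCG_CHARGE_BATCH * num_online_cpus();
  }

  /* Time-based: skip the flush entirely if one happened within the last
   * time window, no matter how much has changed since then.
   */
  void mem_cgroup_flush_stats_ratelimited(void)
  {
      if (time_after64(get_jiffies_64(), READ_ONCE(flush_next_time)))
          mem_cgroup_flush_stats();
  }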

I briefly looked into a global scheme similar to
memcg_vmstats_needs_flush() in core cgroups code, but I gave up
quickly. Different subsystems have different incomparable stats, so we
cannot have a simple magnitude of pending updates on a cgroup-level
that represents all subsystems fairly.

I tried to have per-subsystem callbacks to update the pending stats
and check if flushing is required -- but it got complicated quickly
and performance was bad.
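
Roughly, the shape I was playing with was something like this
(hypothetical sketch from memory; the callback does not exist upstream
and the names are made up):

  /* Hypothetical addition to struct cgroup_subsys:
   *
   *     bool (*css_rstat_needs_flush)(struct cgroup_subsys_state *css);
   */

  static bool cgroup_rstat_needs_flush(struct cgroup *cgrp)
  {
      struct cgroup_subsys_state *css;
      bool needed = false;

      rcu_read_lock();
      list_for_each_entry_rcu(css, &cgrp->rstat_css_list, rstat_css_node) {
          /* Subsystems without the hook are assumed to need a flush. */
          if (!css->ss->css_rstat_needs_flush ||
              css->ss->css_rstat_needs_flush(css)) {
              needed = true;
              break;
          }
      }
      rcu_read_unlock();

      return needed;
  }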

At some point, having different rstat trees for different subsystems
was brought up. I never looked into actually implementing it, but I
suppose if we do that, we could have a generic scheme similar to
memcg_vmstats_needs_flush() that can be customized by each subsystem
in a clean, performant way? I am not sure.

[..]
> >>
> >>>> I vaguely recall experimenting locally with changing that lock into a
> >>>> mutex and not liking the results, but I can't remember much more. I
> >>>> could be misremembering though.
> >>>>
> >>>> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
> >>>> CPU iterations if rescheduling is needed or the lock is being
> >>>> contended (i.e. spin_needbreak() returns true). I had always wondered
> >>>> if it's possible to introduce a similar primitive for IRQs? We could
> >>>> also drop the lock (and re-enable IRQs) if IRQs are pending then.
> >>>
> >>> I am not sure if there is a way to check if a hardirq is pending, but we
> >>> do have a local_softirq_pending() helper.
> >>
> >> The local_softirq_pending() might work well for me, as this is our prod
> >> problem, that CPU local pending softirq's are getting starved.
> >
> > If my understanding is correct, softirqs are usually scheduled by
> > IRQs, which means that local_softirq_pending() may return false if
> > there are pending IRQs (that will schedule softirqs). Is this correct?
> >
>
> Yes, networking hard IRQs will raise softirqs, but software often also
> raises softirqs.
> I see where you are going with this... the cgroup_rstat_flush_locked()
> loop's "play nice" check runs with the IRQ-disabling lock held, so you
> speculate that hard IRQ handlers cannot run and raise softirqs, and thus
> local_softirq_pending() will not work while the lock is held.

Exactly.

I wonder if it would be okay to just unconditionally drop the lock at
each CPU boundary. Would be interesting to experiment with this. One
disadvantage of the mutex in this case (imo) is that outside of the
percpu spinlock critical section, we don't really need to be holding
the global lock/mutex. So sleeping while holding it is not needed and
only introduces problems. Dropping the spinlock at each boundary seems
like a way to circumvent that.
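
Concretely, I mean something like this in the flush loop (sketch only,
on top of the current structure):

  for_each_possible_cpu(cpu) {
      /* ... flush this CPU's list of updated cgroups ... */

      /*
       * Sketch: rather than only yielding when need_resched() or
       * spin_needbreak() fire, unconditionally drop the lock (and
       * re-enable IRQs) between every CPU iteration.
       */
      spin_unlock_irq(&cgroup_rstat_lock);
      cond_resched();
      spin_lock_irq(&cgroup_rstat_lock);
  }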

If the problems you are observing are mainly on CPUs that are holding
the lock and flushing, I suspect this should help greatly. If the
problems are mainly on CPUs spinning for the lock, I suspect it will
still help redistribute the lock (and IRQ disablement) more often, but
not as much.

>
>
> >>
> >> In production another problematic (but rarely occurring issue) is when
> >> several CPUs contend on this lock. Yosry's recent work/patches have
> >> already reduced the chances of this happening (thanks), BUT it still can
> >> and does happen.
> >> A simple solution to this, would be to do a spin_trylock() in
> >> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
> >> know someone else will do the work.
> >
> > I am not sure I understand what you mean specifically with the checks
> > below, but I generally don't like this (as you predicted :) ).
> >
> > On the memcg side, we used to have similar logic when we used to
> > always flush the entire tree. This leaded to flushing being
> > indeterministic. You would occasionally get stale stats because of the
> > contention, which resulted in some inconsistencies (e.g. performing
> > proactive reclaim successfully then reading the stats that do not
> > reflect that).
> >
> > Now that we dropped the logic to always flush the entire tree, it is
> > even more difficult because concurrent flushes could be in completely
> > irrelevant subtrees.
> >
> > If we were to introduce some smart logic to figure out that the
> > subtree we are trying to flush is already being flushed, I think we
> > would need to wait for that ongoing flush to complete instead of just
> > returning (e.g. using completions). But I think such implementations
> > to find overlapping flushes and wait for them may be too compicated.
> >
>
> We will see if you hate my current code approach ;-)

Just to be clear, if the spinlock were to be converted to a mutex, or
to be dropped at each CPU boundary, do you think such ratelimiting
would still be needed to mitigate lock contention -- even if the IRQ
latency problem is fixed?

2024-04-16 14:50:38

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: Advice on cgroup rstat lock



On 12/04/2024 21.51, Yosry Ahmed wrote:
> On Fri, Apr 12, 2024 at 12:26 PM Jesper Dangaard Brouer <[email protected]> wrote:
>>
>>
>> On 11/04/2024 19.22, Yosry Ahmed wrote:
>>> [..]
>>>>>>>>
>>>>>>>> How far can we go... could cgroup_rstat_lock be converted to a mutex?
>>>> >>>
>>>>>>> The cgroup_rstat_lock was originally a mutex. It was converted to a
>>>>>>> spinlock in commit 0fa294fb1985 ("group: Replace cgroup_rstat_mutex with
>>>>>>> a spinlock"). Irq was disabled to enable calling from atomic context.
>>>>>>> Since commit 0a2dc6ac3329 ("cgroup: remove
>>>>>>> cgroup_rstat_flush_atomic()"), the rstat API hadn't been called from
>>>>>>> atomic context anymore. Theoretically, we could change it back to a
>>>>>>> mutex or not disabling interrupt. That will require that the API cannot
>>>>>>> be called from atomic context going forward.
>>>> >>>
>>>>>> I think we should avoid flushing from atomic contexts going forward
>>>>>> anyway tbh. It's just too much work to do with IRQs disabled, and we
>>>>>> observed hard lockups before in worst case scenarios.
>>>>>>
>>>>
>>>> Appreciate the historic commits as documentation for how the code
>>>> evolved. Sounds like we agree that the IRQ-disable can be lifted,
>>>> at-least between the three of us.
>>>
>>> It can be lifted, but whether it should be or not is a different
>>> story. I tried keeping it as a spinlock without disabling IRQs before
>>> and Tejun pointed out possible problems, see below.
>>>
>>
>> IMHO it *MUST* be lifted, as disabling IRQs here is hurting other parts
>> of the system and actual production systems.
>>
>> The "offending" IRQ-spin_lock commit (0fa294fb1985) is from 2018, and
>> GitHub noticed in 2019 (via blog[1]) and at Red Hat I backported[2]
>> patches (which I now understand) only mitigate the issues. Our prod
>> systems are on 6.1 and 6.6 where we still clearly see the issue
>> occurring. Also Daniel's "rtla timerlat" tool for catching systems
>> latency issues have "cgroup_rstat_flush_locked" as the poster child [3][4].
>
> We have been bitten by the IRQ-spinlock before, so I cannot disagree,
> although for us removing atomic flushes and allowing the lock to be
> dropped between CPU flushes seems to be good enough (for now).
>
>>
>>
>> [1] https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1795049
>> [3] https://bristot.me/linux-scheduling-latency-debug-and-analysis/
>> [4] Documentation/tools/rtla/rtla-timerlat-top.rst
>>
>>>>
>>>>>> I think one problem that was discussed before is that flushing is
>>>>>> exercised from multiple contexts and could have very high concurrency
>>>>>> (e.g. from reclaim when the system is under memory pressure). With a
>>>>>> mutex, the flusher could sleep with the mutex held and block other
>>>>>> threads for a while.
>>>>>>
>>>>
>>>> Fair point, so in first iteration we keep the spin_lock but don't do the
>>>> IRQ disable.
>>>
>>> I tried doing that before, and Tejun had some objections:
>>> https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
>>>
>>> My read of that thread is that Tejun would prefer we look into
>>> converting cgroup_rsat_lock into a mutex again, or more aggressively
>>> drop the lock on CPU boundaries. Perhaps we can unconditionally drop
>>> the lock on each CPU boundary, but I am worried that contending the
>>> lock too often may be an issue, which is why I suggested dropping the
>>> lock if there are pending IRQs instead -- but I am not sure how to do
>>> that :)
>>>
>>
>> Like Tejun, I share the concern that keeping this a spinlock will
>> can increase the chance of several CPUs contend on this lock (which is
>> also a production issue we see). This is why I suggested to "exit" if
>> (1) we see the lock have been taken by somebody else, or if (2) stats
>> were flushed recently.
>
> When you say "exit", do you mean abort the whole thing, or just don't
> spin for the lock but wait for the ongoing flush?
>

I like that we are considering a mutex lock, because it is not
reasonable to have remote CPUs spin-waiting on this lock when
cgroup_rstat_lock is held for so long (up to 64-128 ms in [prod]).

Prod latency data mentioned earlier:
[prod]
https://lore.kernel.org/all/[email protected]/


>>
>> For (2), memcg have a mem_cgroup_flush_stats_ratelimited() system
>> combined with memcg_vmstats_needs_flush(), which limits the pressure on
>> the global lock (cgroup_rstat_lock).
>> *BUT* other users of cgroup_rstat_flush() like when reading io.stat
>> (blk-cgroup.c) and cpu.stat, don't have such a system to limit pressure
>> on global lock. Further more, userspace can easily trigger this via
>> reading those stat files. And normal userspace stats tools (like
>> cadvisor, nomad, systemd) spawn threads reading io.stat, cpu.stat and
>> memory.stat, likely without realizing that kernel side they share same
>> global lock...
>>
>> I'm working on a code solution/proposal for "ratelimiting" global lock
>> access when reading io.stat and cpu.stat.
>
> I personally don't like mem_cgroup_flush_stats_ratelimited() very
> much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> and a lot of changes can happen in a very short amount of time.
> However, it seems like for some workloads it's a necessary evil :/
>

I like the combination of the two: mem_cgroup_flush_stats_ratelimited()
and memcg_vmstats_needs_flush().
IMHO the jiffies rate limit of 2*FLUSH_TIME is too high; that looks
like 4 sec?
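
(From my reading of mm/memcontrol.c, the window comes from roughly this,
paraphrased:)

  #define FLUSH_TIME (2UL*HZ)   /* periodic flush worker runs every 2 sec */

  /* ratelimited callers are pushed out one extra cycle, i.e. up to ~4 sec */
  WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);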


> I briefly looked into a global scheme similar to
> memcg_vmstats_needs_flush() in core cgroups code, but I gave up
> quickly. Different subsystems have different incomparable stats, so we
> cannot have a simple magnitude of pending updates on a cgroup-level
> that represents all subsystems fairly.
>
> I tried to have per-subsystem callbacks to update the pending stats
> and check if flushing is required -- but it got complicated quickly
> and performance was bad.
>

I like the time-based limit because it doesn't require tracking pending
updates.

I'm looking at using a time-based limit on how often userspace can take
the lock, but in the range of 50 ms to 100 ms.
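
A minimal sketch of what I mean (all names below are made up for
illustration, and races between concurrent readers are ignored):

  static unsigned long cgroup_rstat_last_user_flush;   /* in jiffies */
  #define RSTAT_USER_FLUSH_PERIOD   msecs_to_jiffies(50)

  /* Called from the io.stat/cpu.stat read paths before flushing. */
  static bool cgroup_rstat_user_flush_allowed(void)
  {
      unsigned long last = READ_ONCE(cgroup_rstat_last_user_flush);

      /* A flush happened recently; serve slightly stale stats instead. */
      if (time_before(jiffies, last + RSTAT_USER_FLUSH_PERIOD))
          return false;

      WRITE_ONCE(cgroup_rstat_last_user_flush, jiffies);
      return true;
  }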

> At some point, having different rstat trees for different subsystems
> was brought up. I never looked into actually implementing it, but I
> suppose if we do that, we could have a generic scheme similar to
> memcg_vmstats_needs_flush() that can be customized by each subsystem
> in a clean, performant way? I am not sure.
>
> [..]
>>>>
>>>>>> I vaguely recall experimenting locally with changing that lock into a
>>>>>> mutex and not liking the results, but I can't remember much more. I
>>>>>> could be misremembering though.
>>>>>>
>>>>>> Currently, the lock is dropped in cgroup_rstat_flush_locked() between
>>>>>> CPU iterations if rescheduling is needed or the lock is being
>>>>>> contended (i.e. spin_needbreak() returns true). I had always wondered
>>>>>> if it's possible to introduce a similar primitive for IRQs? We could
>>>>>> also drop the lock (and re-enable IRQs) if IRQs are pending then.
>>>>>
>>>>> I am not sure if there is a way to check if a hardirq is pending, but we
>>>>> do have a local_softirq_pending() helper.
>>>>
>>>> The local_softirq_pending() might work well for me, as this is our prod
>>>> problem, that CPU local pending softirq's are getting starved.
>>>
>>> If my understanding is correct, softirqs are usually scheduled by
>>> IRQs, which means that local_softirq_pending() may return false if
>>> there are pending IRQs (that will schedule softirqs). Is this correct?
>>>
>>
>> Yes, networking hard IRQs will raise softirqs, but software often also
>> raises softirqs.
>> I see where you are going with this... the cgroup_rstat_flush_locked()
>> loop's "play nice" check runs with the IRQ-disabling lock held, so you
>> speculate that hard IRQ handlers cannot run and raise softirqs, and thus
>> local_softirq_pending() will not work while the lock is held.
>
> Exactly.
>
> I wonder if it would be okay to just unconditionally drop the lock at
> each CPU boundary. Would be interesting to experiment with this. One
> disadvantage of the mutex in this case (imo) is that outside of the
> percpu spinlock critical section, we don't really need to be holding
> the global lock/mutex. So sleeping while holding it is not needed and
> only introduces problems. Dropping the spinlock at each boundary seems
> like a way to circumvent that.
>

This sounds interesting: unconditionally dropping the lock at each CPU
boundary. We should experiment with this.

> If the problems you are observing are mainly on CPUs that are holding
> the lock and flushing, I suspect this should help greatly. If the
> problems are mainly on CPUs spinning for the lock, I suspect it will
> still help redistribute the lock (and IRQ disablement) more often, but
> not as much.
>
>>
>>
>>>>
>>>> In production another problematic (but rarely occurring issue) is when
>>>> several CPUs contend on this lock. Yosry's recent work/patches have
>>>> already reduced the chances of this happening (thanks), BUT it still can
>>>> and does happen.
>>>> A simple solution to this, would be to do a spin_trylock() in
>>>> cgroup_rstat_flush(), and exit if we cannot get the lock, because we
>>>> know someone else will do the work.
>>>
>>> I am not sure I understand what you mean specifically with the checks
>>> below, but I generally don't like this (as you predicted :) ).
>>>
>>> On the memcg side, we used to have similar logic when we used to
>>> always flush the entire tree. This leaded to flushing being
>>> indeterministic. You would occasionally get stale stats because of the
>>> contention, which resulted in some inconsistencies (e.g. performing
>>> proactive reclaim successfully then reading the stats that do not
>>> reflect that).
>>>
>>> Now that we dropped the logic to always flush the entire tree, it is
>>> even more difficult because concurrent flushes could be in completely
>>> irrelevant subtrees.
>>>
>>> If we were to introduce some smart logic to figure out that the
>>> subtree we are trying to flush is already being flushed, I think we
>>> would need to wait for that ongoing flush to complete instead of just
>>> returning (e.g. using completions). But I think such implementations
>>> to find overlapping flushes and wait for them may be too compicated.
>>>
>>
>> We will see if you hate my current code approach ;-)
>
> Just to be clear, if the spinlock were to be converted to a mutex, or
> to be dropped at each CPU boundary, do you think such ratelimiting
> would still be needed to mitigate lock contention -- even if the IRQ
> latency problem is fixed?

With a mutex, lock contention will be less obvious, as converting this
to a mutex avoids multiple CPUs spinning while waiting for the lock, but
it doesn't remove the lock contention.

Userspace can easily trigger pressure on the global cgroup_rstat_lock
simply by reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
I think we need a system to mitigate lock contention from userspace
(a code proposal is coming; it is compiling as I write this). We see
normal userspace stats tools like cadvisor, nomad (and systemd) trigger
this by reading all the stat files on the system, even spawning parallel
threads, without realizing that on the kernel side they share the same
global lock.

You have done a huge effort to mitigate lock contention from the memcg
side, thank you for that. It would be sad if userspace reading these
stat files could block memcg. In production I see shrink_node() hitting
a congestion point on this global lock.

--Jesper

2024-04-16 18:41:37

by Shakeel Butt

[permalink] [raw]
Subject: Re: Advice on cgroup rstat lock

On Tue, Apr 16, 2024 at 04:22:51PM +0200, Jesper Dangaard Brouer wrote:

Sorry for the late response. I see there are patches posted as well,
which I will take a look at, but let me put some things in perspective.

>
>
> >
> > I personally don't like mem_cgroup_flush_stats_ratelimited() very
> > much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> > and a lot of changes can happen in a very short amount of time.
> > However, it seems like for some workloads it's a necessary evil :/
> >

Other than obj_cgroup_may_zswap(), there is no other place which really
needs very accurate stats. IMO we should actually make the ratelimited
version the default one for all the places. Stats will always be out of
sync for some time window, even with a non-ratelimited flush, and I
don't see any place where 2-second-old stats would be an issue.

>
> I like the combination of the two: mem_cgroup_flush_stats_ratelimited()
> and memcg_vmstats_needs_flush().
> IMHO the jiffies rate limit of 2*FLUSH_TIME is too high; that looks
> like 4 sec?

4 sec is the worst case, and I don't think anyone has seen or reported
that they are seeing a 4 sec delayed flush; and if it is happening, it
seems like no one cares.

>
>
> > I briefly looked into a global scheme similar to
> > memcg_vmstats_needs_flush() in core cgroups code, but I gave up
> > quickly. Different subsystems have different incomparable stats, so we
> > cannot have a simple magnitude of pending updates on a cgroup-level
> > that represents all subsystems fairly.
> >
> > I tried to have per-subsystem callbacks to update the pending stats
> > and check if flushing is required -- but it got complicated quickly
> > and performance was bad.
> >
>
> I like the time-based limit because it doesn't require tracking pending
> updates.
>
> I'm looking at using a time-based limit on how often userspace can take
> the lock, but in the range of 50 ms to 100 ms.

Sounds good to me; you might just need to check that
obj_cgroup_may_zswap() is not getting delayed or seeing stale stats.

>
>
> With a mutex, lock contention will be less obvious, as converting this
> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention.
>

I don't like global sleepable locks, as those are a source of priority
inversion issues on highly utilized multi-tenant systems, but I still
need to see how you are handling that.

> Userspace can easily trigger pressure on the global cgroup_rstat_lock
> simply by reading io.stat and cpu.stat files (under /sys/fs/cgroup/).
> I think we need a system to mitigate lock contention from userspace
> (a code proposal is coming; it is compiling as I write this). We see
> normal userspace stats tools like cadvisor, nomad (and systemd) trigger
> this by reading all the stat files on the system, even spawning parallel
> threads, without realizing that on the kernel side they share the same
> global lock.
>
> You have done a huge effort to mitigate lock contention from the memcg
> side, thank you for that. It would be sad if userspace reading these
> stat files could block memcg. In production I see shrink_node() hitting
> a congestion point on this global lock.

Seems like another instance where we should use the ratelimited version
of the flush function.

2024-04-18 02:05:46

by Yosry Ahmed

[permalink] [raw]
Subject: Re: Advice on cgroup rstat lock

[..]

> > > I personally don't like mem_cgroup_flush_stats_ratelimited() very
> > > much, because it is time-based (unlike memcg_vmstats_needs_flush()),
> > > and a lot of changes can happen in a very short amount of time.
> > > However, it seems like for some workloads it's a necessary evil :/
> > >
>
> Other than obj_cgroup_may_zswap(), there is no other place which really
> needs very accurate stats. IMO we should actually make the ratelimited
> version the default one for all the places. Stats will always be out of
> sync for some time window, even with a non-ratelimited flush, and I
> don't see any place where 2-second-old stats would be an issue.

We disagreed about this before, and I am not trying to get you to
debate this with me again :)

I just prefer that we avoid this if possible. We have seen cases where
the 2 sec window caused issues: not because 2 sec is a long time, but
because userspace reads the stats after an event occurs (e.g.
proactive reclaim) and gets stats from before the event.

[..]
>
> >
> >
> > With a mutex, lock contention will be less obvious, as converting this
> > to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> > it doesn't remove the lock contention.
> >
>
> I don't like global sleepable locks, as those are a source of priority
> inversion issues on highly utilized multi-tenant systems, but I still
> need to see how you are handling that.

For context, this was discussed before as well in [1].

[1] https://lore.kernel.org/lkml/CALvZod441xBoXzhqLWTZ+xnqDOFkHmvrzspr9NAr+nybqXgS-A@mail.gmail.com/