2023-03-27 08:10:37

by Aaron Lu

Subject: rq lock contention due to commit af7f588d8f73

Hi Mathieu,

I was doing some optimization work[1] for the kernel scheduler using a
database workload, sysbench+postgres. Before submitting my work, I
rebased my patch on top of the latest v6.3-rc kernels to check that
everything still worked as expected, and found that the rq lock became
very heavily contended compared to v6.2 based kernels.

Using the above-mentioned workload, before commit af7f588d8f73 ("sched:
Introduce per-memory-map concurrency ID"), the profile looked like:

7.30% 0.71% [kernel.vmlinux] [k] __schedule
0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

After that commit:

49.01% 0.87% [kernel.vmlinux] [k] __schedule
43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

The above profile was captured with sysbench's nr_threads set to 56; with
a higher thread count, the contention became even more severe on that
2-socket/112-core/224-CPU Intel Sapphire Rapids server.

The docker image I used for the optimization work is not available
publicly, but I managed to reproduce the problem using only publicly
available pieces. Here are the steps:
1 docker pull postgres
2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
3 go inside the container
sudo docker exec -it $the_just_started_container_id bash
4 install sysbench inside the container
sudo apt update && sudo apt install sysbench
5 prepare
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
6 run
root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
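Note: for step 3, the container id can be looked up with e.g. (just a
convenience, any way of finding the id works):
sudo docker ps -q -f name=postgres-instance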

Let it warm up a little; after 10-20s you can profile and see the
increased rq lock contention. You may need a machine with at least 56
CPUs to see this; I didn't try on other machines.
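
In case it helps, a system-wide profile like the above can be captured
with something along these lines (the exact perf options are just an
example, not necessarily what I used):
# sample all CPUs with call graphs for ~10s while the run is ongoing
perf record -a -g -- sleep 10
perf report --stdio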

Feel free to let me know if you need any other info.

[1]: https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/

Best wishes,
Aaron


2023-03-27 09:12:48

by Peter Zijlstra

Subject: Re: rq lock contention due to commit af7f588d8f73

On Mon, Mar 27, 2023 at 04:05:02PM +0800, Aaron Lu wrote:
> Hi Mathieu,
>
> I was doing some optimization work[1] for kernel scheduler using a
> database workload: sysbench+postgres and before I submit my work, I
> rebased my patch on top of latest v6.3-rc kernels to see if everything
> still works expected and then I found rq's lock became very heavily
> contended as compared to v6.2 based kernels.
>
> Using the above mentioned workload, before commit af7f588d8f73("sched:
> Introduce per-memory-map concurrency ID"), the profile looked like:
>
> 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> After that commit:
>
> 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

Unlike what the subject says, if it is due to that commit, then it's not
rq lock but that new cid_lock thing.

Can you frob init/Kconfig and make SCHED_MM_CID user configurable and
disable it to confirm?

(also, mathieu, when you do the below, you'll see it is in a weird spot)

diff --git a/init/Kconfig b/init/Kconfig
index 1fb5f313d18f..f2661f73f3dd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1042,7 +1042,8 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED

config SCHED_MM_CID
- def_bool y
+ bool "RSEQ Concurrency ID"
+ default y
depends on SMP && RSEQ

config UCLAMP_TASK_GROUP

2023-03-27 10:22:52

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Mon, Mar 27, 2023 at 11:09:51AM +0200, Peter Zijlstra wrote:
> On Mon, Mar 27, 2023 at 04:05:02PM +0800, Aaron Lu wrote:
> > Hi Mathieu,
> >
> > I was doing some optimization work[1] for kernel scheduler using a
> > database workload: sysbench+postgres and before I submit my work, I
> > rebased my patch on top of latest v6.3-rc kernels to see if everything
> > still works expected and then I found rq's lock became very heavily
> > contended as compared to v6.2 based kernels.
> >
> > Using the above mentioned workload, before commit af7f588d8f73("sched:
> > Introduce per-memory-map concurrency ID"), the profile looked like:
> >
> > 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> > 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >
> > After that commit:
> >
> > 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> > 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> Unlike what the subject says, if it is due to that commit, then it's not
> rq lock but that new cid_lock thing.

Ah, my mistake. I didn't take a closer look at the commit, and seeing
that the __schedule path had lock contention made me think it was rq->lock.

>
> Can you frob init/Kconfig and make SCHED_MM_CID user configurable and
> disable it to confirm?

Sure thing, compiling now and will let you know the result once done.

>
> (also, mathieu, when you do the below, you'll see it is in a weird spot)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 1fb5f313d18f..f2661f73f3dd 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1042,7 +1042,8 @@ config RT_GROUP_SCHED
> endif #CGROUP_SCHED
>
> config SCHED_MM_CID
> - def_bool y
> + bool "RSEQ Concurrency ID"
> + default y
> depends on SMP && RSEQ
>
> config UCLAMP_TASK_GROUP

2023-03-27 10:55:42

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Mon, Mar 27, 2023 at 11:09:51AM +0200, Peter Zijlstra wrote:
> On Mon, Mar 27, 2023 at 04:05:02PM +0800, Aaron Lu wrote:
> > Hi Mathieu,
> >
> > I was doing some optimization work[1] for kernel scheduler using a
> > database workload: sysbench+postgres and before I submit my work, I
> > rebased my patch on top of latest v6.3-rc kernels to see if everything
> > still works expected and then I found rq's lock became very heavily
> > contended as compared to v6.2 based kernels.
> >
> > Using the above mentioned workload, before commit af7f588d8f73("sched:
> > Introduce per-memory-map concurrency ID"), the profile looked like:
> >
> > 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> > 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >
> > After that commit:
> >
> > 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> > 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> Unlike what the subject says, if it is due to that commit, then it's not
> rq lock but that new cid_lock thing.
>
> Can you frob init/Kconfig and make SCHED_MM_CID user configurable and
> disable it to confirm?

Confirmed the problem is gone after disabling it through menuconfig with
the below diff applied on top of v6.3-rc4.
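
(For reference, with the diff applied the option can also be flipped
non-interactively from the kernel source tree, something like:
scripts/config --disable SCHED_MM_CID
make olddefconfig
just a convenience sketch; menuconfig works equally well.)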

>
> (also, mathieu, when you do the below, you'll see it is in a weird spot)
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 1fb5f313d18f..f2661f73f3dd 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1042,7 +1042,8 @@ config RT_GROUP_SCHED
> endif #CGROUP_SCHED
>
> config SCHED_MM_CID
> - def_bool y
> + bool "RSEQ Concurrency ID"
> + default y
> depends on SMP && RSEQ
>
> config UCLAMP_TASK_GROUP

2023-03-27 13:24:09

by Mathieu Desnoyers

Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-27 04:05, Aaron Lu wrote:
> Hi Mathieu,
>
> I was doing some optimization work[1] for kernel scheduler using a
> database workload: sysbench+postgres and before I submit my work, I
> rebased my patch on top of latest v6.3-rc kernels to see if everything
> still works expected and then I found rq's lock became very heavily
> contended as compared to v6.2 based kernels.
>
> Using the above mentioned workload, before commit af7f588d8f73("sched:
> Introduce per-memory-map concurrency ID"), the profile looked like:
>
> 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> After that commit:
>
> 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> The above profile was captured with sysbench's nr_threads set to 56; if
> I used more thread number, the contention would be more severe on that
> 2sockets/112core/224cpu Intel Sapphire Rapids server.
>
> The docker image I used to do optimization work is not available outside
> but I managed to reproduce this problem using only publicaly available
> stuffs, here it goes:
> 1 docker pull postgres
> 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> 3 go inside the container
> sudo docker exec -it $the_just_started_container_id bash
> 4 install sysbench inside container
> sudo apt update and sudo apt install sysbench
> 5 prepare
> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> 6 run
> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
>
> Let it warm up a little bit and after 10-20s you can do profile and see
> the increased rq lock contention. You may need a machine that has at
> least 56 cpus to see this, I didn't try on other machines.
>
> Feel free to let me know if you need any other info.

While I set up my dev machine with this reproducer, here are a few
questions to help figure out the context:

I understand that pgsql is a multi-process database. Is it strictly
single-threaded per-process, or does each process have more than
one thread ?

I understand that your workload is scheduling between threads which
belong to different processes. Are there more heavily active threads
than there are scheduler runqueues (CPUs) on your machine ?

When I developed the mm_cid feature, I originally implemented two additional
optimizations. As I noted at the time:

Additional optimizations can be done if the spin locks added when
context switching between threads belonging to different memory maps end
up being a performance bottleneck. Those are left out of this patch
though. A performance impact would have to be clearly demonstrated to
justify the added complexity.

I suspect that your workload demonstrates the need for at least one of those
optimizations. I just wonder if we are in a purely single-threaded scenario
for each process, or if each process has many threads.

Thanks,

Mathieu


>
> [1]: https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
>
> Best wishes,
> Aaron

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-03-27 14:09:48

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Mon, Mar 27, 2023 at 09:20:44AM -0400, Mathieu Desnoyers wrote:
> On 2023-03-27 04:05, Aaron Lu wrote:
> > Hi Mathieu,
> >
> > I was doing some optimization work[1] for kernel scheduler using a
> > database workload: sysbench+postgres and before I submit my work, I
> > rebased my patch on top of latest v6.3-rc kernels to see if everything
> > still works expected and then I found rq's lock became very heavily
> > contended as compared to v6.2 based kernels.
> >
> > Using the above mentioned workload, before commit af7f588d8f73("sched:
> > Introduce per-memory-map concurrency ID"), the profile looked like:
> >
> > 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> > 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >
> > After that commit:
> >
> > 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> > 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> >
> > The above profile was captured with sysbench's nr_threads set to 56; if
> > I used more thread number, the contention would be more severe on that
> > 2sockets/112core/224cpu Intel Sapphire Rapids server.
> >
> > The docker image I used to do optimization work is not available outside
> > but I managed to reproduce this problem using only publicaly available
> > stuffs, here it goes:
> > 1 docker pull postgres
> > 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
> > 3 go inside the container
> > sudo docker exec -it $the_just_started_container_id bash
> > 4 install sysbench inside container
> > sudo apt update and sudo apt install sysbench
> > 5 prepare
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
> > 6 run
> > root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
> >
> > Let it warm up a little bit and after 10-20s you can do profile and see
> > the increased rq lock contention. You may need a machine that has at
> > least 56 cpus to see this, I didn't try on other machines.
> >
> > Feel free to let me know if you need any other info.
>
> While I setup my dev machine with this reproducer, here are a few
> questions to help figure out the context:
>
> I understand that pgsql is a multi-process database. Is it strictly
> single-threaded per-process, or does each process have more than
> one thread ?

I do not know the details of Postgres, but according to this:
https://wiki.postgresql.org/wiki/FAQ#How_does_PostgreSQL_use_CPU_resources.3F
I think it is single-threaded per-process.

The client, sysbench, is a single process with multiple threads, IIUC.
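
FWIW, one rough way to double check this from the host (assuming the
container's processes are visible there) is to look at per-process
thread counts, e.g.:
# NLWP is the number of threads in each process
ps -o pid,nlwp,comm -C postgres
ps -o pid,nlwp,comm -C sysbench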

>
> I understand that your workload is scheduling between threads which
> belong to different processes. Are there more heavily active threads
> than there are scheduler runqueues (CPUs) on your machine ?

In the reproducer I described above, 56 threads are started on the
client side, and if each client thread is served by a server process,
there would be about 112 tasks. I don't think a client thread and its
server process are active at the same time, but even if they are,
112 is still smaller than the machine's CPU count: 224.
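
If useful, the number of runnable threads at any instant can be roughly
eyeballed with something like:
# count threads currently in the R (runnable) state
ps -eLo stat | grep -c '^R'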

>
> When I developed the mm_cid feature, I originally implemented two additional
> optimizations:
>
> Additional optimizations can be done if the spin locks added when
> context switching between threads belonging to different memory maps end
> up being a performance bottleneck. Those are left out of this patch
> though. A performance impact would have to be clearly demonstrated to
> justify the added complexity.
>
> I suspect that your workload demonstrates the need for at least one of those
> optimizations. I just wonder if we are in a purely single-threaded scenario
> for each process, or if each process has many threads.

My understanding is: the server side is single-threaded and the client
side is multi-threaded.

Thanks,
Aaron

2023-03-27 14:12:24

by Mathieu Desnoyers

Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-27 10:04, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 09:20:44AM -0400, Mathieu Desnoyers wrote:
>> On 2023-03-27 04:05, Aaron Lu wrote:
>>> Hi Mathieu,
>>>
>>> I was doing some optimization work[1] for kernel scheduler using a
>>> database workload: sysbench+postgres and before I submit my work, I
>>> rebased my patch on top of latest v6.3-rc kernels to see if everything
>>> still works expected and then I found rq's lock became very heavily
>>> contended as compared to v6.2 based kernels.
>>>
>>> Using the above mentioned workload, before commit af7f588d8f73("sched:
>>> Introduce per-memory-map concurrency ID"), the profile looked like:
>>>
>>> 7.30% 0.71% [kernel.vmlinux] [k] __schedule
>>> 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>
>>> After that commit:
>>>
>>> 49.01% 0.87% [kernel.vmlinux] [k] __schedule
>>> 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>
>>> The above profile was captured with sysbench's nr_threads set to 56; if
>>> I used more thread number, the contention would be more severe on that
>>> 2sockets/112core/224cpu Intel Sapphire Rapids server.
>>>
>>> The docker image I used to do optimization work is not available outside
>>> but I managed to reproduce this problem using only publicaly available
>>> stuffs, here it goes:
>>> 1 docker pull postgres
>>> 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
>>> 3 go inside the container
>>> sudo docker exec -it $the_just_started_container_id bash
>>> 4 install sysbench inside container
>>> sudo apt update and sudo apt install sysbench
>>> 5 prepare
>>> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
>>> 6 run
>>> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
>>>
>>> Let it warm up a little bit and after 10-20s you can do profile and see
>>> the increased rq lock contention. You may need a machine that has at
>>> least 56 cpus to see this, I didn't try on other machines.
>>>
>>> Feel free to let me know if you need any other info.
>>
>> While I setup my dev machine with this reproducer, here are a few
>> questions to help figure out the context:
>>
>> I understand that pgsql is a multi-process database. Is it strictly
>> single-threaded per-process, or does each process have more than
>> one thread ?
>
> I do not know the details of Postgres, according to this:
> https://wiki.postgresql.org/wiki/FAQ#How_does_PostgreSQL_use_CPU_resources.3F
> I think it is single-threaded per-process.
>
> The client, sysbench, is single process multi-threaded IIUC.
>
>>
>> I understand that your workload is scheduling between threads which
>> belong to different processes. Are there more heavily active threads
>> than there are scheduler runqueues (CPUs) on your machine ?
>
> In the reproducer I described above, 56 threads are started on the
> client side and if each client thread is served by a server process,
> there would be about 112 tasks. I don't think the client thread and
> the server process are active at the same time but even if they are,
> 112 is still smaller than the machine's CPU number: 224.
>
>>
>> When I developed the mm_cid feature, I originally implemented two additional
>> optimizations:
>>
>> Additional optimizations can be done if the spin locks added when
>> context switching between threads belonging to different memory maps end
>> up being a performance bottleneck. Those are left out of this patch
>> though. A performance impact would have to be clearly demonstrated to
>> justify the added complexity.
>>
>> I suspect that your workload demonstrates the need for at least one of those
>> optimizations. I just wonder if we are in a purely single-threaded scenario
>> for each process, or if each process has many threads.
>
> My understanding is: the server side is single threaded and the client
> side is multi threaded.

Indeed, I just validated this by successfully running your reproducer
locally, and htop confirms that the client is a single process with many
threads, and the server is multi-process, with a single thread per process.

In this case, the simple "single-threaded process" optimization would
not work, because the client side is multi-threaded: the scheduler
will switch back and forth between the client process and the server
processes.

So this appears to call for my mm_cid runqueue cache.

Thanks,

Mathieu


>
> Thanks,
> Aaron

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-03-27 20:07:59

by Mathieu Desnoyers

Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-27 10:04, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 09:20:44AM -0400, Mathieu Desnoyers wrote:
>> On 2023-03-27 04:05, Aaron Lu wrote:
>>> Hi Mathieu,
>>>
>>> I was doing some optimization work[1] for kernel scheduler using a
>>> database workload: sysbench+postgres and before I submit my work, I
>>> rebased my patch on top of latest v6.3-rc kernels to see if everything
>>> still works expected and then I found rq's lock became very heavily
>>> contended as compared to v6.2 based kernels.
>>>
>>> Using the above mentioned workload, before commit af7f588d8f73("sched:
>>> Introduce per-memory-map concurrency ID"), the profile looked like:
>>>
>>> 7.30% 0.71% [kernel.vmlinux] [k] __schedule
>>> 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>
>>> After that commit:
>>>
>>> 49.01% 0.87% [kernel.vmlinux] [k] __schedule
>>> 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>>>
>>> The above profile was captured with sysbench's nr_threads set to 56; if
>>> I used more thread number, the contention would be more severe on that
>>> 2sockets/112core/224cpu Intel Sapphire Rapids server.
>>>
>>> The docker image I used to do optimization work is not available outside
>>> but I managed to reproduce this problem using only publicaly available
>>> stuffs, here it goes:
>>> 1 docker pull postgres
>>> 2 sudo docker run --rm --name postgres-instance -e POSTGRES_PASSWORD=mypass -e POSTGRES_USER=sbtest -d postgres -c shared_buffers=80MB -c max_connections=250
>>> 3 go inside the container
>>> sudo docker exec -it $the_just_started_container_id bash
>>> 4 install sysbench inside container
>>> sudo apt update and sudo apt install sysbench
>>> 5 prepare
>>> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua prepare
>>> 6 run
>>> root@container:/# sysbench --db-driver=pgsql --pgsql-user=sbtest --pgsql_password=mypass --pgsql-db=sbtest --pgsql-port=5432 --tables=16 --table-size=10000 --threads=56 --time=60 --report-interval=2 /usr/share/sysbench/oltp_read_only.lua run
>>>
>>> Let it warm up a little bit and after 10-20s you can do profile and see
>>> the increased rq lock contention. You may need a machine that has at
>>> least 56 cpus to see this, I didn't try on other machines.
>>>
>>> Feel free to let me know if you need any other info.
>>
>> While I setup my dev machine with this reproducer, here are a few
>> questions to help figure out the context:
>>
>> I understand that pgsql is a multi-process database. Is it strictly
>> single-threaded per-process, or does each process have more than
>> one thread ?
>
> I do not know the details of Postgres, according to this:
> https://wiki.postgresql.org/wiki/FAQ#How_does_PostgreSQL_use_CPU_resources.3F
> I think it is single-threaded per-process.
>
> The client, sysbench, is single process multi-threaded IIUC.
>
>>
>> I understand that your workload is scheduling between threads which
>> belong to different processes. Are there more heavily active threads
>> than there are scheduler runqueues (CPUs) on your machine ?
>
> In the reproducer I described above, 56 threads are started on the
> client side and if each client thread is served by a server process,
> there would be about 112 tasks. I don't think the client thread and
> the server process are active at the same time but even if they are,
> 112 is still smaller than the machine's CPU number: 224.
>
>>
>> When I developed the mm_cid feature, I originally implemented two additional
>> optimizations:
>>
>> Additional optimizations can be done if the spin locks added when
>> context switching between threads belonging to different memory maps end
>> up being a performance bottleneck. Those are left out of this patch
>> though. A performance impact would have to be clearly demonstrated to
>> justify the added complexity.
>>
>> I suspect that your workload demonstrates the need for at least one of those
>> optimizations. I just wonder if we are in a purely single-threaded scenario
>> for each process, or if each process has many threads.
>
> My understanding is: the server side is single threaded and the client
> side is multi threaded.

OK.

I've just resuscitated my per-runqueue concurrency ID cache patch from an older
patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
you test it in your environment to see if I'm on the right track ?

https://lore.kernel.org/lkml/[email protected]/

Thanks!

Mathieu


>
> Thanks,
> Aaron

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-03-28 07:07:33

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:

> I've just resuscitated my per-runqueue concurrency ID cache patch from an older
> patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
> you test it in your environment to see if I'm on the right track ?
>
> https://lore.kernel.org/lkml/[email protected]/

There are improvements with this patch.

When running the client-side sysbench with nr_thread=56, the lock contention
is gone; with nr_thread=224 (= nr_cpu of this machine), the lock contention
dropped from 75% to 27%.

v6.3.0-rc4:

75.21% 75.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
37.30% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_sys_poll;__x64_sys_poll;do_syscall_64;entry_SYSCALL_64_after_hwframe;__poll;0x7f943d6fcff8;PQgetResult;0x7f943d6f9a2b;0x55c7f9bde88b
26.01% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
11.36% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64_after_hwframe;epoll_wait;secure_read;0x55c4d1363867;pq_getbyte;PostgresMain;0x55c4d140c828;PostmasterMain;main;__libc_start_main;0x5541d68949564100

v6.3.0-rc4+the_above_patch:

27.86% 27.85% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
7.31% native_queued_spin_lock_slowpath;_raw_spin_lock;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
4.62% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
4.20% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait;secure_read;0x5637a602e867;pq_getbyte;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
1.66% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;_raw_spin_rq_lock_irqsave;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send
1.65% native_queued_spin_lock_slowpath;_raw_spin_lock;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_data_queue;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e69e;0x5637a602e7ae;0x5637a5e5f62e;standard_ExecutorRun;0x5637a615abeb;PortalRun;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
1.63% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait;secure_read;0x5637a602e867;pq_getbyte;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
1.40% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
1.20% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;_raw_spin_rq_lock_irqsave;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
0.83% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__task_rq_lock;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send
0.65% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify

2023-03-28 12:43:57

by Mathieu Desnoyers

Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-28 02:58, Aaron Lu wrote:
> On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:
>
>> I've just resuscitated my per-runqueue concurrency ID cache patch from an older
>> patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
>> you test it in your environment to see if I'm on the right track ?
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> There are improvements with this patch.
>
> When running the client side sysbench with nr_thread=56, the lock contention
> is gone%; with nr_thread=224(=nr_cpu of this machine), the lock contention
> dropped from 75% to 27%.

This is a good start!

Can you compare this with Peter's approach to modify init/Kconfig, make
SCHED_MM_CID a bool, and set it =n in the kernel config ?

I just want to see what baseline we should compare against.

Another test we would want to try here: there is an arbitrary choice for
the runqueue cache array size in my own patch:

kernel/sched/sched.h:
# define RQ_CID_CACHE_SIZE 8

Can you try changing this value to 16 or 32 instead and see if it helps?
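
(That is just bumping the define, e.g. something like:
sed -i 's/RQ_CID_CACHE_SIZE 8/RQ_CID_CACHE_SIZE 32/' kernel/sched/sched.h
or simply edit it by hand and rebuild.)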

Thanks,

Mathieu

>
> v6.3.0-rc4:
>
> 75.21% 75.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 37.30% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_sys_poll;__x64_sys_poll;do_syscall_64;entry_SYSCALL_64_after_hwframe;__poll;0x7f943d6fcff8;PQgetResult;0x7f943d6f9a2b;0x55c7f9bde88b
> 26.01% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
> 11.36% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64_after_hwframe;epoll_wait;secure_read;0x55c4d1363867;pq_getbyte;PostgresMain;0x55c4d140c828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
>
> v6.3.0-rc4+the_above_patch:
>
> 27.86% 27.85% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 7.31% native_queued_spin_lock_slowpath;_raw_spin_lock;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 4.62% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify
> 4.20% native_queued_spin_lock_slowpath;_raw_spin_lock;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait;secure_read;0x5637a602e867;pq_getbyte;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 1.66% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;_raw_spin_rq_lock_irqsave;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send
> 1.65% native_queued_spin_lock_slowpath;_raw_spin_lock;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_data_queue;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e69e;0x5637a602e7ae;0x5637a5e5f62e;standard_ExecutorRun;0x5637a615abeb;PortalRun;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 1.63% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__schedule;schedule;schedule_hrtimeout_range_clock;schedule_hrtimeout_range;do_epoll_wait;__x64_sys_epoll_wait;do_syscall_64;entry_SYSCALL_64;epoll_wait;secure_read;0x5637a602e867;pq_getbyte;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 1.40% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 1.20% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;_raw_spin_rq_lock_irqsave;try_to_wake_up;default_wake_function;pollwake;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send;0x5637a602e4cd;0x5637a602e98d;PostgresMain;0x5637a60d7828;PostmasterMain;main;__libc_start_main;0x5541d68949564100
> 0.83% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__task_rq_lock;try_to_wake_up;default_wake_function;ep_autoremove_wake_function;__wake_up_common;__wake_up_common_lock;__wake_up;ep_poll_callback;__wake_up_common;__wake_up_common_lock;__wake_up_sync_key;sock_def_readable;tcp_data_ready;tcp_rcv_established;tcp_v4_do_rcv;tcp_v4_rcv;ip_protocol_deliver_rcu;ip_local_deliver_finish;ip_local_deliver;ip_rcv;__netif_receive_skb_one_core;__netif_receive_skb;process_backlog;__napi_poll;net_rx_action;__do_softirq;do_softirq.part.0;__local_bh_enable_ip;ip_finish_output2;__ip_finish_output;ip_finish_output;ip_output;ip_local_out;__ip_queue_xmit;ip_queue_xmit;__tcp_transmit_skb;tcp_write_xmit;__tcp_push_pending_frames;tcp_push;tcp_sendmsg_locked;tcp_sendmsg;inet_sendmsg;sock_sendmsg;__sys_sendto;__x64_sys_sendto;do_syscall_64;entry_SYSCALL_64;__libc_send
> 0.65% native_queued_spin_lock_slowpath;_raw_spin_lock;raw_spin_rq_lock_nested;__schedule;schedule_idle;do_idle;cpu_startup_entry;start_secondary;secondary_startup_64_no_verify

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-03-28 13:11:50

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Tue, Mar 28, 2023 at 08:39:41AM -0400, Mathieu Desnoyers wrote:
> On 2023-03-28 02:58, Aaron Lu wrote:
> > On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:
> > > I've just resuscitated my per-runqueue concurrency ID cache patch from an older
> > > patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
> > > you test it in your environment to see if I'm on the right track ?
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > There are improvements with this patch.
> >
> > When running the client side sysbench with nr_thread=56, the lock contention
> > is gone%; with nr_thread=224(=nr_cpu of this machine), the lock contention
> > dropped from 75% to 27%.
>
> This is a good start!

Yes it is.

>
> Can you compare this with Peter's approach to modify init/Kconfig, make
> SCHED_MM_CID a bool, and set it =n in the kernel config ?

I did that yesterday and IIRC, when SCHED_MM_CID is disabled, the lock
contention is also gone for nr_thread=224.

>
> I just want to see what baseline we should compare against.

The baseline is: when there is no cid_lock, there is (almost) no lock
contention for this workload :-)

>
> Another test we would want to try here: there is an arbitrary choice for the
> runqueue cache array size in my own patch:
>
> kernel/sched/sched.h:
> # define RQ_CID_CACHE_SIZE 8
>
> Can you try changing this value for 16 or 32 instead and see if it helps?

Yes sure.

Can't promise I can do this tonight but should be able to finish them
tomorrow.

Thanks,
Aaron

2023-03-29 07:49:08

by Aaron Lu

Subject: Re: rq lock contention due to commit af7f588d8f73

On Tue, Mar 28, 2023 at 08:39:41AM -0400, Mathieu Desnoyers wrote:
> On 2023-03-28 02:58, Aaron Lu wrote:
> > On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:
> > > I've just resuscitated my per-runqueue concurrency ID cache patch from an older
> > > patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
> > > you test it in your environment to see if I'm on the right track ?
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> >
> > There are improvements with this patch.
> >
> > When running the client side sysbench with nr_thread=56, the lock contention
> > is gone%; with nr_thread=224(=nr_cpu of this machine), the lock contention
> > dropped from 75% to 27%.
>
> This is a good start!
>
> Can you compare this with Peter's approach to modify init/Kconfig, make
> SCHED_MM_CID a bool, and set it =n in the kernel config ?
>
> I just want to see what baseline we should compare against.
>
> Another test we would want to try here: there is an arbitrary choice for the
> runqueue cache array size in my own patch:
>
> kernel/sched/sched.h:
> # define RQ_CID_CACHE_SIZE 8
>
> Can you try changing this value for 16 or 32 instead and see if it helps?

I tried 32. The short answer is: for the nr_thread=224 case, using a larger
value doesn't make an obvious difference.

Here is more detailed info.

During a 5-minute run, I captured 5s of perf data every 30 seconds. To avoid
perf recording too much data, since this machine has 224 CPUs,
I picked 4 CPUs from each node when doing perf record; here are the results:

Your RFC patch that did mm_cid rq cache:
node0_1.profile: 26.07% 26.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 28.38% 28.37% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 25.44% 25.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 16.14% 16.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 15.17% 15.16% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 5.23% 5.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 2.64% 2.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 2.87% 2.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 2.73% 2.73% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 23.78% 23.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 25.11% 25.10% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 21.97% 21.95% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 19.37% 19.35% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 18.85% 18.84% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 11.22% 11.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 1.65% 1.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 1.68% 1.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 1.57% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

Changing RQ_CID_CACHE_SIZE to 32:
node0_1.profile: 29.25% 29.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 26.87% 26.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 24.23% 24.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 17.31% 17.30% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 3.61% 3.60% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 2.60% 2.59% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 1.77% 1.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 2.14% 2.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 2.20% 2.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 27.25% 27.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 25.12% 25.11% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 25.27% 25.26% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 19.48% 19.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 10.21% 10.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 3.01% 3.00% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 1.47% 1.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 1.52% 1.51% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 1.58% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

This workload has a characteristic that, in the initial ~2 minutes, it has
more wakeups and task migrations, which probably explains why the lock
contention dropped in the later profiles.

For comparison, the vanilla v6.3-rc4:
node0_1.profile: 71.27% 71.26% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 72.14% 72.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 72.68% 72.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 73.30% 73.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 77.54% 77.53% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 76.05% 76.04% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 75.08% 75.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 75.78% 75.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 75.30% 75.30% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 68.40% 68.40% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 69.19% 69.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 68.74% 68.74% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 59.99% 59.98% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 56.81% 56.80% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 53.46% 53.45% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 28.90% 28.88% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 27.70% 27.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 27.17% 27.14% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath

And when CONFIG_SCHED_MM_CID is off on top of v6.3-rc4:
node0_1.profile: 0.09% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_2.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_3.profile: 0.09% 0.09% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_4.profile: 0.10% 0.10% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_5.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_6.profile: 0.09% 0.09% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_7.profile: 0.15% 0.15% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_8.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node0_9.profile: 0.08% 0.08% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_1.profile: 0.23% 0.22% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_2.profile: 0.28% 0.28% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_3.profile: 2.80% 2.80% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_4.profile: 4.29% 4.29% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_5.profile: 4.05% 4.05% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_6.profile: 2.93% 2.92% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_7.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_8.profile: 0.07% 0.07% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
node1_9.profile: 0.07% 0.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
As for the few profiles on node1 where lock contention is more than
0.3%, I've checked and those come from pkg_thermal_notify(), which should
be a separate issue.

Thanks,
Aaron

2023-03-29 18:08:49

by Mathieu Desnoyers

Subject: Re: rq lock contention due to commit af7f588d8f73

On 2023-03-29 03:45, Aaron Lu wrote:
> On Tue, Mar 28, 2023 at 08:39:41AM -0400, Mathieu Desnoyers wrote:
>> On 2023-03-28 02:58, Aaron Lu wrote:
>>> On Mon, Mar 27, 2023 at 03:57:43PM -0400, Mathieu Desnoyers wrote:
>>>> I've just resuscitated my per-runqueue concurrency ID cache patch from an older
>>>> patchset, and posted it as RFC. So far it passed one round of rseq selftests. Can
>>>> you test it in your environment to see if I'm on the right track ?
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> There are improvements with this patch.
>>>
>>> When running the client side sysbench with nr_thread=56, the lock contention
>>> is gone%; with nr_thread=224(=nr_cpu of this machine), the lock contention
>>> dropped from 75% to 27%.
>>
>> This is a good start!
>>
>> Can you compare this with Peter's approach to modify init/Kconfig, make
>> SCHED_MM_CID a bool, and set it =n in the kernel config ?
>>
>> I just want to see what baseline we should compare against.
>>
>> Another test we would want to try here: there is an arbitrary choice for the
>> runqueue cache array size in my own patch:
>>
>> kernel/sched/sched.h:
>> # define RQ_CID_CACHE_SIZE 8
>>
>> Can you try changing this value for 16 or 32 instead and see if it helps?
>
> I tried 32. The short answer is: for nr_thread=224 case, using a larger
> value doesn't show obvious difference.
>
> Here is more detailed info.
>
> During a 5 minutes run, I captued 5s perf every 30 seconds. To avoid
> getting too huge data recorded by perf since this machine has 224 cpus,
> I picked 4 cpus of each node when doing perf record and here are the results:
>
> Your RFC patch that did mm_cid rq cache:
> node0_1.profile: 26.07% 26.06% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_2.profile: 28.38% 28.37% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_3.profile: 25.44% 25.44% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_4.profile: 16.14% 16.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_5.profile: 15.17% 15.16% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_6.profile: 5.23% 5.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_7.profile: 2.64% 2.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_8.profile: 2.87% 2.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_9.profile: 2.73% 2.73% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_1.profile: 23.78% 23.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_2.profile: 25.11% 25.10% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_3.profile: 21.97% 21.95% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_4.profile: 19.37% 19.35% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_5.profile: 18.85% 18.84% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_6.profile: 11.22% 11.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_7.profile: 1.65% 1.64% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_8.profile: 1.68% 1.67% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_9.profile: 1.57% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> Changing RQ_CID_CACHE_SIZE to 32:
> node0_1.profile: 29.25% 29.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_2.profile: 26.87% 26.87% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_3.profile: 24.23% 24.23% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_4.profile: 17.31% 17.30% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_5.profile: 3.61% 3.60% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_6.profile: 2.60% 2.59% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_7.profile: 1.77% 1.77% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_8.profile: 2.14% 2.13% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node0_9.profile: 2.20% 2.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_1.profile: 27.25% 27.24% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_2.profile: 25.12% 25.11% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_3.profile: 25.27% 25.26% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_4.profile: 19.48% 19.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_5.profile: 10.21% 10.20% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_6.profile: 3.01% 3.00% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_7.profile: 1.47% 1.47% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_8.profile: 1.52% 1.51% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> node1_9.profile: 1.58% 1.56% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> This workload has a characteristic that in the initial ~2 minutes, it has
> more wakeups and task migrations and that probably can explain why lock
> contention dropped in later profiles.

Yeah, my RFC patch adds an rq lock on try-to-wakeup migrations, which I
suspect is causing this performance regression.

I've come up with a design for an alternative scheme which should be
much more lightweight locking-wise. I'll see if I can make it work and
let you know when I have something to test.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-04-04 09:57:34

by Thorsten Leemhuis

Subject: Re: rq lock contention due to commit af7f588d8f73

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 27.03.23 10:05, Aaron Lu wrote:
>
> I was doing some optimization work[1] for kernel scheduler using a
> database workload: sysbench+postgres and before I submit my work, I
> rebased my patch on top of latest v6.3-rc kernels to see if everything
> still works expected and then I found rq's lock became very heavily
> contended as compared to v6.2 based kernels.
>
> Using the above mentioned workload, before commit af7f588d8f73("sched:
> Introduce per-memory-map concurrency ID"), the profile looked like:
>
> 7.30% 0.71% [kernel.vmlinux] [k] __schedule
> 0.03% 0.03% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>
> After that commit:
>
> 49.01% 0.87% [kernel.vmlinux] [k] __schedule
> 43.20% 43.18% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
>

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced af7f588d8f73
#regzbot monitor
https://lore.kernel.org/lkml/[email protected]/
#regzbot monitor
https://lore.kernel.org/all/[email protected]/
#regzbot title sched: PostgreSQL performance regression introduced by mm_cid
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.