LinuxLists.cc - [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

2022-09-27 03:00:42

Subject: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Array staged_config in struct rdt_domain still maintains the original value when
resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
staged_config[CDP_DATA] will still be true. Since resctrl_arch_update_domains()
traverses all resctrl_conf_type, it will continue to update CDP_CODE and
CDP_DATA configurations, which can cause overflow problem.

The problem can be reproduced by the following commands:
# A system with 16 usable closids and mba disabled
mount -t resctrl resctrl -o cdp /sys/fs/resctrl
mkdir /sys/fs/resctrl/p{1..7}
umount /sys/fs/resctrl/
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/p{1..8}

dmesg will generate the following error:
[ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
[ 6180.951983] Call Trace:
[ 6180.954516] <IRQ>
[ 6180.956619] __flush_smp_call_function_queue+0x11d/0x170
[ 6180.962028] __sysvec_call_function+0x24/0xd0
[ 6180.966485] sysvec_call_function+0x89/0xc0
[ 6180.970760] </IRQ>
[ 6180.972947] <TASK>
[ 6180.975131] asm_sysvec_call_function+0x16/0x20
[ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
[ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
[ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
[ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
[ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
[ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
[ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
[ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
[ 6181.045202] ? cpuidle_enter_state+0xb2/0x400
[ 6181.049678] cpuidle_enter+0x24/0x40
[ 6181.053370] do_idle+0x1dd/0x260
[ 6181.056713] cpu_startup_entry+0x14/0x20
[ 6181.060753] rest_init+0xbb/0xc0
[ 6181.064097] arch_call_rest_init+0x5/0xa
[ 6181.068137] start_kernel+0x668/0x691
[ 6181.071914] secondary_startup_64_no_verify+0xe0/0xeb
[ 6181.077086] </TASK>

We fix this issue by clearing the staged configs when destroying schemata list.

Signed-off-by: Shawn Wang <[email protected]>
Suggested-by: Xin Hao <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index f276aff521e8..b4a817ae83ab 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
static void schemata_list_destroy(void)
{
struct resctrl_schema *s, *tmp;
+ struct rdt_domain *dom;

list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
+ /*
+ * Clear staged_config on each domain before schemata list is
+ * destroyed.
+ */
+ list_for_each_entry(dom, &s->res->domains, list)
+ memset(dom->staged_config, 0, sizeof(dom->staged_config));
list_del(&s->list);
kfree(s);
}
--
2.27.0

2022-09-27 14:15:34

by James Morse

[permalink] [raw]

Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Hi Shawn,

On 27/09/2022 03:54, Shawn Wang wrote:
> Array staged_config in struct rdt_domain still maintains the original value when
> resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
> without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
> staged_config[CDP_DATA] will still be true.

staged_config[CDP_DATA] is an array - its always 'true'. I think you mean
staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
memset() when the schemata file is written to.

> Since resctrl_arch_update_domains()
> traverses all resctrl_conf_type, it will continue to update CDP_CODE and
> CDP_DATA configurations, which can cause overflow problem.

Only if its called with a stale staged config, and it should only be called when the
schemata file is written to, which would memset() the staged config first.

> The problem can be reproduced by the following commands:
> # A system with 16 usable closids and mba disabled
> mount -t resctrl resctrl -o cdp /sys/fs/resctrl
> mkdir /sys/fs/resctrl/p{1..7}
> umount /sys/fs/resctrl/
> mount -t resctrl resctrl /sys/fs/resctrl
> mkdir /sys/fs/resctrl/p{1..8}

Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
You can't call apply_config() to set CPUs in the mask without that being set.

Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
should be left in its reset state from the last umount(), or setup.

I can't reproduce this on v6.0-rc7.
Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.

(I have mba enabled, but all this should affect is the number of closid available)

> dmesg will generate the following error:

Which kernel version is this?

> [ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
> 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)

Is 0x7ff the default CBM bitmap for this CPU? Or was it written in a step missing from the
reproducer above?

The rest of this splat isn't helpful as its the result of an IPI...

> [ 6180.951983] Call Trace:
> [ 6180.954516] <IRQ>
> [ 6180.956619] __flush_smp_call_function_queue+0x11d/0x170
> [ 6180.962028] __sysvec_call_function+0x24/0xd0
> [ 6180.966485] sysvec_call_function+0x89/0xc0
> [ 6180.970760] </IRQ>
> [ 6180.972947] <TASK>
> [ 6180.975131] asm_sysvec_call_function+0x16/0x20
> [ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
> [ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
> ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
> f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
> [ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
> [ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
> [ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
> [ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
> [ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
> [ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
> [ 6181.045202] ? cpuidle_enter_state+0xb2/0x400
> [ 6181.049678] cpuidle_enter+0x24/0x40
> [ 6181.053370] do_idle+0x1dd/0x260
> [ 6181.056713] cpu_startup_entry+0x14/0x20
> [ 6181.060753] rest_init+0xbb/0xc0
> [ 6181.064097] arch_call_rest_init+0x5/0xa
> [ 6181.068137] start_kernel+0x668/0x691
> [ 6181.071914] secondary_startup_64_no_verify+0xe0/0xeb
> [ 6181.077086] </TASK>

It would be good to know what triggered this IPI. It may not have been
resctrl_arch_update_domains(). This pattern also happens from reset_all_ctrls() which
happens during umount(). (and that would write the default CBM bitmap)

If you can reproduce this easily, could you add dump_stack() to update_config() to see if
any path is setting have_new_ctrl. You aren't writing to the schemata file in your reproducer.

> We fix this issue by clearing the staged configs when destroying schemata list.

> Signed-off-by: Shawn Wang <[email protected]>
> Suggested-by: Xin Hao <[email protected]>

If we can work out why you are seeing this, it would need a Fixes tag.

Otherwise I agree it makes sense to make this more robust, but it would need a different
commit message.

Thanks,

James

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index f276aff521e8..b4a817ae83ab 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
> static void schemata_list_destroy(void)
> {
> struct resctrl_schema *s, *tmp;
> + struct rdt_domain *dom;
>
> list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
> + /*
> + * Clear staged_config on each domain before schemata list is
> + * destroyed.
> + */
> + list_for_each_entry(dom, &s->res->domains, list)
> + memset(dom->staged_config, 0, sizeof(dom->staged_config));
> list_del(&s->list);
> kfree(s);
> }

2022-09-27 21:28:48

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Hi James and Shawn,

On 9/27/2022 6:06 AM, James Morse wrote:
> On 27/09/2022 03:54, Shawn Wang wrote:
>> Array staged_config in struct rdt_domain still maintains the original value when
>> resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
>> without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
>> staged_config[CDP_DATA] will still be true.
>
> staged_config[CDP_DATA] is an array - its always 'true'. I think you mean
> staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
> memset() when the schemata file is written to.
>
>
>> Since resctrl_arch_update_domains()
>> traverses all resctrl_conf_type, it will continue to update CDP_CODE and
>> CDP_DATA configurations, which can cause overflow problem.
>
> Only if its called with a stale staged config, and it should only be called when the
> schemata file is written to, which would memset() the staged config first.
>
>
>> The problem can be reproduced by the following commands:
>> # A system with 16 usable closids and mba disabled
>> mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>> mkdir /sys/fs/resctrl/p{1..7}
>> umount /sys/fs/resctrl/
>> mount -t resctrl resctrl /sys/fs/resctrl
>> mkdir /sys/fs/resctrl/p{1..8}
>
> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
> You can't call apply_config() to set CPUs in the mask without that being set.
>
> Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
> should be left in its reset state from the last umount(), or setup.

There is an attempt to configure the hardware in the mkdir path:
rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()

This is required in support of the different resource group modes. When a new
resource group is created via mkdir the configuration should respect any
exclusive resource groups that exist at that point.

>
> I can't reproduce this on v6.0-rc7.
> Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.
>

From what I can tell the reproducer is dependent on (a) whether hardware
supports CDP, and (b) the number of CLOSIDs supported by the system. The reproducer
works for hardware that has 16 CLOSIDs (see later).

> (I have mba enabled, but all this should affect is the number of closid available)
>
>
>> dmesg will generate the following error:
>
> Which kernel version is this?
>
>> [ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
>> 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
>
> Is 0x7ff the default CBM bitmap for this CPU? Or was it written in a step missing from the
> reproducer above?

The value of interest here is the register it tries to write to ... 0xca0.
On a system with 16 CLOSIDs the range of registers available to set the CBM
would be 0xc90 to 0xc9f that corresponds to CLOSID 0 to CLOSID 15. The error is
an attempt to write to an unsupported register - there appears to have been an
attempt to configure non-existent CLOSID 16.

As Shawn already root-caused, this is because the staged_config contains data from
the previous run when CDP was enabled and it is never cleared before the resource group
creation flow (triggered by mkdir).

CDP enabled:
mkdir <resource group>
domainX.staged_config[CDP_NONE].have_new_ctrl = false
domainX.staged_config[CDP_NONE].new_ctrl = 0
domainX.staged_config[CDP_CODE].have_new_ctrl = true
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
domainX.staged_config[CDP_DATA].have_new_ctrl = true
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff

unmount/remount resctrl (CDP disabled):
mkdir <resource group>
domainX.staged_config[CDP_NONE].have_new_ctrl = true
domainX.staged_config[CDP_NONE].new_ctrl = 0x7ff
domainX.staged_config[CDP_CODE].have_new_ctrl = true /* stale */
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */
domainX.staged_config[CDP_DATA].have_new_ctrl = true /* stale */
domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */

Above becomes an issue when the resource group being created is
for a CLOSID # that is more than half of the CLOSIDs supported.
In the reproducer the issue was encountered when creating resource
group for CLOSID 8 on a system that supports 16 CLOSIDs.

In this case get_config_index() called from
resctrl_arch_update_domains() will return 16 and 17 when processing
this resource group and that translated to an invalid register - 0xca0 in this
scenario.

> The rest of this splat isn't helpful as its the result of an IPI...
>
>> [ 6180.951983] Call Trace:
>> [ 6180.954516] <IRQ>
>> [ 6180.956619] __flush_smp_call_function_queue+0x11d/0x170
>> [ 6180.962028] __sysvec_call_function+0x24/0xd0
>> [ 6180.966485] sysvec_call_function+0x89/0xc0
>> [ 6180.970760] </IRQ>
>> [ 6180.972947] <TASK>
>> [ 6180.975131] asm_sysvec_call_function+0x16/0x20
>> [ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
>> [ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
>> ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
>> f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
>> [ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
>> [ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
>> [ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
>> [ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
>> [ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
>> [ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
>> [ 6181.045202] ? cpuidle_enter_state+0xb2/0x400
>> [ 6181.049678] cpuidle_enter+0x24/0x40
>> [ 6181.053370] do_idle+0x1dd/0x260
>> [ 6181.056713] cpu_startup_entry+0x14/0x20
>> [ 6181.060753] rest_init+0xbb/0xc0
>> [ 6181.064097] arch_call_rest_init+0x5/0xa
>> [ 6181.068137] start_kernel+0x668/0x691
>> [ 6181.071914] secondary_startup_64_no_verify+0xe0/0xeb
>> [ 6181.077086] </TASK>
>
> It would be good to know what triggered this IPI. It may not have been
> resctrl_arch_update_domains(). This pattern also happens from reset_all_ctrls() which

I believe this is indeed from resctrl_arch_update_domains() since it calls
smp_call_function_many() to configure all the domains.

> happens during umount(). (and that would write the default CBM bitmap)
>
> If you can reproduce this easily, could you add dump_stack() to update_config() to see if
> any path is setting have_new_ctrl. You aren't writing to the schemata file in your reproducer.
>
>
>> We fix this issue by clearing the staged configs when destroying schemata list.
>
>
>> Signed-off-by: Shawn Wang <[email protected]>
>> Suggested-by: Xin Hao <[email protected]>
>
> If we can work out why you are seeing this, it would need a Fixes tag.
>
> Otherwise I agree it makes sense to make this more robust, but it would need a different
> commit message.

What do you think about clearing the staged config within resctrl_arch_update_domains()
after the configuration is complete and there is no more need for it? That may reduce
complexity where each caller no longer need to remember to do so.
I see "staged_config" as a temporary storage and it my help to understand the code better
if it is treated as such.

>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index f276aff521e8..b4a817ae83ab 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
>> static void schemata_list_destroy(void)
>> {
>> struct resctrl_schema *s, *tmp;
>> + struct rdt_domain *dom;
>>
>> list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
>> + /*
>> + * Clear staged_config on each domain before schemata list is
>> + * destroyed.
>> + */
>> + list_for_each_entry(dom, &s->res->domains, list)
>> + memset(dom->staged_config, 0, sizeof(dom->staged_config));
>> list_del(&s->list);
>> kfree(s);
>> }
>

Reinette

2022-09-28 14:44:13

by James Morse

[permalink] [raw]

Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Hi Reinette,

On 27/09/2022 22:21, Reinette Chatre wrote:
> On 9/27/2022 6:06 AM, James Morse wrote:
>> On 27/09/2022 03:54, Shawn Wang wrote:

>>> The problem can be reproduced by the following commands:
>>> # A system with 16 usable closids and mba disabled
>>> mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>>> mkdir /sys/fs/resctrl/p{1..7}
>>> umount /sys/fs/resctrl/
>>> mount -t resctrl resctrl /sys/fs/resctrl
>>> mkdir /sys/fs/resctrl/p{1..8}
>>
>> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
>> You can't call apply_config() to set CPUs in the mask without that being set.
>>
>> Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
>> should be left in its reset state from the last umount(), or setup.
>
> There is an attempt to configure the hardware in the mkdir path:
> rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()

Aha! I'm not sure why my grepping around didn't find this.

This is a path that doesn't memset() the staged config first, so that explains it.

[..]

> What do you think about clearing the staged config within resctrl_arch_update_domains()
> after the configuration is complete and there is no more need for it? That may reduce
> complexity where each caller no longer need to remember to do so.
> I see "staged_config" as a temporary storage and it my help to understand the code better
> if it is treated as such.

Yup, that would it with the idea of the value being consumed by
resctrl_arch_update_domains(), which is how I think of it.

Thanks,

James

2022-09-28 14:52:45

by Shawn Wang

[permalink] [raw]

Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Hi Reinette and James,

Sorry for some mistakes in the last reply email, so I send it again.

On 9/28/22 5:21 AM, Reinette Chatre wrote:
> Hi James and Shawn,
>
> On 9/27/2022 6:06 AM, James Morse wrote:
>> On 27/09/2022 03:54, Shawn Wang wrote:
>>> Array staged_config in struct rdt_domain still maintains the original value when
>>> resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
>>> without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
>>> staged_config[CDP_DATA] will still be true.
>>
>> staged_config[CDP_DATA] is an array - its always 'true'. I think you mean
>> staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
>> memset() when the schemata file is written to.
>>
>>
>>> Since resctrl_arch_update_domains()
>>> traverses all resctrl_conf_type, it will continue to update CDP_CODE and
>>> CDP_DATA configurations, which can cause overflow problem.
>>
>> Only if its called with a stale staged config, and it should only be called when the
>> schemata file is written to, which would memset() the staged config first.
>>
>>
>>> The problem can be reproduced by the following commands:
>>> # A system with 16 usable closids and mba disabled
>>> mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>>> mkdir /sys/fs/resctrl/p{1..7}
>>> umount /sys/fs/resctrl/
>>> mount -t resctrl resctrl /sys/fs/resctrl
>>> mkdir /sys/fs/resctrl/p{1..8}
>>
>> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
>> You can't call apply_config() to set CPUs in the mask without that being set.
>>
>> Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
>> should be left in its reset state from the last umount(), or setup.
>
> There is an attempt to configure the hardware in the mkdir path:
> rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()
>
> This is required in support of the different resource group modes. When a new
> resource group is created via mkdir the configuration should respect any
> exclusive resource groups that exist at that point.
>
>>
>> I can't reproduce this on v6.0-rc7.
>> Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.
>>
>
> From what I can tell the reproducer is dependent on (a) whether hardware
> supports CDP, and (b) the number of CLOSIDs supported by the system. The reproducer
> works for hardware that has 16 CLOSIDs (see later).
>
>> (I have mba enabled, but all this should affect is the number of closid available)
>>

I reproduce this on v6.0-rc6. The key to reproduction is to ensure that
the number of usable groups is different between CDP enabled and CDP
disabled.

The system I use has 16 CLOSIDs for L3 CAT and 8 CLOSIDs for MBA. MBA
limits the max number of groups to 8, even if CDP is disabled. This is
the reason why I disable MBA.

>>
>>> dmesg will generate the following error:
>>
>> Which kernel version is this?
>>
>>> [ 6180.939345] unchecked MSR access error: WRMSR to 0xca0 (tried to write
>>> 0x00000000000007ff) at rIP: 0xffffffff82249142 (cat_wrmsr+0x32/0x60)
>>
>> Is 0x7ff the default CBM bitmap for this CPU? Or was it written in a step missing from the
>> reproducer above?
>
> The value of interest here is the register it tries to write to ... 0xca0.
> On a system with 16 CLOSIDs the range of registers available to set the CBM
> would be 0xc90 to 0xc9f that corresponds to CLOSID 0 to CLOSID 15. The error is
> an attempt to write to an unsupported register - there appears to have been an
> attempt to configure non-existent CLOSID 16.
>
> As Shawn already root-caused, this is because the staged_config contains data from
> the previous run when CDP was enabled and it is never cleared before the resource group
> creation flow (triggered by mkdir).
>
>
> CDP enabled:
> mkdir <resource group>
> domainX.staged_config[CDP_NONE].have_new_ctrl = false
> domainX.staged_config[CDP_NONE].new_ctrl = 0
> domainX.staged_config[CDP_CODE].have_new_ctrl = true
> domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
> domainX.staged_config[CDP_DATA].have_new_ctrl = true
> domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff
>
> unmount/remount resctrl (CDP disabled):
> mkdir <resource group>
> domainX.staged_config[CDP_NONE].have_new_ctrl = true
> domainX.staged_config[CDP_NONE].new_ctrl = 0x7ff
> domainX.staged_config[CDP_CODE].have_new_ctrl = true /* stale */
> domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */
> domainX.staged_config[CDP_DATA].have_new_ctrl = true /* stale */
> domainX.staged_config[CDP_CODE].new_ctrl = 0x7ff /* stale */
>
> Above becomes an issue when the resource group being created is
> for a CLOSID # that is more than half of the CLOSIDs supported.
> In the reproducer the issue was encountered when creating resource
> group for CLOSID 8 on a system that supports 16 CLOSIDs.
>
> In this case get_config_index() called from
> resctrl_arch_update_domains() will return 16 and 17 when processing
> this resource group and that translated to an invalid register - 0xca0 in this
> scenario.
>

Thanks for the detailed explanation. That's exactly what I mean.

>
>> The rest of this splat isn't helpful as its the result of an IPI...
>>
>>> [ 6180.951983] Call Trace:
>>> [ 6180.954516] <IRQ>
>>> [ 6180.956619] __flush_smp_call_function_queue+0x11d/0x170
>>> [ 6180.962028] __sysvec_call_function+0x24/0xd0
>>> [ 6180.966485] sysvec_call_function+0x89/0xc0
>>> [ 6180.970760] </IRQ>
>>> [ 6180.972947] <TASK>
>>> [ 6180.975131] asm_sysvec_call_function+0x16/0x20
>>> [ 6180.979757] RIP: 0010:cpuidle_enter_state+0xcd/0x400
>>> [ 6180.984821] Code: 49 89 c5 0f 1f 44 00 00 31 ff e8 1e e5 77 ff 45 84
>>> ff 74 12 9c 58 f6 c4 02 0f 85 13 03 00 00 31 ff e8 67 70 7d ff fb 45 85
>>> f6 <0f> 88 75 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d 14 90 49
>>> [ 6181.003710] RSP: 0018:ffffffff83a03e48 EFLAGS: 00000202
>>> [ 6181.009028] RAX: ffff943400800000 RBX: 0000000000000001 RCX: 000000000000001f
>>> [ 6181.016261] RDX: 0000000000000000 RSI: ffffffff83795059 RDI: ffffffff837c101e
>>> [ 6181.023490] RBP: ffff9434c9352000 R08: 0000059f1cb1a05e R09: 0000000000000008
>>> [ 6181.030717] R10: 0000000000000001 R11: 0000000000005c66 R12: ffffffff83bbf3a0
>>> [ 6181.037944] R13: 0000059f1cb1a05e R14: 0000000000000001 R15: 0000000000000000
>>> [ 6181.045202] ? cpuidle_enter_state+0xb2/0x400
>>> [ 6181.049678] cpuidle_enter+0x24/0x40
>>> [ 6181.053370] do_idle+0x1dd/0x260
>>> [ 6181.056713] cpu_startup_entry+0x14/0x20
>>> [ 6181.060753] rest_init+0xbb/0xc0
>>> [ 6181.064097] arch_call_rest_init+0x5/0xa
>>> [ 6181.068137] start_kernel+0x668/0x691
>>> [ 6181.071914] secondary_startup_64_no_verify+0xe0/0xeb
>>> [ 6181.077086] </TASK>
>>
>> It would be good to know what triggered this IPI. It may not have been
>> resctrl_arch_update_domains(). This pattern also happens from reset_all_ctrls() which
>
> I believe this is indeed from resctrl_arch_update_domains() since it calls
> smp_call_function_many() to configure all the domains.
>
>> happens during umount(). (and that would write the default CBM bitmap)
>>
>> If you can reproduce this easily, could you add dump_stack() to update_config() to see if
>> any path is setting have_new_ctrl. You aren't writing to the schemata file in your reproducer.
>>
>>
>>> We fix this issue by clearing the staged configs when destroying schemata list.
>>
>>
>>> Signed-off-by: Shawn Wang <[email protected]>
>>> Suggested-by: Xin Hao <[email protected]>
>>
>> If we can work out why you are seeing this, it would need a Fixes tag.
>>
>> Otherwise I agree it makes sense to make this more robust, but it would need a different
>> commit message.
>
> What do you think about clearing the staged config within resctrl_arch_update_domains()
> after the configuration is complete and there is no more need for it? That may reduce
> complexity where each caller no longer need to remember to do so.
> I see "staged_config" as a temporary storage and it my help to understand the code better
> if it is treated as such.
>

I think this is better. I have tested it and will give a new version.

>>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> index f276aff521e8..b4a817ae83ab 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>>> @@ -2127,8 +2127,15 @@ static int schemata_list_create(void)
>>> static void schemata_list_destroy(void)
>>> {
>>> struct resctrl_schema *s, *tmp;
>>> + struct rdt_domain *dom;
>>>
>>> list_for_each_entry_safe(s, tmp, &resctrl_schema_all, list) {
>>> + /*
>>> + * Clear staged_config on each domain before schemata list is
>>> + * destroyed.
>>> + */
>>> + list_for_each_entry(dom, &s->res->domains, list)
>>> + memset(dom->staged_config, 0, sizeof(dom->staged_config));
>>> list_del(&s->list);
>>> kfree(s);
>>> }
>>
>
> Reinette

Thanks,

Shawn

2022-09-28 15:11:22

by James Morse

[permalink] [raw]

Subject: Re: [PATCH] x86/resctrl: Clear the staged configs when destroying schemata list

Hi Shawn,

On 28/09/2022 14:37, Shawn Wang wrote:
> On 9/28/22 5:21 AM, Reinette Chatre wrote:
>> On 9/27/2022 6:06 AM, James Morse wrote:
>>> On 27/09/2022 03:54, Shawn Wang wrote:
>>>> Array staged_config in struct rdt_domain still maintains the original value when
>>>> resctrl is unmounted. If resctrl is mounted with cdp option and then remounted
>>>> without cdp option, field have_new_ctrl in staged_config[CDP_CODE] and
>>>> staged_config[CDP_DATA] will still be true.
>>>
>>> staged_config[CDP_DATA] is an array - its always 'true'. I think you mean
>>> staged_config[CDP_DATA].have_new_ctrl, which will still be true because it is only
>>> memset() when the schemata file is written to.
>>>
>>>
>>>> Since resctrl_arch_update_domains()
>>>> traverses all resctrl_conf_type, it will continue to update CDP_CODE and
>>>> CDP_DATA configurations, which can cause overflow problem.
>>>
>>> Only if its called with a stale staged config, and it should only be called when the
>>> schemata file is written to, which would memset() the staged config first.
>>>
>>>
>>>> The problem can be reproduced by the following commands:
>>>>      # A system with 16 usable closids and mba disabled
>>>>      mount -t resctrl resctrl -o cdp /sys/fs/resctrl
>>>>      mkdir /sys/fs/resctrl/p{1..7}
>>>>      umount /sys/fs/resctrl/
>>>>      mount -t resctrl resctrl /sys/fs/resctrl
>>>>      mkdir /sys/fs/resctrl/p{1..8}
>>>
>>> Thanks for the reproducer - but I don't see what could set have_new_ctrl in this sequence.
>>> You can't call apply_config() to set CPUs in the mask without that being set.
>>>
>>> Creating a new control group, (your mkdir step) shouldn't touch the hardware at all, as it
>>> should be left in its reset state from the last umount(), or setup.
>>
>> There is an attempt to configure the hardware in the mkdir path:
>> rdtgroup_mkdir()->rdtgroup_mkdir_ctrl_mon()->rdtgroup_init_alloc()->resctrl_arch_update_domains()
>>
>>
>> This is required in support of the different resource group modes. When a new
>> resource group is created via mkdir the configuration should respect any
>> exclusive resource groups that exist at that point.
>>
>>>
>>> I can't reproduce this on v6.0-rc7.
>>> Even if I dirty the configuration by writing to the schemata file, I can't reproduce this.
>>>
>>
>> From what I can tell the reproducer is dependent on (a) whether hardware
>> supports CDP, and (b) the number of CLOSIDs supported by the system. The reproducer
>> works for hardware that has 16 CLOSIDs (see later).
>>
>>> (I have mba enabled, but all this should affect is the number of closid available)

> I reproduce this on v6.0-rc6. The key to reproduction is to ensure that the number of
> usable groups is different between CDP enabled and CDP disabled.
>
> The system I use has 16 CLOSIDs for L3 CAT and 8 CLOSIDs for MBA. MBA limits the max
> number of groups to 8, even if CDP is disabled. This is the reason why I disable MBA.

bingo - could that detail be included in the commit message?

[..]

>> What do you think about clearing the staged config within resctrl_arch_update_domains()
>> after the configuration is complete and there is no more need for it? That may reduce
>> complexity where each caller no longer need to remember to do so.
>> I see "staged_config" as a temporary storage and it my help to understand the code better
>> if it is treated as such.
>>
>
> I think this is better. I have tested it and will give a new version.

Great, thanks!

Could you mention have_new_ctrl in the commit message, and this path via
rdtgroup_init_alloc().

I think you also need:
Fixes: 75408e43509ed ("x86/resctrl: Allow different CODE/DATA configurations to be staged")
Cc: <[email protected]>

Thanks,

James