2014-07-29 11:46:14

by Prarit Bhargava

[permalink] [raw]
Subject: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

[v2]: Fixed wording of commit, and added additional testing information.

A while ago we added a test to mimic some of our users' userspace governor
programs which monitor system behaviour and will switch governors on the
fly. The decision process for this includes looking at time of day,
expected system load, etc. For some time now we have had reports of
system panics in the cpufreq code when using the userspace governor
utility.

The userspace utility writes
/sys/devices/system/cpu/cpuX/cpufreq/scaling_governor and sets the
governor. In some cases this can happen rapidly, and under heavy load
there are occasions where the changes are delayed. This can mean that
several governor changes may occur within a short period of time.

This has exposed a bug in the store_scaling_governor path. When the sysfs
file is written to,

store()
->down_write(&policy->rwsem);
->store_scaling_governor()
-> cpufreq_set_policy()
up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
down_write(&policy->rwsem);

The release of the policy->rwsem results in a situation where another
write to the scaling_governor file can now start and overwrite pointers
and cause corruption.

ex)

Oops: 0000 [#1] SMP
Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
CPU: 7 PID: 565 Comm: kworker/7:2 Not tainted 3.10.0-121.el7.x86_64 #1
Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
Workqueue: events od_dbs_timer
task: ffff88046544cfa0 ti: ffff880465438000 task.ti: ffff880465438000
RIP: 0010:[<ffffffff81489bed>] [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
RSP: 0018:ffff880465439de0 EFLAGS: 00010282
RAX: ffff88007f03d500 RBX: 0000000000010de0 RCX: 0000000000000006
RDX: 0000000000000006 RSI: 7040000000000000 RDI: ffff88046f2f0e08
e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
IPv6: ADDRCONF(NETDEV_UP): enp6s0f0: link is not ready
RBP: ffff880465439e18 R08: ffff88046f2f0e10 R09: dfedcb48370f0e08
R10: dfedcb48370f0e08 R11: 0000000000000000 R12: ffff880465a69200
R13: 0000000000000000 R14: ffff88046f2f0e08 R15: 00000000000001c0
e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
FS: 0000000000000000(0000) GS:ffff88046f2e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 0000000466be5000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88045ef8d140 ffff880468e1ad80 ffff88046f2f0e08 ffff880465a69200
ffff88046f2f3e00 ffff88046f2f7f00 00000000000001c0 ffff880465439e60
ffffffff8107e02b 000000006f2f3e18 0000000000000000 ffff88046f2f3e18
Call Trace:
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
IPv6: ADDRCONF(NETDEV_UP): enp6s0f1: link is not ready
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
Code: 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 8b 50 14 4c 8b 68 40 89 d1 <4d> 8b 7d 10 89 55 cc 48 03 1c cd 80 78 9e 81 48 8d 83 a8 00 00
RIP [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
RSP <ffff880465439de0>
CR2: 0000000000000010
BUG: unable to handle kernel ---[ end trace c9a1ca82e01a4294 ]---
Kernel panic - not syncing: Fatal exception
NULL pointer dereference at 0000000000000014
IP: [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
PGD 46661d067 PUD 465540067 PMD 0
Oops: 0000 [#2] SMP
Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
CPU: 6 PID: 506 Comm: kworker/6:2 Tainted: G D -------------- 3.10.0-121.el7.x86_64 #1
Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
Workqueue: events od_dbs_timer
task: ffff880465448b60 ti: ffff880463730000 task.ti: ffff880463730000
RIP: 0010:[<ffffffff81489be4>] [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
RSP: 0018:ffff880463731de0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000010de0 RCX: dead000000200200
RDX: 0000000000000001 RSI: 7040000000000000 RDI: ffff88046f2d0e08
RBP: ffff880463731e18 R08: ffff88046f2d0e10 R09: dfedcb50370d0e08
R10: dfedcb50370d0e08 R11: 0000000000000000 R12: ffff880465bd2b80
R13: ffff88046f2d3e00 R14: ffff88046f2d0e08 R15: 0000000000000180
FS: 0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000014 CR3: 0000000466be5000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffff88045ef8d140 ffff880468dcf1c0 ffff88046f2d0e08 ffff880465bd2b80
ffff88046f2d3e00 ffff88046f2d7f00 0000000000000180 ffff880463731e60
ffffffff8107e02b 000000006f2d3e18 0000000000000000 ffff88046f2d3e18
Call Trace:
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
Code: 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 <8b> 50 14 4c 8b 68 40 89 d1 4d 8b 7d 10 89 55 cc 48 03 1c cd 80
RIP [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
RSP <ffff880463731de0>
CR2: 0000000000000014
drm_kms_helper: panic occurred, switching back to text console

There are other panics associated with this, including list corruption and
various flavors of panics in the od_dbs_timer & work code.

As a test I introduced a separate mutex around the cpufreq_set_policy() call.
After doing so system behaves as expected, and the panics no longer happen.
I narrowed the mutex down to the "governor switch" code in
cpufreq_set_policy() and saw that everything stabilized.

I then reverted commit 955ef4833574636819cd269cfbae12f79cbde63a, cpufreq: Drop
rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which added the dropping of the
policy->rwsem and again, the panics no longer occurred.

The closer I looked at commit 955ef483, the more questions I have. The first
thing is that it appears that the stacktrace includes function calls that
are not, and never have been, part of the linux.git tree, ie) the call trace
shows

[<c0055a79>] lock_acquire+0x61/0xbc
[<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
[<c00f9819>] sysfs_hash_and_remove+0x35/0x64
[<c00fbe6f>] remove_files.isra.0+0x1b/0x24
[<c00fbea5>] sysfs_remove_group+0x2d/0xa8
[<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
[<c02f61df>] __cpufreq_governor+0x2b/0x8c
[<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
[<c02f6b75>] store_scaling_governor+0x61/0x100
[<c02f6f4d>] store+0x39/0x60
[<c00f9b81>] sysfs_write_file+0xed/0x114
[<c00b3fd1>] vfs_write+0x65/0xd8
[<c00b424b>] sys_write+0x2f/0x50
[<c000cdc1>] ret_fast_syscall+0x1/0x52

and the top part of the stack from cpufreq_governor_interactive() appear to
be for a driver that is not in the linux.git tree. Google search does show
that it exists in various android trees, however, my concern is that the "core"
scaling governors are now broken because of an out tree driver.

It seems like the patch fixed this problem incorrectly by breaking the existing
locking. The whole point of the policy->rwsem is to single thread updates
to the policy struct; this is no longer the case after that commit as we
are breaking the locking right in the middle of a policy->governor update.
The fix should have been to the cpufreq_governor_interactive out-of-tree
driver and fix the locking there in its release code. In any case, the current
linux.git code no longer can reproduce the original failure; the locking in the
sysfs release code has changed.

In case anyone is interested in reproducing, the reproducer I used was

i=0
while [ True ]; do
i=$((i+1))
echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
echo "ondemand" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
if [ $((i % 100)) = 0 ]; then
num_ps=$(ps aux | grep doit | wc -l)
echo "$i running, $num_ps outstanding"
fi
done

and the system usually panics within 500 iterations.

I have tested this extensively across 15 systems of various configurations.
In the unpatched kernel I see oopses usually within 500 iterations. In the
patched kernel which fixes the broken locking, I can go several hours
without any issues.

This patch effectively reverts commit 955ef483.

Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Lenny Szubowicz <[email protected]>
Cc: [email protected]
Signed-off-by: Prarit Bhargava <[email protected]>
---
drivers/cpufreq/cpufreq.c | 8 +-------
include/linux/cpufreq.h | 4 ----
2 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 6f02485..4869f0e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2198,12 +2198,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
/* save old, working values */
old_gov = policy->governor;
/* end old governor */
- if (old_gov) {
- __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
- up_write(&policy->rwsem);
+ if (old_gov)
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
- }

/* start new governor */
policy->governor = new_policy->governor;
@@ -2211,9 +2207,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
goto out;

- up_write(&policy->rwsem);
__cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
- down_write(&policy->rwsem);
}

/* new governor failed, so re-start old one */
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 8f8ae95..b450c43 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -100,10 +100,6 @@ struct cpufreq_policy {
* - Any routine that will write to the policy structure and/or may take away
* the policy altogether (eg. CPU hotplug), will hold this lock in write
* mode before doing so.
- *
- * Additional rules:
- * - Lock should not be held across
- * __cpufreq_governor(data, CPUFREQ_GOV_POLICY_EXIT);
*/
struct rw_semaphore rwsem;

--
1.7.9.3


2014-07-29 23:45:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
> [v2]: Fixed wording of commit, and added additional testing information.
>
> A while ago we added a test to mimic some of our users' userspace governor
> programs which monitor system behaviour and will switch governors on the
> fly. The decision process for this includes looking at time of day,
> expected system load, etc. For some time now we have had reports of
> system panics in the cpufreq code when using the userspace governor
> utility.
>
> The userspace utility writes
> /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor and sets the
> governor. In some cases this can happen rapidly, and under heavy load
> there are occasions where the changes are delayed. This can mean that
> several governor changes may occur within a short period of time.
>
> This has exposed a bug in the store_scaling_governor path. When the sysfs
> file is written to,
>
> store()
> ->down_write(&policy->rwsem);
> ->store_scaling_governor()
> -> cpufreq_set_policy()
> up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> down_write(&policy->rwsem);
>
> The release of the policy->rwsem results in a situation where another
> write to the scaling_governor file can now start and overwrite pointers
> and cause corruption.
>
> ex)
>
> Oops: 0000 [#1] SMP
> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
> CPU: 7 PID: 565 Comm: kworker/7:2 Not tainted 3.10.0-121.el7.x86_64 #1
> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
> Workqueue: events od_dbs_timer
> task: ffff88046544cfa0 ti: ffff880465438000 task.ti: ffff880465438000
> RIP: 0010:[<ffffffff81489bed>] [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
> RSP: 0018:ffff880465439de0 EFLAGS: 00010282
> RAX: ffff88007f03d500 RBX: 0000000000010de0 RCX: 0000000000000006
> RDX: 0000000000000006 RSI: 7040000000000000 RDI: ffff88046f2f0e08
> e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
> IPv6: ADDRCONF(NETDEV_UP): enp6s0f0: link is not ready
> RBP: ffff880465439e18 R08: ffff88046f2f0e10 R09: dfedcb48370f0e08
> R10: dfedcb48370f0e08 R11: 0000000000000000 R12: ffff880465a69200
> R13: 0000000000000000 R14: ffff88046f2f0e08 R15: 00000000000001c0
> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
> FS: 0000000000000000(0000) GS:ffff88046f2e0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000010 CR3: 0000000466be5000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
> ffff88045ef8d140 ffff880468e1ad80 ffff88046f2f0e08 ffff880465a69200
> ffff88046f2f3e00 ffff88046f2f7f00 00000000000001c0 ffff880465439e60
> ffffffff8107e02b 000000006f2f3e18 0000000000000000 ffff88046f2f3e18
> Call Trace:
> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
> [<ffffffff81085aef>] kthread+0xcf/0xe0
> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
> IPv6: ADDRCONF(NETDEV_UP): enp6s0f1: link is not ready
> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
> Code: 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 8b 50 14 4c 8b 68 40 89 d1 <4d> 8b 7d 10 89 55 cc 48 03 1c cd 80 78 9e 81 48 8d 83 a8 00 00
> RIP [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
> RSP <ffff880465439de0>
> CR2: 0000000000000010
> BUG: unable to handle kernel ---[ end trace c9a1ca82e01a4294 ]---
> Kernel panic - not syncing: Fatal exception
> NULL pointer dereference at 0000000000000014
> IP: [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
> PGD 46661d067 PUD 465540067 PMD 0
> Oops: 0000 [#2] SMP
> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
> CPU: 6 PID: 506 Comm: kworker/6:2 Tainted: G D -------------- 3.10.0-121.el7.x86_64 #1
> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
> Workqueue: events od_dbs_timer
> task: ffff880465448b60 ti: ffff880463730000 task.ti: ffff880463730000
> RIP: 0010:[<ffffffff81489be4>] [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
> RSP: 0018:ffff880463731de0 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 0000000000010de0 RCX: dead000000200200
> RDX: 0000000000000001 RSI: 7040000000000000 RDI: ffff88046f2d0e08
> RBP: ffff880463731e18 R08: ffff88046f2d0e10 R09: dfedcb50370d0e08
> R10: dfedcb50370d0e08 R11: 0000000000000000 R12: ffff880465bd2b80
> R13: ffff88046f2d3e00 R14: ffff88046f2d0e08 R15: 0000000000000180
> FS: 0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000014 CR3: 0000000466be5000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
> ffff88045ef8d140 ffff880468dcf1c0 ffff88046f2d0e08 ffff880465bd2b80
> ffff88046f2d3e00 ffff88046f2d7f00 0000000000000180 ffff880463731e60
> ffffffff8107e02b 000000006f2d3e18 0000000000000000 ffff88046f2d3e18
> Call Trace:
> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
> [<ffffffff81085aef>] kthread+0xcf/0xe0
> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
> Code: 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 <8b> 50 14 4c 8b 68 40 89 d1 4d 8b 7d 10 89 55 cc 48 03 1c cd 80
> RIP [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
> RSP <ffff880463731de0>
> CR2: 0000000000000014
> drm_kms_helper: panic occurred, switching back to text console
>
> There are other panics associated with this, including list corruption and
> various flavors of panics in the od_dbs_timer & work code.
>
> As a test I introduced a separate mutex around the cpufreq_set_policy() call.
> After doing so system behaves as expected, and the panics no longer happen.
> I narrowed the mutex down to the "governor switch" code in
> cpufreq_set_policy() and saw that everything stabilized.
>
> I then reverted commit 955ef4833574636819cd269cfbae12f79cbde63a, cpufreq: Drop
> rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which added the dropping of the
> policy->rwsem and again, the panics no longer occurred.
>
> The closer I looked at commit 955ef483, the more questions I have. The first
> thing is that it appears that the stacktrace includes function calls that
> are not, and never have been, part of the linux.git tree, ie) the call trace
> shows
>
> [<c0055a79>] lock_acquire+0x61/0xbc
> [<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
> [<c00f9819>] sysfs_hash_and_remove+0x35/0x64
> [<c00fbe6f>] remove_files.isra.0+0x1b/0x24
> [<c00fbea5>] sysfs_remove_group+0x2d/0xa8
> [<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
> [<c02f61df>] __cpufreq_governor+0x2b/0x8c
> [<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
> [<c02f6b75>] store_scaling_governor+0x61/0x100
> [<c02f6f4d>] store+0x39/0x60
> [<c00f9b81>] sysfs_write_file+0xed/0x114
> [<c00b3fd1>] vfs_write+0x65/0xd8
> [<c00b424b>] sys_write+0x2f/0x50
> [<c000cdc1>] ret_fast_syscall+0x1/0x52
>
> and the top part of the stack from cpufreq_governor_interactive() appear to
> be for a driver that is not in the linux.git tree. Google search does show
> that it exists in various android trees, however, my concern is that the "core"
> scaling governors are now broken because of an out tree driver.
>
> It seems like the patch fixed this problem incorrectly by breaking the existing
> locking. The whole point of the policy->rwsem is to single thread updates
> to the policy struct; this is no longer the case after that commit as we
> are breaking the locking right in the middle of a policy->governor update.
> The fix should have been to the cpufreq_governor_interactive out-of-tree
> driver and fix the locking there in its release code. In any case, the current
> linux.git code no longer can reproduce the original failure; the locking in the
> sysfs release code has changed.
>
> In case anyone is interested in reproducing, the reproducer I used was
>
> i=0
> while [ True ]; do
> i=$((i+1))
> echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
> echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
> echo "ondemand" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
> echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
> if [ $((i % 100)) = 0 ]; then
> num_ps=$(ps aux | grep doit | wc -l)
> echo "$i running, $num_ps outstanding"
> fi
> done
>
> and the system usually panics within 500 iterations.
>
> I have tested this extensively across 15 systems of various configurations.
> In the unpatched kernel I see oopses usually within 500 iterations. In the
> patched kernel which fixes the broken locking, I can go several hours
> without any issues.
>
> This patch effectively reverts commit 955ef483.

OK, I'm convinced by this.

I suppose we should push it for -stable from 3.10 through 3.15.x, right?

> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Lenny Szubowicz <[email protected]>
> Cc: [email protected]
> Signed-off-by: Prarit Bhargava <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 8 +-------
> include/linux/cpufreq.h | 4 ----
> 2 files changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index 6f02485..4869f0e 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -2198,12 +2198,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> /* save old, working values */
> old_gov = policy->governor;
> /* end old governor */
> - if (old_gov) {
> - __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
> - up_write(&policy->rwsem);
> + if (old_gov)
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> - }
>
> /* start new governor */
> policy->governor = new_policy->governor;
> @@ -2211,9 +2207,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> if (!__cpufreq_governor(policy, CPUFREQ_GOV_START))
> goto out;
>
> - up_write(&policy->rwsem);
> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
> - down_write(&policy->rwsem);
> }
>
> /* new governor failed, so re-start old one */
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 8f8ae95..b450c43 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -100,10 +100,6 @@ struct cpufreq_policy {
> * - Any routine that will write to the policy structure and/or may take away
> * the policy altogether (eg. CPU hotplug), will hold this lock in write
> * mode before doing so.
> - *
> - * Additional rules:
> - * - Lock should not be held across
> - * __cpufreq_governor(data, CPUFREQ_GOV_POLICY_EXIT);
> */
> struct rw_semaphore rwsem;
>
>

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-07-30 14:18:33

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>> [v2]: Fixed wording of commit, and added additional testing information.
>>
>> A while ago we added a test to mimic some of our users' userspace governor
>> programs which monitor system behaviour and will switch governors on the
>> fly. The decision process for this includes looking at time of day,
>> expected system load, etc. For some time now we have had reports of
>> system panics in the cpufreq code when using the userspace governor
>> utility.
>>
>> The userspace utility writes
>> /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor and sets the
>> governor. In some cases this can happen rapidly, and under heavy load
>> there are occasions where the changes are delayed. This can mean that
>> several governor changes may occur within a short period of time.
>>
>> This has exposed a bug in the store_scaling_governor path. When the sysfs
>> file is written to,
>>
>> store()
>> ->down_write(&policy->rwsem);
>> ->store_scaling_governor()
>> -> cpufreq_set_policy()
>> up_write(&policy->rwsem);
>> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
>> down_write(&policy->rwsem);
>>
>> The release of the policy->rwsem results in a situation where another
>> write to the scaling_governor file can now start and overwrite pointers
>> and cause corruption.
>>
>> ex)
>>
>> Oops: 0000 [#1] SMP
>> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
>> CPU: 7 PID: 565 Comm: kworker/7:2 Not tainted 3.10.0-121.el7.x86_64 #1
>> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
>> Workqueue: events od_dbs_timer
>> task: ffff88046544cfa0 ti: ffff880465438000 task.ti: ffff880465438000
>> RIP: 0010:[<ffffffff81489bed>] [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
>> RSP: 0018:ffff880465439de0 EFLAGS: 00010282
>> RAX: ffff88007f03d500 RBX: 0000000000010de0 RCX: 0000000000000006
>> RDX: 0000000000000006 RSI: 7040000000000000 RDI: ffff88046f2f0e08
>> e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
>> IPv6: ADDRCONF(NETDEV_UP): enp6s0f0: link is not ready
>> RBP: ffff880465439e18 R08: ffff88046f2f0e10 R09: dfedcb48370f0e08
>> R10: dfedcb48370f0e08 R11: 0000000000000000 R12: ffff880465a69200
>> R13: 0000000000000000 R14: ffff88046f2f0e08 R15: 00000000000001c0
>> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
>> FS: 0000000000000000(0000) GS:ffff88046f2e0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000010 CR3: 0000000466be5000 CR4: 00000000000007e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Stack:
>> ffff88045ef8d140 ffff880468e1ad80 ffff88046f2f0e08 ffff880465a69200
>> ffff88046f2f3e00 ffff88046f2f7f00 00000000000001c0 ffff880465439e60
>> ffffffff8107e02b 000000006f2f3e18 0000000000000000 ffff88046f2f3e18
>> Call Trace:
>> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
>> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
>> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
>> [<ffffffff81085aef>] kthread+0xcf/0xe0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
>> IPv6: ADDRCONF(NETDEV_UP): enp6s0f1: link is not ready
>> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> Code: 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 8b 50 14 4c 8b 68 40 89 d1 <4d> 8b 7d 10 89 55 cc 48 03 1c cd 80 78 9e 81 48 8d 83 a8 00 00
>> RIP [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
>> RSP <ffff880465439de0>
>> CR2: 0000000000000010
>> BUG: unable to handle kernel ---[ end trace c9a1ca82e01a4294 ]---
>> Kernel panic - not syncing: Fatal exception
>> NULL pointer dereference at 0000000000000014
>> IP: [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> PGD 46661d067 PUD 465540067 PMD 0
>> Oops: 0000 [#2] SMP
>> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
>> CPU: 6 PID: 506 Comm: kworker/6:2 Tainted: G D -------------- 3.10.0-121.el7.x86_64 #1
>> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
>> Workqueue: events od_dbs_timer
>> task: ffff880465448b60 ti: ffff880463730000 task.ti: ffff880463730000
>> RIP: 0010:[<ffffffff81489be4>] [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> RSP: 0018:ffff880463731de0 EFLAGS: 00010282
>> RAX: 0000000000000000 RBX: 0000000000010de0 RCX: dead000000200200
>> RDX: 0000000000000001 RSI: 7040000000000000 RDI: ffff88046f2d0e08
>> RBP: ffff880463731e18 R08: ffff88046f2d0e10 R09: dfedcb50370d0e08
>> R10: dfedcb50370d0e08 R11: 0000000000000000 R12: ffff880465bd2b80
>> R13: ffff88046f2d3e00 R14: ffff88046f2d0e08 R15: 0000000000000180
>> FS: 0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000014 CR3: 0000000466be5000 CR4: 00000000000007e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Stack:
>> ffff88045ef8d140 ffff880468dcf1c0 ffff88046f2d0e08 ffff880465bd2b80
>> ffff88046f2d3e00 ffff88046f2d7f00 0000000000000180 ffff880463731e60
>> ffffffff8107e02b 000000006f2d3e18 0000000000000000 ffff88046f2d3e18
>> Call Trace:
>> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
>> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
>> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
>> [<ffffffff81085aef>] kthread+0xcf/0xe0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> Code: 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 <8b> 50 14 4c 8b 68 40 89 d1 4d 8b 7d 10 89 55 cc 48 03 1c cd 80
>> RIP [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> RSP <ffff880463731de0>
>> CR2: 0000000000000014
>> drm_kms_helper: panic occurred, switching back to text console
>>
>> There are other panics associated with this, including list corruption and
>> various flavors of panics in the od_dbs_timer & work code.
>>
>> As a test I introduced a separate mutex around the cpufreq_set_policy() call.
>> After doing so system behaves as expected, and the panics no longer happen.
>> I narrowed the mutex down to the "governor switch" code in
>> cpufreq_set_policy() and saw that everything stabilized.
>>
>> I then reverted commit 955ef4833574636819cd269cfbae12f79cbde63a, cpufreq: Drop
>> rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which added the dropping of the
>> policy->rwsem and again, the panics no longer occurred.
>>
>> The closer I looked at commit 955ef483, the more questions I have. The first
>> thing is that it appears that the stacktrace includes function calls that
>> are not, and never have been, part of the linux.git tree, ie) the call trace
>> shows
>>
>> [<c0055a79>] lock_acquire+0x61/0xbc
>> [<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
>> [<c00f9819>] sysfs_hash_and_remove+0x35/0x64
>> [<c00fbe6f>] remove_files.isra.0+0x1b/0x24
>> [<c00fbea5>] sysfs_remove_group+0x2d/0xa8
>> [<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
>> [<c02f61df>] __cpufreq_governor+0x2b/0x8c
>> [<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
>> [<c02f6b75>] store_scaling_governor+0x61/0x100
>> [<c02f6f4d>] store+0x39/0x60
>> [<c00f9b81>] sysfs_write_file+0xed/0x114
>> [<c00b3fd1>] vfs_write+0x65/0xd8
>> [<c00b424b>] sys_write+0x2f/0x50
>> [<c000cdc1>] ret_fast_syscall+0x1/0x52
>>
>> and the top part of the stack from cpufreq_governor_interactive() appear to
>> be for a driver that is not in the linux.git tree. Google search does show
>> that it exists in various android trees, however, my concern is that the "core"
>> scaling governors are now broken because of an out tree driver.
>>
>> It seems like the patch fixed this problem incorrectly by breaking the existing
>> locking. The whole point of the policy->rwsem is to single thread updates
>> to the policy struct; this is no longer the case after that commit as we
>> are breaking the locking right in the middle of a policy->governor update.
>> The fix should have been to the cpufreq_governor_interactive out-of-tree
>> driver and fix the locking there in its release code. In any case, the current
>> linux.git code no longer can reproduce the original failure; the locking in the
>> sysfs release code has changed.
>>
>> In case anyone is interested in reproducing, the reproducer I used was
>>
>> i=0
>> while [ True ]; do
>> i=$((i+1))
>> echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
>> echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
>> echo "ondemand" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
>> echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
>> if [ $((i % 100)) = 0 ]; then
>> num_ps=$(ps aux | grep doit | wc -l)
>> echo "$i running, $num_ps outstanding"
>> fi
>> done
>>
>> and the system usually panics within 500 iterations.
>>
>> I have tested this extensively across 15 systems of various configurations.
>> In the unpatched kernel I see oopses usually within 500 iterations. In the
>> patched kernel which fixes the broken locking, I can go several hours
>> without any issues.
>>
>> This patch effectively reverts commit 955ef483.
>
> OK, I'm convinced by this.
>
> I suppose we should push it for -stable from 3.10 through 3.15.x, right?

Rafael, I think that is a good idea. I'm not sure what the protocol is for
adding [email protected] though ...

P.

2014-07-30 21:22:24

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>
> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> > On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:

[cut]

> >> This patch effectively reverts commit 955ef483.
> >
> > OK, I'm convinced by this.
> >
> > I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>
> Rafael, I think that is a good idea. I'm not sure what the protocol is for
> adding [email protected] though ...

I'll take care of this, thanks!

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2014-07-31 01:36:05

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>
>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>
> [cut]
>
>>>> This patch effectively reverts commit 955ef483.

The issue reported in this patch is valid. We are seeing that internally
too. I believe I reported it in another thread (within the past month).

However, the original patch fixes a real deadlock issue (I'm too tired
to look it up now). We can revet the original, but it's going to bring
back the original issue. I just want to make sure Prarit and Raphael
realize this before proceeding.

I do have plans for a proper fix for the mainline (not stable branches),
but plan to do that after the current set of suspend/hotplug patches go
through. The fix would be easier to make after that.

>>>
>>> OK, I'm convinced by this.
>>>
>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>
>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>> adding [email protected] though ...
>
> I'll take care of this, thanks!
>

But you aren't going to pull the in for the next release, right?

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-07-31 01:57:41

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
> > On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
> >>
> >> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> >>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
> >
> > [cut]
> >
> >>>> This patch effectively reverts commit 955ef483.
>
> The issue reported in this patch is valid. We are seeing that internally
> too. I believe I reported it in another thread (within the past month).
>
> However, the original patch fixes a real deadlock issue (I'm too tired
> to look it up now). We can revet the original, but it's going to bring
> back the original issue. I just want to make sure Prarit and Raphael
> realize this before proceeding.
>
> I do have plans for a proper fix for the mainline (not stable branches),
> but plan to do that after the current set of suspend/hotplug patches go
> through. The fix would be easier to make after that.
>
> >>>
> >>> OK, I'm convinced by this.
> >>>
> >>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
> >>
> >> Rafael, I think that is a good idea. I'm not sure what the protocol is for
> >> adding [email protected] though ...
> >
> > I'll take care of this, thanks!
> >
>
> But you aren't going to pull the in for the next release, right?

What do you mean?

Rafael

2014-07-31 02:07:10

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/30/2014 07:16 PM, Rafael J. Wysocki wrote:
> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>
>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>
>>> [cut]
>>>
>>>>>> This patch effectively reverts commit 955ef483.
>>
>> The issue reported in this patch is valid. We are seeing that internally
>> too. I believe I reported it in another thread (within the past month).
>>
>> However, the original patch fixes a real deadlock issue (I'm too tired
>> to look it up now). We can revet the original, but it's going to bring
>> back the original issue. I just want to make sure Prarit and Raphael
>> realize this before proceeding.
>>
>> I do have plans for a proper fix for the mainline (not stable branches),
>> but plan to do that after the current set of suspend/hotplug patches go
>> through. The fix would be easier to make after that.
>>
>>>>>
>>>>> OK, I'm convinced by this.
>>>>>
>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>
>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>>>> adding [email protected] though ...
>>>
>>> I'll take care of this, thanks!
>>>
>>
>> But you aren't going to pull the in for the next release, right?
>
> What do you mean?
>

Reverting the commit will bring back another dead lock issue. So, you
don't want to revert it on mainline. Do I still not make sense because
I'm not using the right terms?

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-07-31 10:16:37

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>
>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>
>>> [cut]
>>>
>>>>>> This patch effectively reverts commit 955ef483.
>>
>> The issue reported in this patch is valid. We are seeing that internally
>> too. I believe I reported it in another thread (within the past month).
>>
>> However, the original patch fixes a real deadlock issue (I'm too tired
>> to look it up now). We can revet the original, but it's going to bring
>> back the original issue. I just want to make sure Prarit and Raphael
>> realize this before proceeding.

Hi Saravana,

Thanks for your input. I went back to the code and confirmed my original
statement about this patch.

Note: in a previous email I erroneously wrote "buffer->mutex" when I should
have identified the lock as sysfs_mutex. Sorry 'bout that, and apologies
for any confusion that may have caused.

>From my commit message:

"In any case, the current linux.git code no longer can reproduce the original
failure; the locking in the sysfs release code has changed."

The original patch attempted to fix this deadlock:

A cpufreq driver on a file read did:

-> #0 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
[<c0055253>] __lock_acquire+0xef3/0x13dc
[<c0055a79>] lock_acquire+0x61/0xbc
[<c03ee1f5>] down_read+0x25/0x30
[<c02f6179>] lock_policy_rwsem_read+0x25/0x34
[<c02f6edd>] show+0x21/0x58
[<c00f9c0f>] sysfs_read_file+0x67/0xcc
[<c00b40a7>] vfs_read+0x63/0xd8
[<c00b41fb>] sys_read+0x2f/0x50
[<c000cdc1>] ret_fast_syscall+0x1/0x52

lock(s_active#41) [ which is actually the acquisition of sysfs_mutex ]
lock(&per_cpu(cpu_policy_rwsem, cpu));

and on the governor switch (notably the EXIT of the existing governor), the
opposite occurs

-> #1 (s_active#41){++++.+}:
[<c0055a79>] lock_acquire+0x61/0xbc
[<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
[<c00f9819>] sysfs_hash_and_remove+0x35/0x64
[<c00fbe6f>] remove_files.isra.0+0x1b/0x24
[<c00fbea5>] sysfs_remove_group+0x2d/0xa8
[<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
[<c02f61df>] __cpufreq_governor+0x2b/0x8c
[<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
[<c02f6b75>] store_scaling_governor+0x61/0x100
[<c02f6f4d>] store+0x39/0x60
[<c00f9b81>] sysfs_write_file+0xed/0x114
[<c00b3fd1>] vfs_write+0x65/0xd8
[<c00b424b>] sys_write+0x2f/0x50
[<c000cdc1>] ret_fast_syscall+0x1/0x52


lock(&per_cpu(cpu_policy_rwsem, cpu));
lock(s_active#41) [ which is actually the acquisition of sysfs_mutex ]

The sysfs_mutex no longer blocks in the sysfs path, and I have built with
LOCKDEP on and off to confirm that I do not see any tracebacks or hangs. I
tested this by doing a few reads of the current governor, and then doing a
governor switch (to at least initiate the LOCKDEP warning). IIUC the traceback
above that is the way to reproduce this LOCKDEP warning.

IOW ... this deadlock situation no longer exists in the code (at least AFAICT).

Secondly, restoring the locking (and keeping in mind the semantic of the
read-write semaphore, specifically that all reads must be done before the write
lock is acquired by a thread) modifies the code such that the above situation (a
read and a write) can no longer occur in this code.

P.

2014-07-31 10:21:41

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 06:16 AM, Prarit Bhargava wrote:
>
>
> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>>
>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>>
>>>> [cut]
>>>>
>>>>>>> This patch effectively reverts commit 955ef483.
>>>
>>> The issue reported in this patch is valid. We are seeing that internally
>>> too. I believe I reported it in another thread (within the past month).
>>>
>>> However, the original patch fixes a real deadlock issue (I'm too tired
>>> to look it up now). We can revet the original, but it's going to bring
>>> back the original issue. I just want to make sure Prarit and Raphael
>>> realize this before proceeding.
>
> Hi Saravana,
>
> Thanks for your input. I went back to the code and confirmed my original
> statement about this patch.
>
> Note: in a previous email I erroneously wrote "buffer->mutex" when I should
> have identified the lock as sysfs_mutex. Sorry 'bout that, and apologies
> for any confusion that may have caused.
>
> From my commit message:
>
> "In any case, the current linux.git code no longer can reproduce the original
> failure; the locking in the sysfs release code has changed."
>
> The original patch attempted to fix this deadlock:
>
> A cpufreq driver on a file read did:
>
> -> #0 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
> [<c0055253>] __lock_acquire+0xef3/0x13dc
> [<c0055a79>] lock_acquire+0x61/0xbc
> [<c03ee1f5>] down_read+0x25/0x30
> [<c02f6179>] lock_policy_rwsem_read+0x25/0x34
> [<c02f6edd>] show+0x21/0x58
> [<c00f9c0f>] sysfs_read_file+0x67/0xcc
> [<c00b40a7>] vfs_read+0x63/0xd8
> [<c00b41fb>] sys_read+0x2f/0x50
> [<c000cdc1>] ret_fast_syscall+0x1/0x52
>
> lock(s_active#41) [ which is actually the acquisition of sysfs_mutex ]
> lock(&per_cpu(cpu_policy_rwsem, cpu));
>
> and on the governor switch (notably the EXIT of the existing governor), the
> opposite occurs
>
> -> #1 (s_active#41){++++.+}:
> [<c0055a79>] lock_acquire+0x61/0xbc
> [<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
> [<c00f9819>] sysfs_hash_and_remove+0x35/0x64
> [<c00fbe6f>] remove_files.isra.0+0x1b/0x24
> [<c00fbea5>] sysfs_remove_group+0x2d/0xa8
> [<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
> [<c02f61df>] __cpufreq_governor+0x2b/0x8c
> [<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
> [<c02f6b75>] store_scaling_governor+0x61/0x100
> [<c02f6f4d>] store+0x39/0x60
> [<c00f9b81>] sysfs_write_file+0xed/0x114
> [<c00b3fd1>] vfs_write+0x65/0xd8
> [<c00b424b>] sys_write+0x2f/0x50
> [<c000cdc1>] ret_fast_syscall+0x1/0x52
>
>
> lock(&per_cpu(cpu_policy_rwsem, cpu));
> lock(s_active#41) [ which is actually the acquisition of sysfs_mutex ]
>
> The sysfs_mutex no longer blocks in the sysfs path, and I have built with
> LOCKDEP on and off to confirm that I do not see any tracebacks or hangs. I
> tested this by doing a few reads of the current governor, and then doing a
> governor switch (to at least initiate the LOCKDEP warning). IIUC the traceback
> above that is the way to reproduce this LOCKDEP warning.

^^^ this should not be taken as 'I did only a few reads ...'. I tested quite
extensively across 15 different systems and added a read of the scaling_governor
files in my little reproducer.

P.

2014-07-31 10:23:26

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>
>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>
>>> [cut]
>>>
>>>>>> This patch effectively reverts commit 955ef483.
>>
>> The issue reported in this patch is valid. We are seeing that internally
>> too. I believe I reported it in another thread (within the past month).
>>
>> However, the original patch fixes a real deadlock issue (I'm too tired
>> to look it up now). We can revet the original, but it's going to bring
>> back the original issue. I just want to make sure Prarit and Raphael
>> realize this before proceeding.
>>
>> I do have plans for a proper fix for the mainline (not stable branches),
>> but plan to do that after the current set of suspend/hotplug patches go
>> through. The fix would be easier to make after that.
>>
>>>>>
>>>>> OK, I'm convinced by this.
>>>>>
>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>
>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>>>> adding [email protected] though ...

Rafael, let me (again) re-write the patch description. I think Saravana has
raised an important issue that I have not clearly identified why it is safe to
remove this code in my patch description. Also, I want to clearly identify the
appropriate -stable releases to push this out to.

I'll submit a [v3] later today or tomorrow.

P.

>
> Rafael
>

2014-07-31 16:18:15

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
>
> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
> > On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
> >> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
> >>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
> >>>>
> >>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> >>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
> >>>
> >>> [cut]
> >>>
> >>>>>> This patch effectively reverts commit 955ef483.
> >>
> >> The issue reported in this patch is valid. We are seeing that internally
> >> too. I believe I reported it in another thread (within the past month).
> >>
> >> However, the original patch fixes a real deadlock issue (I'm too tired
> >> to look it up now). We can revet the original, but it's going to bring
> >> back the original issue. I just want to make sure Prarit and Raphael
> >> realize this before proceeding.
> >>
> >> I do have plans for a proper fix for the mainline (not stable branches),
> >> but plan to do that after the current set of suspend/hotplug patches go
> >> through. The fix would be easier to make after that.
> >>
> >>>>>
> >>>>> OK, I'm convinced by this.
> >>>>>
> >>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
> >>>>
> >>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
> >>>> adding [email protected] though ...
>
> Rafael, let me (again) re-write the patch description. I think Saravana has
> raised an important issue that I have not clearly identified why it is safe to
> remove this code in my patch description. Also, I want to clearly identify the
> appropriate -stable releases to push this out to.
>
> I'll submit a [v3] later today or tomorrow.

In any case that's too late for 3.16 final, unless there's an -rc8.

Thanks for doing that work!

Rafael

2014-07-31 17:57:38

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 12:36 PM, Rafael J. Wysocki wrote:
> On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
>>
>> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
>>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>>>
>>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>>>
>>>>> [cut]
>>>>>
>>>>>>>> This patch effectively reverts commit 955ef483.
>>>>
>>>> The issue reported in this patch is valid. We are seeing that internally
>>>> too. I believe I reported it in another thread (within the past month).
>>>>
>>>> However, the original patch fixes a real deadlock issue (I'm too tired
>>>> to look it up now). We can revet the original, but it's going to bring
>>>> back the original issue. I just want to make sure Prarit and Raphael
>>>> realize this before proceeding.
>>>>
>>>> I do have plans for a proper fix for the mainline (not stable branches),
>>>> but plan to do that after the current set of suspend/hotplug patches go
>>>> through. The fix would be easier to make after that.
>>>>
>>>>>>>
>>>>>>> OK, I'm convinced by this.
>>>>>>>
>>>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>>>
>>>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>>>>>> adding [email protected] though ...
>>
>> Rafael, let me (again) re-write the patch description. I think Saravana has
>> raised an important issue that I have not clearly identified why it is safe to
>> remove this code in my patch description. Also, I want to clearly identify the
>> appropriate -stable releases to push this out to.
>>
>> I'll submit a [v3] later today or tomorrow.
>
> In any case that's too late for 3.16 final, unless there's an -rc8.
>
> Thanks for doing that work!

Ugh ... I tried this (yet another) large system and hit another panic :(.

I'm investigating now, and I'm hoping this is just something "new".

P.

>
> Rafael
>

2014-07-31 18:19:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On Thursday, July 31, 2014 01:57:29 PM Prarit Bhargava wrote:
>
> On 07/31/2014 12:36 PM, Rafael J. Wysocki wrote:
> > On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
> >>
> >> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
> >>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
> >>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
> >>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
> >>>>>>
> >>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> >>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
> >>>>>
> >>>>> [cut]
> >>>>>
> >>>>>>>> This patch effectively reverts commit 955ef483.
> >>>>
> >>>> The issue reported in this patch is valid. We are seeing that internally
> >>>> too. I believe I reported it in another thread (within the past month).
> >>>>
> >>>> However, the original patch fixes a real deadlock issue (I'm too tired
> >>>> to look it up now). We can revet the original, but it's going to bring
> >>>> back the original issue. I just want to make sure Prarit and Raphael
> >>>> realize this before proceeding.
> >>>>
> >>>> I do have plans for a proper fix for the mainline (not stable branches),
> >>>> but plan to do that after the current set of suspend/hotplug patches go
> >>>> through. The fix would be easier to make after that.
> >>>>
> >>>>>>>
> >>>>>>> OK, I'm convinced by this.
> >>>>>>>
> >>>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
> >>>>>>
> >>>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
> >>>>>> adding [email protected] though ...
> >>
> >> Rafael, let me (again) re-write the patch description. I think Saravana has
> >> raised an important issue that I have not clearly identified why it is safe to
> >> remove this code in my patch description. Also, I want to clearly identify the
> >> appropriate -stable releases to push this out to.
> >>
> >> I'll submit a [v3] later today or tomorrow.
> >
> > In any case that's too late for 3.16 final, unless there's an -rc8.
> >
> > Thanks for doing that work!
>
> Ugh ... I tried this (yet another) large system and hit another panic :(.
>
> I'm investigating now, and I'm hoping this is just something "new".

Well, I've applied your patch as is and I can push it to Linus.

However, if you want to update the changelog, I'll not do that, but in that
case the patch will have to wait for the next week.

Rafael

2014-07-31 18:26:21

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 02:38 PM, Rafael J. Wysocki wrote:
> On Thursday, July 31, 2014 01:57:29 PM Prarit Bhargava wrote:
>>
>> On 07/31/2014 12:36 PM, Rafael J. Wysocki wrote:
>>> On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
>>>>
>>>> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
>>>>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>>>>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>>>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>>>>>
>>>>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>>>>>
>>>>>>> [cut]
>>>>>>>
>>>>>>>>>> This patch effectively reverts commit 955ef483.
>>>>>>
>>>>>> The issue reported in this patch is valid. We are seeing that internally
>>>>>> too. I believe I reported it in another thread (within the past month).
>>>>>>
>>>>>> However, the original patch fixes a real deadlock issue (I'm too tired
>>>>>> to look it up now). We can revet the original, but it's going to bring
>>>>>> back the original issue. I just want to make sure Prarit and Raphael
>>>>>> realize this before proceeding.
>>>>>>
>>>>>> I do have plans for a proper fix for the mainline (not stable branches),
>>>>>> but plan to do that after the current set of suspend/hotplug patches go
>>>>>> through. The fix would be easier to make after that.
>>>>>>
>>>>>>>>>
>>>>>>>>> OK, I'm convinced by this.
>>>>>>>>>
>>>>>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>>>>>
>>>>>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>>>>>>>> adding [email protected] though ...
>>>>
>>>> Rafael, let me (again) re-write the patch description. I think Saravana has
>>>> raised an important issue that I have not clearly identified why it is safe to
>>>> remove this code in my patch description. Also, I want to clearly identify the
>>>> appropriate -stable releases to push this out to.
>>>>
>>>> I'll submit a [v3] later today or tomorrow.
>>>
>>> In any case that's too late for 3.16 final, unless there's an -rc8.
>>>
>>> Thanks for doing that work!
>>
>> Ugh ... I tried this (yet another) large system and hit another panic :(.
>>
>> I'm investigating now, and I'm hoping this is just something "new".
>
> Well, I've applied your patch as is and I can push it to Linus.
>
> However, if you want to update the changelog, I'll not do that, but in that
> case the patch will have to wait for the next week.

Rafael, please let it wait for next week. I _need_ to make sure this is correct
and I'd rather not pushed something half-done.

Thanks,

P.

>
> Rafael
>

2014-07-31 20:24:47

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/31/2014 11:26 AM, Prarit Bhargava wrote:
>
>
> On 07/31/2014 02:38 PM, Rafael J. Wysocki wrote:
>> On Thursday, July 31, 2014 01:57:29 PM Prarit Bhargava wrote:
>>>
>>> On 07/31/2014 12:36 PM, Rafael J. Wysocki wrote:
>>>> On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
>>>>>
>>>>> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
>>>>>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>>>>>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>>>>>>
>>>>>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>>>>>>
>>>>>>>> [cut]
>>>>>>>>
>>>>>>>>>>> This patch effectively reverts commit 955ef483.
>>>>>>>
>>>>>>> The issue reported in this patch is valid. We are seeing that internally
>>>>>>> too. I believe I reported it in another thread (within the past month).
>>>>>>>
>>>>>>> However, the original patch fixes a real deadlock issue (I'm too tired
>>>>>>> to look it up now). We can revet the original, but it's going to bring
>>>>>>> back the original issue. I just want to make sure Prarit and Raphael
>>>>>>> realize this before proceeding.
>>>>>>>
>>>>>>> I do have plans for a proper fix for the mainline (not stable branches),
>>>>>>> but plan to do that after the current set of suspend/hotplug patches go
>>>>>>> through. The fix would be easier to make after that.
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> OK, I'm convinced by this.
>>>>>>>>>>
>>>>>>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>>>>>>
>>>>>>>>> Rafael, I think that is a good idea. I'm not sure what the protocol is for
>>>>>>>>> adding [email protected] though ...
>>>>>
>>>>> Rafael, let me (again) re-write the patch description. I think Saravana has
>>>>> raised an important issue that I have not clearly identified why it is safe to
>>>>> remove this code in my patch description. Also, I want to clearly identify the
>>>>> appropriate -stable releases to push this out to.
>>>>>
>>>>> I'll submit a [v3] later today or tomorrow.
>>>>
>>>> In any case that's too late for 3.16 final, unless there's an -rc8.
>>>>
>>>> Thanks for doing that work!
>>>
>>> Ugh ... I tried this (yet another) large system and hit another panic :(.
>>>
>>> I'm investigating now, and I'm hoping this is just something "new".
>>
>> Well, I've applied your patch as is and I can push it to Linus.
>>
>> However, if you want to update the changelog, I'll not do that, but in that
>> case the patch will have to wait for the next week.
>
> Rafael, please let it wait for next week. I _need_ to make sure this is correct
> and I'd rather not pushed something half-done.
>

Prarit,

I'm not an expert on sysfs locking, but I would think the specific sysfs
lock would depend on the file/attribute group. So, can you please try to
hotplug a core in/out (to trigger the POLICY_EXIT) and then read a sysfs
file exported by the governor? scaling_governor doesn't cut it since
that file is not removed on policy exit event to governor. If it's
ondemand, try reading/write it's sampling rate file.

The main problem here is upon POLICY_EXIT to the governor, the governor
tries to remove its sysfs file. So, if you have the policy lock held
while sending POLICY_EXIT to the governor, you'll cause the:
lock policy
lock sysfs

But trying to read the same file would cause:
lock sysfs
lock policy

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-07-31 20:30:11

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 04:24 PM, Saravana Kannan wrote:
> On 07/31/2014 11:26 AM, Prarit Bhargava wrote:
>>
>>
>> On 07/31/2014 02:38 PM, Rafael J. Wysocki wrote:
>>> On Thursday, July 31, 2014 01:57:29 PM Prarit Bhargava wrote:
>>>>
>>>> On 07/31/2014 12:36 PM, Rafael J. Wysocki wrote:
>>>>> On Thursday, July 31, 2014 06:23:18 AM Prarit Bhargava wrote:
>>>>>>
>>>>>> On 07/30/2014 10:16 PM, Rafael J. Wysocki wrote:
>>>>>>> On Wednesday, July 30, 2014 06:36:00 PM Saravana Kannan wrote:
>>>>>>>> On 07/30/2014 02:40 PM, Rafael J. Wysocki wrote:
>>>>>>>>> On Wednesday, July 30, 2014 10:18:25 AM Prarit Bhargava wrote:
>>>>>>>>>>
>>>>>>>>>> On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
>>>>>>>>>>> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>>>>>>>>>
>>>>>>>>> [cut]
>>>>>>>>>
>>>>>>>>>>>> This patch effectively reverts commit 955ef483.
>>>>>>>>
>>>>>>>> The issue reported in this patch is valid. We are seeing that internally
>>>>>>>> too. I believe I reported it in another thread (within the past month).
>>>>>>>>
>>>>>>>> However, the original patch fixes a real deadlock issue (I'm too tired
>>>>>>>> to look it up now). We can revet the original, but it's going to bring
>>>>>>>> back the original issue. I just want to make sure Prarit and Raphael
>>>>>>>> realize this before proceeding.
>>>>>>>>
>>>>>>>> I do have plans for a proper fix for the mainline (not stable branches),
>>>>>>>> but plan to do that after the current set of suspend/hotplug patches go
>>>>>>>> through. The fix would be easier to make after that.
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> OK, I'm convinced by this.
>>>>>>>>>>>
>>>>>>>>>>> I suppose we should push it for -stable from 3.10 through 3.15.x, right?
>>>>>>>>>>
>>>>>>>>>> Rafael, I think that is a good idea. I'm not sure what the protocol
>>>>>>>>>> is for
>>>>>>>>>> adding [email protected] though ...
>>>>>>
>>>>>> Rafael, let me (again) re-write the patch description. I think Saravana has
>>>>>> raised an important issue that I have not clearly identified why it is
>>>>>> safe to
>>>>>> remove this code in my patch description. Also, I want to clearly
>>>>>> identify the
>>>>>> appropriate -stable releases to push this out to.
>>>>>>
>>>>>> I'll submit a [v3] later today or tomorrow.
>>>>>
>>>>> In any case that's too late for 3.16 final, unless there's an -rc8.
>>>>>
>>>>> Thanks for doing that work!
>>>>
>>>> Ugh ... I tried this (yet another) large system and hit another panic :(.
>>>>
>>>> I'm investigating now, and I'm hoping this is just something "new".
>>>
>>> Well, I've applied your patch as is and I can push it to Linus.
>>>
>>> However, if you want to update the changelog, I'll not do that, but in that
>>> case the patch will have to wait for the next week.
>>
>> Rafael, please let it wait for next week. I _need_ to make sure this is correct
>> and I'd rather not pushed something half-done.
>>
>
> Prarit,
>
> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
> would depend on the file/attribute group. So, can you please try to hotplug a
> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file exported by
> the governor? scaling_governor doesn't cut it since that file is not removed on
> policy exit event to governor. If it's ondemand, try reading/write it's sampling
> rate file.

Thanks Saravana -- will do. I will get back to you shortly on this.

P.

>
> -Saravana
>

2014-07-31 20:38:54

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/31/2014 01:30 PM, Prarit Bhargava wrote:
>
>
> On 07/31/2014 04:24 PM, Saravana Kannan wrote:
>>
>> Prarit,
>>
>> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
>> would depend on the file/attribute group. So, can you please try to hotplug a
>> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file exported by
>> the governor? scaling_governor doesn't cut it since that file is not removed on
>> policy exit event to governor. If it's ondemand, try reading/write it's sampling
>> rate file.
>
> Thanks Saravana -- will do. I will get back to you shortly on this.
>

Thanks. Btw, in case you weren't already aware of it. You'll have to
hoplug out all the CPUs in a cluster to trigger a POLICY_EXIT for that
cluster/policy.

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-07-31 21:08:18

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 04:38 PM, Saravana Kannan wrote:
> On 07/31/2014 01:30 PM, Prarit Bhargava wrote:
>>
>>
>> On 07/31/2014 04:24 PM, Saravana Kannan wrote:
>>>
>>> Prarit,
>>>
>>> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
>>> would depend on the file/attribute group. So, can you please try to hotplug a
>>> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file exported by
>>> the governor? scaling_governor doesn't cut it since that file is not removed on
>>> policy exit event to governor. If it's ondemand, try reading/write it's sampling
>>> rate file.
>>
>> Thanks Saravana -- will do. I will get back to you shortly on this.
>>
>
> Thanks. Btw, in case you weren't already aware of it. You'll have to hoplug out
> all the CPUs in a cluster to trigger a POLICY_EXIT for that cluster/policy.

Yep -- the affected_cpus file should show all the cpus in the policy IIRC. One
of the systems I have has 1 cpu/policy and has 48 threads so the POLICY_EXIT is
called.

I'll put something like

while [1];
do
echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
echo 20000 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
echo 0 > /sys/devices/system/cpu/cpu1/online
sleep 1
echo 1 > /sys/devices/system/cpu/cpu1/online
sleep 1
done

and let it run unless you can think of something else. FWIW, manually trying
that yields no problems AFAICT. The sleeps are just me being paranoid and
keeping things in sequence.

P.
>
> -Saravana
>

2014-07-31 22:14:04

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/31/2014 02:08 PM, Prarit Bhargava wrote:
>
>
> On 07/31/2014 04:38 PM, Saravana Kannan wrote:
>> On 07/31/2014 01:30 PM, Prarit Bhargava wrote:
>>>
>>>
>>> On 07/31/2014 04:24 PM, Saravana Kannan wrote:
>>>>
>>>> Prarit,
>>>>
>>>> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
>>>> would depend on the file/attribute group. So, can you please try to hotplug a
>>>> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file exported by
>>>> the governor? scaling_governor doesn't cut it since that file is not removed on
>>>> policy exit event to governor. If it's ondemand, try reading/write it's sampling
>>>> rate file.
>>>
>>> Thanks Saravana -- will do. I will get back to you shortly on this.
>>>
>>
>> Thanks. Btw, in case you weren't already aware of it. You'll have to hoplug out
>> all the CPUs in a cluster to trigger a POLICY_EXIT for that cluster/policy.
>
> Yep -- the affected_cpus file should show all the cpus in the policy IIRC. One
> of the systems I have has 1 cpu/policy and has 48 threads so the POLICY_EXIT is
> called.
>
> I'll put something like
>
> while [1];
> do
> echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
> echo 20000 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
> echo 0 > /sys/devices/system/cpu/cpu1/online
> sleep 1
> echo 1 > /sys/devices/system/cpu/cpu1/online
> sleep 1
> done
>

The actual race can only happen with 2 threads. I'm just trying to
trigger a lockdep warning here.

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-07-31 22:58:57

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 07/31/2014 06:13 PM, Saravana Kannan wrote:
> On 07/31/2014 02:08 PM, Prarit Bhargava wrote:
>>
>>
>> On 07/31/2014 04:38 PM, Saravana Kannan wrote:
>>> On 07/31/2014 01:30 PM, Prarit Bhargava wrote:
>>>>
>>>>
>>>> On 07/31/2014 04:24 PM, Saravana Kannan wrote:
>>>>>
>>>>> Prarit,
>>>>>
>>>>> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
>>>>> would depend on the file/attribute group. So, can you please try to hotplug a
>>>>> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file
>>>>> exported by
>>>>> the governor? scaling_governor doesn't cut it since that file is not
>>>>> removed on
>>>>> policy exit event to governor. If it's ondemand, try reading/write it's
>>>>> sampling
>>>>> rate file.
>>>>
>>>> Thanks Saravana -- will do. I will get back to you shortly on this.
>>>>
>>>
>>> Thanks. Btw, in case you weren't already aware of it. You'll have to hoplug out
>>> all the CPUs in a cluster to trigger a POLICY_EXIT for that cluster/policy.
>>
>> Yep -- the affected_cpus file should show all the cpus in the policy IIRC. One
>> of the systems I have has 1 cpu/policy and has 48 threads so the POLICY_EXIT is
>> called.
>>
>> I'll put something like
>>
>> while [1];
>> do
>> echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
>> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>> echo 20000 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>> echo 0 > /sys/devices/system/cpu/cpu1/online
>> sleep 1
>> echo 1 > /sys/devices/system/cpu/cpu1/online
>> sleep 1
>> done
>>
>
> The actual race can only happen with 2 threads. I'm just trying to trigger a
> lockdep warning here.

I ran the above in two separate terminals with cpuset -c 0 and cpuset -c 1 to
multi-thread it all. No deadlock or LOCKDEP trace after about 1/2 hour, so I
think we're in the clear on that concern.

P.

2014-08-01 00:55:49

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 07/31/2014 03:58 PM, Prarit Bhargava wrote:
>
>
> On 07/31/2014 06:13 PM, Saravana Kannan wrote:
>> On 07/31/2014 02:08 PM, Prarit Bhargava wrote:
>>>
>>>
>>> On 07/31/2014 04:38 PM, Saravana Kannan wrote:
>>>> On 07/31/2014 01:30 PM, Prarit Bhargava wrote:
>>>>>
>>>>>
>>>>> On 07/31/2014 04:24 PM, Saravana Kannan wrote:
>>>>>>
>>>>>> Prarit,
>>>>>>
>>>>>> I'm not an expert on sysfs locking, but I would think the specific sysfs lock
>>>>>> would depend on the file/attribute group. So, can you please try to hotplug a
>>>>>> core in/out (to trigger the POLICY_EXIT) and then read a sysfs file
>>>>>> exported by
>>>>>> the governor? scaling_governor doesn't cut it since that file is not
>>>>>> removed on
>>>>>> policy exit event to governor. If it's ondemand, try reading/write it's
>>>>>> sampling
>>>>>> rate file.
>>>>>
>>>>> Thanks Saravana -- will do. I will get back to you shortly on this.
>>>>>
>>>>
>>>> Thanks. Btw, in case you weren't already aware of it. You'll have to hoplug out
>>>> all the CPUs in a cluster to trigger a POLICY_EXIT for that cluster/policy.
>>>
>>> Yep -- the affected_cpus file should show all the cpus in the policy IIRC. One
>>> of the systems I have has 1 cpu/policy and has 48 threads so the POLICY_EXIT is
>>> called.
>>>
>>> I'll put something like
>>>
>>> while [1];
>>> do
>>> echo ondemand > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
>>> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>>> echo 20000 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>>> cat /sys/devices/system/cpu/cpufreq/ondemand/sampling_rate
>>> echo 0 > /sys/devices/system/cpu/cpu1/online
>>> sleep 1
>>> echo 1 > /sys/devices/system/cpu/cpu1/online
>>> sleep 1
>>> done
>>>
>>
>> The actual race can only happen with 2 threads. I'm just trying to trigger a
>> lockdep warning here.
>
> I ran the above in two separate terminals with cpuset -c 0 and cpuset -c 1 to
> multi-thread it all. No deadlock or LOCKDEP trace after about 1/2 hour, so I
> think we're in the clear on that concern.
>

I wasn't convinced. So, I took some help from Stephen to test it.

It's been a while, so I didn't remember the original issue clearly when
I gave you some test suggestions. Now that I looked at the code more
closely, I have a proper way to reproduce the original issue.

Nack for this patch for 2 reasons:
1. You seem to have accidentally removed a GOV_STOP in your patch. We
definitely can't do that. This broke changing governors and that's why
your patch didn't cause any issues. Because all your governor echos were
failing.
2. When we fixed that and actually tried a proper test (not the one I
gave you), we reproduced the original issue.

To reproduce original issue:
Preconditions:
* lockdep is enabled
* governor per policy is enabled

Steps:
1. Set governor to ondemand.
2. Cat one of the ondemand sysfs files.
3. Change governor to conservative.

When you do that, there's an AB, BA dead lock issue with one thread
trying to cat a governor sysfs file and another thread trying to change
governors.

-Saravana

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

2014-10-13 10:43:52

by Viresh Kumar

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

On 1 August 2014 22:48, Stephen Boyd <[email protected]> wrote:
> On 08/01/14 03:27, Prarit Bhargava wrote:
>>
>> Can you send me the test and the trace of the deadlock? I'm not creating it with:
>>
>
> This was with conservative as the default, and switching to ondemand
>
> # cd /sys/devices/system/cpu/cpu2/cpufreq
> # ls
> affected_cpus scaling_available_governors
> conservative scaling_cur_freq
> cpuinfo_cur_freq scaling_driver
> cpuinfo_max_freq scaling_governor
> cpuinfo_min_freq scaling_max_freq
> cpuinfo_transition_latency scaling_min_freq
> related_cpus scaling_setspeed
> scaling_available_frequencies stats
> # cat conservative/down_threshold
> 20
> # echo ondemand > scaling_governor
>
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.16.0-rc3-00039-ge1e38f124d87 #47 Not tainted
> -------------------------------------------------------
> sh/75 is trying to acquire lock:
> (s_active#9){++++..}, at: [<c0358a94>] kernfs_remove_by_name_ns+0x3c/0x84

Can you please retry this on mainline? I wasn't able to reproduce it
now over 3.17.
I am trying this on Exynos b.L implementation..

2014-10-13 11:52:18

by Prarit Bhargava

[permalink] [raw]
Subject: Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]



On 10/13/2014 06:43 AM, Viresh Kumar wrote:
> On 1 August 2014 22:48, Stephen Boyd <[email protected]> wrote:
>> On 08/01/14 03:27, Prarit Bhargava wrote:
>>>
>>> Can you send me the test and the trace of the deadlock? I'm not creating it with:
>>>
>>
>> This was with conservative as the default, and switching to ondemand
>>
>> # cd /sys/devices/system/cpu/cpu2/cpufreq
>> # ls
>> affected_cpus scaling_available_governors
>> conservative scaling_cur_freq
>> cpuinfo_cur_freq scaling_driver
>> cpuinfo_max_freq scaling_governor
>> cpuinfo_min_freq scaling_max_freq
>> cpuinfo_transition_latency scaling_min_freq
>> related_cpus scaling_setspeed
>> scaling_available_frequencies stats
>> # cat conservative/down_threshold
>> 20
>> # echo ondemand > scaling_governor
>>
>> ======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 3.16.0-rc3-00039-ge1e38f124d87 #47 Not tainted
>> -------------------------------------------------------
>> sh/75 is trying to acquire lock:
>> (s_active#9){++++..}, at: [<c0358a94>] kernfs_remove_by_name_ns+0x3c/0x84
>
> Can you please retry this on mainline? I wasn't able to reproduce it
> now over 3.17.
> I am trying this on Exynos b.L implementation..

I have 100% reproducibility on latest mainline.

Viresh, please see my next post on the locking issues in cpufreq.

P.

>